12
ARTICLE OPEN ACCESS Accuracy of identifying incident stroke cases from linked health care data in UK Biobank Kristiina Rannikm¨ ae, PhD,* Kenneth Ngoh,* Kathryn Bush,* Rustam Al-Shahi Salman, PhD, Fergus Doubal, PhD, Robin Flaig, David E. Henshall, Aidan Hutchison, John Nolan, Scott Osborne, Neshika Samarasekera, PhD, Christian Schnier, PhD, Will Whiteley, PhD, Tim Wilkinson, Kirsty Wilson, Rebecca Woodfield, PhD, Qiuli Zhang, PhD, Naomi Allen, PhD, and Cathie L.M. Sudlow, PhD Neurology ® 2020;95:e697-e707. doi:10.1212/WNL.0000000000009924 Correspondence Dr. Rannikm¨ ae kristiina.rannikmae@ ed.ac.uk Abstract Objective In UK Biobank (UKB), a large population-based prospective study, cases of many diseases are ascertained through linkage to routinely collected, coded national health datasets. We assessed the accuracy of these for identifying incident strokes. Methods In a regional UKB subpopulation (n = 17,249), we identied all participants with 1 code signifying a rst stroke after recruitment (incident stroke-coded cases) in linked hospital admission, primary care, or death record data. Stroke physicians reviewed their full electronic patient records (EPRs) and generated reference standard diagnoses. We evaluated the number and proportion of cases that were true-positives (i.e., positive predictive value [PPV]) for all codes combined and by code source and type. Results Of 232 incident stroke-coded cases, 97% had EPR information available. Data sources were 30% hospital admission only, 39% primary care only, 28% hospital and primary care, and 3% death records only. While 42% of cases were coded as unspecied stroke type, review of EPRs enabled a pathologic type to be assigned in >99%. PPVs (95% condence intervals) were 79% (73%84%) for any stroke (89% for hospital admission codes, 80% for primary care codes) and 83% (74%90%) for ischemic stroke. PPVs for small numbers of death record and hemorrhagic stroke codes were low but imprecise. Conclusions Stroke and ischemic stroke cases in UKB can be ascertained through linked health datasets with sucient accuracy for many research studies. Further work is needed to understand the ac- curacy of death record and hemorrhagic stroke codes and to develop scalable approaches for better identifying stroke types. RELATED ARTICLE Comment Can we trust administrative codes for identication of stroke cases? Page 250 *These authors contributed equally to this work. From the Centre for Medical Informatics, Usher Institute of Population Health Sciences and Informatics (K.R., K.B., R.F., A.H., J.N., C.S., T.W., K.W., Q.Z., C.L.M.S.), and Centre for Clinical Brain Sciences (R.A.-S.S., F.D., N.S., W.W., R.W.), University of Edinburgh; UK Biobank (K.R., K.B., R.F., A.H., J.N., C.S., T.W., K.W., R.W., Q.Z., N.A., C.L.M.S.), Stockport; University of Edinburgh Medical School (K.N., D.E.H., S.O.); and Nuffield Department of Population Health (N.A.), University of Oxford, UK. Go to Neurology.org/N for full disclosures. Funding information and disclosures deemed relevant by the authors, if any, are provided at the end of the article. The Article Processing Charge was funded by UK Biobank. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (CC BY), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Copyright © 2020 The Author(s). Published by Wolters Kluwer Health, Inc. on behalf of the American Academy of Neurology. e697

Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

ARTICLE OPEN ACCESS

Accuracy of identifying incident stroke cases fromlinked health care data in UK BiobankKristiina Rannikmae PhD Kenneth Ngoh Kathryn Bush Rustam Al-Shahi Salman PhD

Fergus Doubal PhD Robin Flaig David E Henshall Aidan Hutchison John Nolan Scott Osborne

Neshika Samarasekera PhD Christian Schnier PhD Will Whiteley PhD Tim Wilkinson Kirsty Wilson

Rebecca Woodfield PhD Qiuli Zhang PhD Naomi Allen PhD and Cathie LM Sudlow PhD

Neurologyreg 202095e697-e707 doi101212WNL0000000000009924

Correspondence

Dr Rannikmae

kristiinarannikmae

edacuk

AbstractObjectiveIn UK Biobank (UKB) a large population-based prospective study cases of many diseases areascertained through linkage to routinely collected coded national health datasets We assessedthe accuracy of these for identifying incident strokes

MethodsIn a regional UKB subpopulation (n = 17249) we identified all participants with ge1 codesignifying a first stroke after recruitment (incident stroke-coded cases) in linked hospitaladmission primary care or death record data Stroke physicians reviewed their full electronicpatient records (EPRs) and generated reference standard diagnoses We evaluated the numberand proportion of cases that were true-positives (ie positive predictive value [PPV]) for allcodes combined and by code source and type

ResultsOf 232 incident stroke-coded cases 97 had EPR information available Data sources were30 hospital admission only 39 primary care only 28 hospital and primary care and 3death records only While 42 of cases were coded as unspecified stroke type review of EPRsenabled a pathologic type to be assigned in gt99 PPVs (95 confidence intervals) were 79(73ndash84) for any stroke (89 for hospital admission codes 80 for primary care codes) and83 (74ndash90) for ischemic stroke PPVs for small numbers of death record and hemorrhagicstroke codes were low but imprecise

ConclusionsStroke and ischemic stroke cases in UKB can be ascertained through linked health datasets withsufficient accuracy for many research studies Further work is needed to understand the ac-curacy of death record and hemorrhagic stroke codes and to develop scalable approaches forbetter identifying stroke types

RELATED ARTICLE

CommentCan we trust administrativecodes for identification ofstroke cases

Page 250

These authors contributed equally to this work

From the Centre for Medical Informatics Usher Institute of Population Health Sciences and Informatics (KR KB RF AH JN CS TW KW QZ CLMS) and Centre forClinical Brain Sciences (RA-SS FD NS WW RW) University of Edinburgh UK Biobank (KR KB RF AH JN CS TW KW RW QZ NA CLMS) Stockport Universityof Edinburgh Medical School (KN DEH SO) and Nuffield Department of Population Health (NA) University of Oxford UK

Go to NeurologyorgN for full disclosures Funding information and disclosures deemed relevant by the authors if any are provided at the end of the article

The Article Processing Charge was funded by UK Biobank

This is an open access article distributed under the terms of the Creative Commons Attribution License 40 (CC BY) which permits unrestricted use distribution and reproduction in anymedium provided the original work is properly cited

Copyright copy 2020 The Author(s) Published by Wolters Kluwer Health Inc on behalf of the American Academy of Neurology e697

Stroke is the second commonest cause of death worldwideand a major global cause of disability1 Very large prospectivepopulation-based studies are needed to improve our un-derstanding of its risk factors and causal associations2

UK Biobank (UKB) is a prospective population-based cohortstudy with extensive phenotypic and genotypic information ongt500000 participants from England Scotland and Wales(ukbiobankacuk) It is an open-access resource established tofacilitate research into the determinants of a wide range of healthoutcomes particularly those of relevance inmiddle and older age3

A cost-effective way of following UKB participants for diseaseoutcomes is via linkages to routinely collected coded nationaladministrative health datasets UKB receives regularly upda-ted linkages to national hospital admission and death recorddata for all participants and to primary care data for a largeand increasing subset However appropriate use of these datain research studies requires understanding of their accuracyIn a systematic review of published studies we found thatstroke-specific diagnosis codes in hospital admission anddeath record data generally have good accuracy However thestudies varied widely in their settings and methodology withvery limited data about the accuracy of primary care codes orthe effect of combining different data sources4

We therefore conducted a validation study to assess the ac-curacy of ascertainment of incident stroke cases in UKB vialinked coded national administrative health datasets (in-cluding primary care data) compared with diagnoses assignedfollowing adjudication by clinical experts with access to thefull free-text electronic patient records (EPRs)

MethodsStudy populationWe conducted the study in a subpopulation of 17249 UKBparticipants in the Lothian region of southeast Scotland all ofwhom are linked to national administrative health datasets (in-cluding hospital primary care and death record codes) Thisregion encompasses the city of Edinburgh where one of UKBrsquosrecruitment centers was located Within this subpopulation weidentified all those with at least one code in their linked health datathat indicated a stroke diagnosis after their recruitment to UKB

Stroke definitionWe defined stroke according to the WHO definition ldquorapidlydeveloping clinical signs of focal (or global) disturbance of ce-rebral function lasting more than 24 hours or leading to deathwith no apparent cause other than that of vascular originrdquo5

Code selection and sourcesWe selected and included relevant codes from hospital ad-mission primary care and death records up to the end ofSeptember 2015 the date at which data were complete for allsources at the time of this study

In the United Kingdom hospital admissions and death recordsare coded by specialized medical coders who use the In-ternational Classification of Diseases (ICD currently version10) coding system to assign a primary code to the main con-dition resulting in a particular admitted episode of care andsecondary codes to other conditions Primary care data arecoded by general practice clinical or administrative staff duringclinical encounters or on receipt of information from elsewhere(eg hospital inpatient and specialist outpatient settings)

Primary care data in the United Kingdom currently use theRead coding system (version 2 in Scotland during the timeperiod of relevance in this study)

We selected relevant stroke codes from the ICD-10 and Readversion 2 coding systems aiming to ascertain cases of stroke witha high positive predictive value (PPV) (ie to minimize theproportion of false-positives among the stroke cases) whileaiming to ascertain as many as possible of the true cases(ie optimizing sensitivity) Our code selections were informedby the results of a systematic review4 browsing of theTechnologyReference Data Update Distribution Service (isdhscicgovuktrud3userguestgroup0home) the Secure Anonymised In-formation Linkage databank cerebrovascular diseases dictionary6

the Quality Outcomes Framework indicator sets for stroke(wwwwalesnhsuk) and the CALIBER online data portal(wwwcaliberresearchorg) as well as manual review of code listsby experts (table e-1 dryadorg105061dryadw9ghx3fk0)

We only validated diagnoses for cases with their first incidentstroke defined as the first-ever-in-a-lifetime occurrence ofa WHO-defined stroke according to codes from linked na-tional administrative datasets We excluded participants whoself-reported a stroke at baseline recruitment or who hada stroke code in their linked health care data predating re-cruitment to UKB This was because (1) incident diseasecases (ie new-onset disease arising in those without a historyof the relevant condition) are of most research interest ina prospective population-based study recruiting mainlyhealthy volunteers and (2) a large proportion of stroke casesidentified through codes or self-reported stroke status arisingprior to recruitment (prevalent cases) predated the EPRsystem used for validation We based our analyses on theearliest code(s) for each participant from any data source

GlossaryCI = confidence interval EPR = electronic patient record ICD = International Classification of Diseases ICH = intracerebralhemorrhage PPV = positive predictive value SAH = subarachnoid hemorrhage UKB = UK Biobank

e698 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

Process of assigning an expert diagnosisFor each included participant with a stroke code we extractedanonymized and created a single document (hereafter referredto as a ldquovignetterdquo) from all available relevant medical in-formation from the secondary care EPR system includingoutpatient clinic letters hospital discharge summaries andformal radiologic investigation reports (figure e-1 dryadorg105061dryadw9ghx3fk0) We considered information rele-vant if it was associated with a hospital admission leading to thecode or if it was within 2 months before or after the date ofa relevant primary care code A 2-month windowwas chosen toreflect local clinical practice during the time period of the studypatients in NHS Lothian with a suspected stroke or TIA notrequiring hospital admission would have been referred to theneurovascular specialist outpatient clinic where they wouldusually be seen within 1 week in keeping with nationalguidelines (the National Institute for Health and Care Excel-lence guidance wwwniceorgukguidancecg68 and theScottish Stroke Care Audit Standards strokeauditscotnhsukQualityScottish_Stroke_Care_Standardshtml)

Following this specialist clinic visit correspondence to thepatientrsquos general practitioner would generally appear on thesecondary care EPR systemwithin days to a fewweeks providingthe clinical information required to validate primary care codes

We developed an adjudication form which included a sum-mary for adjudicators of criteria for the diagnosis of stroke itspathologic types (ischemic stroke intracerebral hemorrhage[ICH] subarachnoid hemorrhage [SAH]) and subtypes(appendix e-1 dryadorg105061dryadw9ghx3fk0)

Six consultant stroke specialists (hereafter referred to asldquoadjudicatorsrdquo) completed adjudication forms based on theinformation in the EPR-derived vignettes They did not haveaccess to review the brain imaging themselves but formal scanreports were included in the vignettes One adjudicator (KR)completed the forms for all participants and 5 other adjudi-cators (NS CS RA-SS FD WW) independentlycompleted the forms (each for a subset) to allow assessmentof interadjudicator agreement All adjudicators were blindedto which specific codes had led to ascertainment of any caseand to each otherrsquos diagnoses The first adjudicatorrsquos di-agnoses were used as the reference standard for the primaryanalyses in this study

Data analyses

Participantsrsquo characteristicsWe summarized participantsrsquo sex and median age at re-cruitment and at the time the code was assigned

Code source and type identificationWe calculated the number and proportion of stroke-codedparticipants arising from each data source We also stratifiedthe codes into those specifying a stroke type (ischemic vs ICHvs SAH) vs those indicating an unspecified type of stroke and

calculated the numbers and proportions of these arising fromeach source We compared these findings with the sameanalyses for the larger subset of all UKB participants withlinkage to the relevant data sources in England Scotland andWales

Assessment of interadjudicator agreementWe used percent agreement and Cohen kappa (ĸ)7 to mea-sure interadjudicator agreement comparing one adjudicatorrsquos(KR) diagnoses with a second (one of NS CS RA-SSFD WW) for the following stroke vs not strokeTIA vsnot and in cases where both adjudicators agreed that thediagnosis was stroke ischemic stroke vs ICH vs SAH vs un-certain stroke type

Calculating code accuracyThe main measure of accuracy assessed in this study was PPV(the proportion of all stroke-coded cases ascertained that weretrue-positive cases) While we could not directly assess sensi-tivity we were able to assess the effects on both PPV and thenumber of true-positive cases (a higher number of true-positives indicates higher sensitivity) of different code combi-nation strategies for ascertaining cases We categorized thestroke-coded cases as true-or false-positives based on adjudi-catorsrsquo diagnoses We calculated the PPV and its 95 confi-dence interval (CI) using the Clopper Pearson exact methodconsidering both (1) only an adjudicator diagnosis of strokea true-positive and (2) an adjudicator diagnosis of stroke orTIA a true-positive To understand the contribution of eachsource of codes we calculated PPVs for each code sourceseparately (hospital admission vs primary care vs death recorddata) and for their combinations and explored the effect (bothon PPV and on true-positive case numbers) of including onlycodes in the primary position for hospital admission data

We also calculated PPVs for ischemic and hemorrhagic stroketypes and explored the effect (both on PPV and on true-positive case numbers) of restricting hospital data to codes inthe primary position and of treating all unspecified stroke typecodes as ischemic stroke codes

In further analyses we assessed administrative vs overall codeaccuracy and the effect on resulting PPVs as we hypothesizedthat this may explain some of the variability of the resultsamong previous validation studies Some validation studies arebased on the assumption that if the diagnostic code reflects thediagnosis mentioned somewhere in the EPR (often for ex-ample the discharge summary for a particular hospital admis-sion) then this is sufficient to confirm the accuracy of thecode4 We refer to this as the administrative accuracy ofthe code Clinicians will however appreciate that assessing theaccuracy of a diagnostic code is more nuanced A patientrsquos truediagnosis may emerge over time after encounters with severaldifferent clinicians with varying levels of expertise and withaccess to different amounts of relevant information hence theEPR can contain inconsistencies which may require specialistclinical knowledge to detect andor resolve Hence a specialist

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e699

physicianrsquos review of all relevant parts of the complete EPR willgive a more accurate picture of the true diagnosis and so of thecode accuracy which we refer to as the overall accuracy Weassessed administrative accuracy by having a primary carephysician in our team review the EPR-derived vignettes formentions of stroke diagnoses We assessed overall accuracy byhaving expert adjudicators (consultant level stroke specialists)review the EPR-derived vignettes and derive reference standarddiagnoses based on all available information and their clinicalinterpretation of this information (see Process of assigning anexpert diagnosis) We then assessed whether and by howmuch the administrative vs overall accuracy (PPV) differed forour selected stroke codes

We performed all statistical analyses in R (r-projectorg)

Further exploration of false-positive codesWe compared characteristics of participants with a true- vsa false-positive code and reviewed the vignettes of false-positive cases considering their alternative diagnoses andassessing possible reasons for them being assigned a strokecode Informed by these assessments we evaluated alternativecode combinations that we hypothesized may improve codeaccuracy

Stroke type and subtype distributionsWe further assessed the proportion of true-positive cases thatcould be assigned a pathologic stroke type (ischemic strokeICH SAH) and more detailed subtype by the stroke specialistadjudicator and their frequency distributions

Standard protocol approvals registrationsand patient consentsAll procedures performed in studies involving human partic-ipants were in accordance with the ethical standards of theinstitutional andor national research committee and with the1964 Helsinki declaration and its later amendments or com-parable ethical standards As part of the UKB recruitmentprocess informed consent was obtained from all individualparticipants included in the study

Data availabilityAnonymized summary data based on this study not publishedwithin this article can be shared (subject to approval fromUKB) on request from any qualified investigator unless lim-ited by ethical or legal restrictions Any bona fide researchercan access the UKB resource for research that aims to benefitthe publicrsquos health (see ukbiobankacuk)

ResultsStudy populationIn the subpopulation of 17249 participants 232 had a relevantstroke code occurring after recruitment to UKB Of these 225(97) had available information in their EPR to create a vi-gnette and were included in the study analyses (figure 1)

Participantsrsquo characteristicsOf the 225 stroke-coded cases 111 (49) were female Me-dian age at recruitment to UKB was 63 years (range 41ndash70

Figure 1 Selection of included UK Biobank (UKB) participants

GP = general practitioner NHS = National Health Service

e700 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

years) and at time of the stroke code was 67 years (range45ndash76 years) (table 1)

Code sources and typesOf the 225 stroke-coded cases 67 (30) received a code fromhospital admission data only 87 (39) from primary caredata only 64 (28) from both hospital admission and pri-mary care data 6 (3) from death records only and one fromboth hospital admission data and death records (table 2 andfigure e-2 dryadorg105061dryadw9ghx3fk0) For thelarger subset of all UKB participants with linkage to the rel-evant data sources these proportions were similar for Scot-land but a higher proportion of participants in England andWales had hospital-only codes (table 2)

As regards stroke types 131225 (58) of stroke-coded caseshad a stroke type-specific code (ischemic ICH or SAH)while the remaining 94 (42) had unspecified stroke codesThe proportion of cases with unspecified stroke type codeswas higher among those ascertained from primary care (105151 [70]) than from hospital admission data (38132[29]) and all death record codes were stroke-type specific(for these code-source level estimates a case would becounted ge1 if occurring in ge1 source) (table 3 and figure e-2dryadorg105061dryadw9ghx3fk0) Among the largersubset of all UKB participants with relevant linked data theproportion of cases with an unspecified stroke code wassimilar for Scotland but lower for England and Wales (19and 30 respectively) (table 3)

Table 1 Participant characteristics

All coded cases (n = 225) True-positive cases (n = 178) False-positive cases (n = 47)

Female 49 (111225) 47 (84178) 57 (2747)

Recruitment age y 63 (41ndash70) 63 (41ndash70) 64 (46ndash69)

Age at code y 67 (45ndash76) 67 (45ndash76) 67 (52ndash76)

Primary care code only 33 (75225) 26 (47178) 60 (2847)

Unspecified stroke code 42 (86225) 38 (68178) 38 (1847)

Hemorrhagic stroke code 22 (49225) 20 (35178) 30 (1447)

Code date lt2011 34 (76225) 33 (58178) 38 (1847)

Dead by end of follow-up (any cause) 8 (18225) 7 (13178) 11 (547)

Values are (n) or median (range)

Table 2 Code sources in the validation study subpopulation and in the larger subset of all UK Biobank (UKB) participantslinked to relevant health care datasets in Scotland England and Wales

UKB participants ScotlandLothian (n = 17249)

UKB participantsScotland alla (n = 31426)

UKB participantsEngland alla (n = 18494)

UKB participantsWales alla (n = 21346)

Total stroke codes 225 553 197 301

Hospital only codes 30 (67225) 27 (147553) 42 (82197) 43 (129301)

Primary care only codes 39 (87225) 44 (241553) 29 (58197) 39 (116301)

Death record only codes 3 (6225) 1 (7553) 0 (0197) 3 (10301)

Hospital and primary carecodes

28 (64225) 28 (156553) 29 (57197) 14 (43301)

Hospital and death recordcodes

0 (1225) 0 (2553) 0 (0197) 1 (2301)

Primary care and deathrecord codes

0 (0225) 0 (0553) 0 (0197) 0 (0301)

Hospital and primary careand death record codes

0 (0225) 0 (0553) 0 (0197) 0 (1301)

Values are (n)a We included UKB participants with linkage to hospital admissions data death records and primary care data (whereas around half of UKB participants arelinked to primary care data using Read V2 or V3 coding systems we included only practices using the Read V2 coding system for ease of comparison)

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e701

Interadjudicator agreementAll vignettes were independently reviewed by 2 adjudicators Forassigning a diagnosis of stroke vs not interadjudicator agreementwas 91 and Cohen kappa very good at 07 (95 CI 06ndash08)for strokeTIA vs not agreementwas 93 andCohen kappa verygood at 07 (95 CI 06ndash09) Agreement for assigning a stroketype (ischemic stroke vs ICH vs SAH vs uncertain stroke type)was 99 and Cohen kappa excellent at 098 (95 CI 09ndash1)

Code accuracyThe overall PPV of case ascertainment for all 225 stroke-coded cases from all sources combined was 79 (95 CI73ndash84) When broken down by code source PPV washighest for the 132 cases with hospital admission codes (8995 CI 82ndash94) and higher still when limiting analyses toprimary position hospital codes only (94 95 CI88ndash98) at the expense of losing a few (lt10) true-positive cases PPV for the 151 cases with primary care codeswas 80 (95 CI 72ndash86) Only 7 cases had death recordcodes with PPV 57 (95 CI 18ndash90) the wide CIsindicating limited precision of this estimate For cases withboth a hospital admission and a primary care code PPV wasvery high (97 95 CI 89ndash100) but only a third of thecases fell into this category (figure 2A)

As regards the accuracy of identifying stroke pathologic type(for these estimates a case would be counted in ge1 analysis if ithad ge1 unique stroke type code) PPV among the 88 partic-ipants with an ischemic stroke code was 83 (95 CI74ndash90) restricting to primary position codes (for thosecases from hospital admission data) increased this to 86 (95CI 76ndash93) with loss of lt10 of true-positive cases (figure2B) Given that ischemic stroke is the most common patho-logic stroke type and hence an unspecified stroke code is muchmore likely to signify an ischemic rather than a hemorrhagic

stroke case we calculated the PPV for ischemic stroke of the184 ischemic and unspecified stroke codes combined Whilethis resulted in a slightly lower PPV of 80 (95 CI73ndash85) it approximately doubled the number of true-positive ischemic stroke cases identified The proportion oftrue-positive ischemic stroke cases among all cases with anunspecified code was 77 Accuracy of hemorrhagic strokecodes was lower but the small numbers of coded cases resultedin limited precision PPV among 26 participants with an ICHcode was 42 (95 CI 23ndash63) and among 24 participantswith a SAH code was 71 (95 CI 49ndash87) Restricting toprimary position codes increased the PPV of both ICH andSAH codes without losing any true-positive cases

PPV when measuring only administrative accuracy was 93 de-creasing to 79 when measuring overall accuracy This demon-strates the importance of expert-led adjudication in validationstudies as ignoring it would lead to falsely inflated PPVs (figure 3)

Further exploration of false-positive codesThe 47 false-positive stroke-coded cases had a similar pro-portion of unspecified codes and median age (at date of re-cruitment or first stroke code) as true-positives False-positivecases were more likely to have a primary care code and to havedied by the end of follow-up and were slightly more likely tobe female to have a hemorrhagic stroke code and to have anearlier first stroke code date (table 1)

Alternative diagnoses for the 47 false-positives included TIA(1447 30) secondary intracranial bleed not due to SAH orICH (eg traumatic tumor-related bleeding into a cerebralinfarct) (1147 [23]) radiologic finding of a suspected oldvascular lesion (usually incidental) (847 [17]) and pos-sible or probable alternative diagnosis (eg migraine de-myelination seizure or another diagnosis) (1447 [30])

Table 3 Proportion of cases with unspecified vs specified stroke type codes in the validation study subpopulation and inthe larger subset of all UKBiobank (UKB) participants linked to relevant health care datasets in Scotland Englandand Wales

UKB participants ScotlandLothian (n = 17249)

UKB participantsScotland alla (n = 31426)

UKB participants Englandalla (n = 18494)

UKB participants Walesalla (n = 21346)

Total stroke codes 225 553 197 301

Unspecified codes inhospital data

29 (38132) 27 (81305) 7 (10139) 17 (30175)

Unspecified codes inprimary care

70 (105151) 64 (254397) 45 (52115) 50 (80160)

Unspecified codes indeath records

0 (07) 11 (19) na 54 (713)

Unspecified codestotalb

42 (94225) 40 (219553) 19 (37197) 30 (89301)

Values are (n)a We included UKB participants with linkage to hospital admissions data death records and primary care data (whereas around half of UKB participants arelinked to primary care data using Read V2 or V3 coding systems we included only practices using the Read V2 coding system for ease of comparison)b Stroke-coded cases with an unspecified stroke type code (note that if a case has codes from multiple sources including a specified stroke type code inhospital data and an unspecified stroke type code in primary care data or death record data they will count as having a specified code)

e702 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

Among the 15 participants with a false-positive ICH code 715had another primary stroke type plusmn secondary ICH 415intratumor bleed 315 traumatic or other intracranial bleedand 115 no obvious reason for an ICH code Among the 7participants with a false-positive SAH code 47 were traumaticintracranial bleed 27 another primary stroke type plusmn secondarySAH component and 17 asymptomatic aneurysm

Because performance for hemorrhagic stroke codes wassuboptimal having observed that the more common alter-native diagnoses were traumatic and intratumor bleeds orstroke complicated by a bleed we performed additionalanalyses to explore the effects of different code inclusioncriteria on numbers of cases numbers of true-positives andPPV for ICH and SAH For both ICH and SAH we couldincrease PPV but at the expense of failing to detect some true-positive cases However numbers of cases were too small todraw firm conclusions (figure 4)

Stroke type and subtype distributionsWith information from the EPR a stroke specialist was able toassign a pathologic stroke type to 177 of the 178 true-positivestroke cases 149 (84) were ischemic 11 (6) ICH and 17(10) SAH Depending on the subclassification system usedan ischemic subtype could be determined for 37ndash87 ofcases an ICH subtype for 27ndash100 of cases and SAHsubtype for 94 of cases (figure e-3 dryadorg105061dryadw9ghx3fk0)

DiscussionOur results suggest that stroke cases and cases of ischemicstroke type can be ascertained in UKB through linked codeddata with sufficient accuracy for use in many genetic andepidemiologic research studies without further expert vali-dation since PPVs were generally at least 80 despite

Figure 2 Positive predictive values (PPVs) of stroke codes

PPVs of stroke codes stratified by code source (A) and code type (B) Primary position includes primary care codes where no code position is specified andonly primary position hospital admission codes CI = confidence interval ICH = intracerebral hemorrhage SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e703

stringent adjudication criteria Primary care is an importantcode source with ge13 of the cases ascertained exclusivelyvia primary care data in this UK setting Code accuracyappeared slightly better for hospital admission compared toprimary care codes There were insufficient data to draw firmconclusions regarding the accuracy of death record orhemorrhagic stroke type codes Including only the primaryposition codes from hospital admission data increased PPVwithout loss of a substantial proportion of true-positivecases In the subpopulation of Scotland studied only around60 of codes were specific for a stroke type however usingall available relevant medical information from the EPR anexpert adjudicator could assign a stroke type to 99 anda more detailed stroke subtype to between a third and almost90 of cases depending on the subclassification systemused A higher proportion of participants had a specifiedstroke type code in the English and Welsh data but theaccuracy of these codes needs further investigation We alsodemonstrated that at least in this setting considering onlythe administrative accuracy of stroke diagnosis codes (ratherthan overall accuracy based on an expert review of the fullEPR) may give falsely inflated PPVs

Acceptable levels of accuracy and the relative importanceof different accuracy metrics depend on the context8 UKBis primarily used for research into the genetic and non-genetic determinants of disease3 In such analyses it isimportant to ensure that a high proportion of participantsidentified as disease cases truly do have the disease (high

PPV) to minimize bias in effect estimates while aiming tooptimize statistical power by ascertaining as many true-positive cases as possible (optimizing sensitivity but notnecessarily maximizing it as this may compromise thePPV) A high specificity (the proportion of participantswithout the disease that do not receive a stroke code) iscrucial in obtaining a high PPV but is not sufficient in andof itself In population-based prospective cohorts where theproportion of all participants who are true-positives forthe disease (in this case stroke) is generally low the pro-portion of all participants incorrectly classified as havingstroke (false-positives) will generally be low (giving highspecificity) even if the absolute number of false-positives ishigh compared to the absolute number of true-positives(low PPV)8 Providing appropriate codes are used boththe specificity and negative predictive value of routinelycollected health care data to identify disease casesin population-based studies are usually very high(96ndash100)4 For these reasons we focused on estimatingthe PPV of using routinely collected health care data toidentify stroke cases in UKB and on assessing the effects ofdifferent code and source selections on both PPV andnumber of true-positive cases

Strengths of this study include (1) the overall large number ofparticipants (2) inclusion of primary care and death recordcodes for which accuracy data have previously been limited(3) creation of vignettes using preset criteria from the EPR for97 of otherwise eligible stroke-coded cases thus avoiding

Figure 3 Assessing administrative vs overall accuracy

High clinical certainty mentions of stroke ldquostrokerdquo ldquoprobable strokerdquo ldquopresumptive strokerdquo ldquoconsistent with strokerdquo ldquocompatible with strokerdquo ldquolikely strokerdquoldquotreated as strokerdquo or equivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Medium clinicalcertainty mentions of stroke including above plus ldquopossible strokerdquo ldquosuspected strokerdquo ldquoimpression of strokerdquo ldquosuggestive of strokerdquo ldquoquery strokerdquo orequivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Low clinical certainty mentions of strokeinclude above plus ldquoTIArdquo and ldquotransient ischaemic attackrdquo with any level of certainty preceding it The hierarchy of clinical certainty was based on the ICD-10clinical coding instruction manual (isdscotlandorgProducts-and-ServicesTerminology-ServicesClinical-Coding- Guidelines April 2010 version) which isused by the coding departments in UK hospitals CI = confidence interval PPV = positive predictive value

e704 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

the bias of selecting participants with a higher prior proba-bility of having had a stroke (4) blinding of adjudicators tocodes and each otherrsquos diagnoses (5) assessment of inter-adjudicator reliability (6) further analysis of false-positivecodes and (7) assessment of both administrative and overallaccuracy demonstrating the importance of an expert-ledreference standard in validation studies

There are some limitations (1) small numbers precludedrobust conclusions about the accuracy of death record andhemorrhagic stroke codes (2) our lack of access to the fullfree text EPRs held exclusively by the primary care systemmay have led to slightly more conservative PPV estimates(although we would not expect this to be a major issue forstroke) as almost all stroke cases have acute inpatient oroutpatient management in secondary care (in keeping withthis is that we were able to find secondary care records for97 of the cases) and information provided from primaryto secondary care was available in the hospital EPR (3) thepotential difficulty in differentiating between definite false-positive and uncertain cases from retrospective review ofmedical records which may have further reduced the es-timated PPVs and (4) the restriction of the current studyto a Scottish subpopulation of the UKB participants (al-though we have no reason to suspect substantial variationin the results considering that the code generation processacross the United Kingdom follows similar pathways andgiven the broadly similar distribution of code sources in

the wider UKB population across England Scotland andWales)

The present study is limited to validating the accuracy of codesfor symptomatic stroke (as defined by the WHO) Subclinicalcerebrovascular disease (such as that detected by brain imaging)which is more common particularly in older people9 would notbe readily ascertained by the linked coded health care datasources used to follow the health of UKB participants but can bedetected through brain imaging conducted as part of the UKBmultimodal imaging study of 100000 of its participants

These results are largely consistent with those from our earliersystematic review4 where PPVs for stroke-specific ICD-10hospital admission codes ranged from 79 to 831011 andfor ischemic stroke from 86 to 88101213 Our results forICH and SAH code accuracy among a small number of par-ticipants were worse than those from the 2 previous UnitedKingdomndashbased studies of ICD-10 or Read codes1314 Thismay be in part because these studies checked only (or mainly)for administrative accuracy hence inflating the PPVs Ourresults were similar to a Danish study which also used anexpert review of medical records to check for overall accu-racy10 To our knowledge there are no previous validationstudies of stroke-specific UK primary care Read codes ordeath record codes for diagnosis of all types of stroke nor ofthe effect of combining primary care and hospital admissioncoded data

Figure 4 Exploratory analyses to improve accuracy of hemorrhagic stroke codes

Excluding other stroke code sameday excluded cases with a diagnostic code for gt1 stroke pathologic type on the sameday This was done because a patientwho has one pathologic stroke type (eg ischemic stroke) can sometimes develop a complication and subsequent brain scan appearances similar to anotherpathologic stroke type (eg a patient with ischemic stroke can have a bleed in the brain as a result of the ischemic stroke which could lead to a false diagnosisof an intracerebral hemorrhage [ICH]) CI = confidence interval PPV = positive predictive value SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e705

Further work will be required to further explore the accuracyof hemorrhagic stroke and death record codes in largernumbers of cases by expanding this work to other UKB re-cruitment locations which will also enable assessment of thegeneralizability of these results to other regions of the UnitedKingdom In our Scottish subpopulation while over one-thirdof the codes were not specific for a stroke type expert adju-dication allowed a stroke type to be assigned for almost allcases Future studies should investigate ways of automatingthis process as well as methods for determining more detailedstroke subtypes

Our results suggest that stroke and ischemic stroke cases inUKB can be identified through linked coded data withsufficient accuracy for many genetic and epidemiologicstudies while there are currently insufficient data to drawdefinite conclusions about the accuracy of hemorrhagicstroke codes

AcknowledgmentThe authors thank the UK Biobank participants UK Biobankscientific project and data management teams in OxfordStockport and Edinburgh and the UK Biobank strokeoutcomes working group (Oxford University Valerie BeralYiping Chen Zhengming Chen Jane Green Mary KrollSarah Lewington Peter Rothwell Lucy Wright EdinburghUniversity Martin Dennis Joanna Wardlaw Sarah Wild)This research was awarded the University of EdinburghArthur Fonville prize

Study fundingThis work was supported by salary from UK Biobank andHealth Data Research UK (HDRUK) fellowship (KR)salary from Dementias Platform UK (DPUK) clinical fel-lowship (KB) salary from Medical Research Council Clin-ical Research Training Fellowship (MRP0018231) (TW)salary from UK Biobank and DPUK (RF CS) salary fromUK Biobank (KW RW QZ) and salary from UK Bio-bank Scottish Funding Council DPUK and HDRUK(CLMS) (Chief Scientist for UK Biobank) Project man-agement and administration funding provided by UKBiobank

DisclosureK Rannikmae K Ngoh K Bush R Salman F Doubal RFlaig D Henshall A Hutchison J Nolan S Osborne NSamarasekera C Schnier W Whiteley T Wilkinson KWilson R Woodfield and Q Zhang report no disclosuresrelevant to the manuscript N Allen was senior epidemiologistwhen this study was conducted C Sudlow was chief scientistfor the UK Biobank study when this study was conducted Goto NeurologyorgN for full disclosures

Publication historyReceived by Neurology July 4 2019 Accepted in final formJanuary 27 2020

Appendix Authors

Name Location Contribution

KristiinaRannikmaePhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition analysis andinterpretation of data

Kenneth Ngoh UniversityofEdinburgh

Substantial contributions toacquisition analysis andinterpretation of data

Kathryn Bush UniversityofEdinburgh

Substantial contributions to theanalysis and interpretation of data

Rustam Al-Shahi SalmanPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Fergus DoubalPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Robin Flaig UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

David EHenshall

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

AidanHutchison

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

John Nolan UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Scott Osborne UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

NeshikaSamarasekeraPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

ChristianSchnier PhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Will WhiteleyPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Tim Wilkinson UniversityofEdinburgh

Substantial contributions to theconception and design of the work andinterpretation of data

Kirsty Wilson UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

RebeccaWoodfield PhD

UniversityofEdinburgh

Substantial contributions to theconception of the work andinterpretation of data

Qiuli ZhangPhD

UniversityofEdinburgh

Substantial contributions to theconception and design acquisitionanalysis and interpretation of data

Naomi AllenPhD

Universityof Oxford

Substantial contributions to theacquisition and interpretation of data

Cathie LMSudlow PhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition and interpretation of data

e706 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

References1 Lozano R Naghavi M Foreman K et al Global and regional mortality from 235

causes of death for 20 age groups in 1990 and 2010 a systematic analysis for theGlobal Burden of Disease Study 2010 Lancet 20123802095ndash2128

2 Burton PR Hansell AL Fortier I et al Size matters just how big is BIG quantifyingrealistic sample size requirements for human genome epidemiology Int J Epidemiol2009263ndash273

3 Sudlow C Gallacher J Allen N et al UK Biobank an open access resource foridentifying the causes of a wide range of complex diseases of middle and old age PLoSMed 201512e1001779

4 Woodfield R Grant I UK Biobank Stroke Outcomes Group UK Biobank Follow-Upand OutcomesWorking Group Sudlow CL Accuracy of electronic health record datafor identifying stroke cases in large-scale epidemiological studies a systematic reviewfrom the UK Biobank stroke outcomes group PLoS One 201510e0140533

5 WHO MONICA Project Investigators The World Health Organization MONICAproject (Monitoring Trends and Determinants in Cardiovascular Disease) J ClinEpidemiol 198841105ndash114

6 Ford DV Jones KH Verplancke JP et al The SAIL Databank building a nationalarchitecture for e-health research and evaluation BMC Health Serv Res 20099157

7 Viera AJ Garrett JM Understanding interobserver agreement the kappa statisticFam Med 200537360ndash363

8 Chubak J Pocobelli G Weiss NS Tradeoffs between accuracy measures for electronichealth care data algorithms J Clin Epidemiol 201265343ndash349e2

9 KirkmanMAMahattanakul W Gregson BAMendelow AD The accuracy of hospitaldischarge coding for hemorrhagic stroke Acta Neurol Belg 2009109114ndash119

10 Johnsen SP Overvad K Soslashrensen HT Tjoslashnneland A Husted SE Predictive value ofstroke and transient ischemic attack discharge diagnoses in the Danish NationalRegistry of Patients J Clin Epidemiol 200255602ndash607

11 Krarup LH Boysen G Janjua H Prescott E Truelsen T Validity of stroke diagnosesin a national register of patients Neuroepidemiology 200728150ndash154

12 Haesebaert J Termoz A Polazzi S et al Can hospital discharge databases be used tofollow ischemic stroke incidence Stroke 2013441770ndash1774

13 Wright FL Green J Canoy D et al Vascular disease in women comparison ofdiagnoses in hospital episode statistics and general practice records in England BMCMed Res Methodol 201212161

14 Gaist D Wallander MA Gonzalez-Perez A Garcia-Rodriguez L Incidence of hem-orrhagic stroke in the general population validation of data from the Health Im-provement Network Pharmacoepidemiol Drug Saf 201322176ndash182

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e707

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology

Page 2: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

Stroke is the second commonest cause of death worldwideand a major global cause of disability1 Very large prospectivepopulation-based studies are needed to improve our un-derstanding of its risk factors and causal associations2

UK Biobank (UKB) is a prospective population-based cohortstudy with extensive phenotypic and genotypic information ongt500000 participants from England Scotland and Wales(ukbiobankacuk) It is an open-access resource established tofacilitate research into the determinants of a wide range of healthoutcomes particularly those of relevance inmiddle and older age3

A cost-effective way of following UKB participants for diseaseoutcomes is via linkages to routinely collected coded nationaladministrative health datasets UKB receives regularly upda-ted linkages to national hospital admission and death recorddata for all participants and to primary care data for a largeand increasing subset However appropriate use of these datain research studies requires understanding of their accuracyIn a systematic review of published studies we found thatstroke-specific diagnosis codes in hospital admission anddeath record data generally have good accuracy However thestudies varied widely in their settings and methodology withvery limited data about the accuracy of primary care codes orthe effect of combining different data sources4

We therefore conducted a validation study to assess the ac-curacy of ascertainment of incident stroke cases in UKB vialinked coded national administrative health datasets (in-cluding primary care data) compared with diagnoses assignedfollowing adjudication by clinical experts with access to thefull free-text electronic patient records (EPRs)

MethodsStudy populationWe conducted the study in a subpopulation of 17249 UKBparticipants in the Lothian region of southeast Scotland all ofwhom are linked to national administrative health datasets (in-cluding hospital primary care and death record codes) Thisregion encompasses the city of Edinburgh where one of UKBrsquosrecruitment centers was located Within this subpopulation weidentified all those with at least one code in their linked health datathat indicated a stroke diagnosis after their recruitment to UKB

Stroke definitionWe defined stroke according to the WHO definition ldquorapidlydeveloping clinical signs of focal (or global) disturbance of ce-rebral function lasting more than 24 hours or leading to deathwith no apparent cause other than that of vascular originrdquo5

Code selection and sourcesWe selected and included relevant codes from hospital ad-mission primary care and death records up to the end ofSeptember 2015 the date at which data were complete for allsources at the time of this study

In the United Kingdom hospital admissions and death recordsare coded by specialized medical coders who use the In-ternational Classification of Diseases (ICD currently version10) coding system to assign a primary code to the main con-dition resulting in a particular admitted episode of care andsecondary codes to other conditions Primary care data arecoded by general practice clinical or administrative staff duringclinical encounters or on receipt of information from elsewhere(eg hospital inpatient and specialist outpatient settings)

Primary care data in the United Kingdom currently use theRead coding system (version 2 in Scotland during the timeperiod of relevance in this study)

We selected relevant stroke codes from the ICD-10 and Readversion 2 coding systems aiming to ascertain cases of stroke witha high positive predictive value (PPV) (ie to minimize theproportion of false-positives among the stroke cases) whileaiming to ascertain as many as possible of the true cases(ie optimizing sensitivity) Our code selections were informedby the results of a systematic review4 browsing of theTechnologyReference Data Update Distribution Service (isdhscicgovuktrud3userguestgroup0home) the Secure Anonymised In-formation Linkage databank cerebrovascular diseases dictionary6

the Quality Outcomes Framework indicator sets for stroke(wwwwalesnhsuk) and the CALIBER online data portal(wwwcaliberresearchorg) as well as manual review of code listsby experts (table e-1 dryadorg105061dryadw9ghx3fk0)

We only validated diagnoses for cases with their first incidentstroke defined as the first-ever-in-a-lifetime occurrence ofa WHO-defined stroke according to codes from linked na-tional administrative datasets We excluded participants whoself-reported a stroke at baseline recruitment or who hada stroke code in their linked health care data predating re-cruitment to UKB This was because (1) incident diseasecases (ie new-onset disease arising in those without a historyof the relevant condition) are of most research interest ina prospective population-based study recruiting mainlyhealthy volunteers and (2) a large proportion of stroke casesidentified through codes or self-reported stroke status arisingprior to recruitment (prevalent cases) predated the EPRsystem used for validation We based our analyses on theearliest code(s) for each participant from any data source

GlossaryCI = confidence interval EPR = electronic patient record ICD = International Classification of Diseases ICH = intracerebralhemorrhage PPV = positive predictive value SAH = subarachnoid hemorrhage UKB = UK Biobank

e698 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

Process of assigning an expert diagnosisFor each included participant with a stroke code we extractedanonymized and created a single document (hereafter referredto as a ldquovignetterdquo) from all available relevant medical in-formation from the secondary care EPR system includingoutpatient clinic letters hospital discharge summaries andformal radiologic investigation reports (figure e-1 dryadorg105061dryadw9ghx3fk0) We considered information rele-vant if it was associated with a hospital admission leading to thecode or if it was within 2 months before or after the date ofa relevant primary care code A 2-month windowwas chosen toreflect local clinical practice during the time period of the studypatients in NHS Lothian with a suspected stroke or TIA notrequiring hospital admission would have been referred to theneurovascular specialist outpatient clinic where they wouldusually be seen within 1 week in keeping with nationalguidelines (the National Institute for Health and Care Excel-lence guidance wwwniceorgukguidancecg68 and theScottish Stroke Care Audit Standards strokeauditscotnhsukQualityScottish_Stroke_Care_Standardshtml)

Following this specialist clinic visit correspondence to thepatientrsquos general practitioner would generally appear on thesecondary care EPR systemwithin days to a fewweeks providingthe clinical information required to validate primary care codes

We developed an adjudication form which included a sum-mary for adjudicators of criteria for the diagnosis of stroke itspathologic types (ischemic stroke intracerebral hemorrhage[ICH] subarachnoid hemorrhage [SAH]) and subtypes(appendix e-1 dryadorg105061dryadw9ghx3fk0)

Six consultant stroke specialists (hereafter referred to asldquoadjudicatorsrdquo) completed adjudication forms based on theinformation in the EPR-derived vignettes They did not haveaccess to review the brain imaging themselves but formal scanreports were included in the vignettes One adjudicator (KR)completed the forms for all participants and 5 other adjudi-cators (NS CS RA-SS FD WW) independentlycompleted the forms (each for a subset) to allow assessmentof interadjudicator agreement All adjudicators were blindedto which specific codes had led to ascertainment of any caseand to each otherrsquos diagnoses The first adjudicatorrsquos di-agnoses were used as the reference standard for the primaryanalyses in this study

Data analyses

Participantsrsquo characteristicsWe summarized participantsrsquo sex and median age at re-cruitment and at the time the code was assigned

Code source and type identificationWe calculated the number and proportion of stroke-codedparticipants arising from each data source We also stratifiedthe codes into those specifying a stroke type (ischemic vs ICHvs SAH) vs those indicating an unspecified type of stroke and

calculated the numbers and proportions of these arising fromeach source We compared these findings with the sameanalyses for the larger subset of all UKB participants withlinkage to the relevant data sources in England Scotland andWales

Assessment of interadjudicator agreementWe used percent agreement and Cohen kappa (ĸ)7 to mea-sure interadjudicator agreement comparing one adjudicatorrsquos(KR) diagnoses with a second (one of NS CS RA-SSFD WW) for the following stroke vs not strokeTIA vsnot and in cases where both adjudicators agreed that thediagnosis was stroke ischemic stroke vs ICH vs SAH vs un-certain stroke type

Calculating code accuracyThe main measure of accuracy assessed in this study was PPV(the proportion of all stroke-coded cases ascertained that weretrue-positive cases) While we could not directly assess sensi-tivity we were able to assess the effects on both PPV and thenumber of true-positive cases (a higher number of true-positives indicates higher sensitivity) of different code combi-nation strategies for ascertaining cases We categorized thestroke-coded cases as true-or false-positives based on adjudi-catorsrsquo diagnoses We calculated the PPV and its 95 confi-dence interval (CI) using the Clopper Pearson exact methodconsidering both (1) only an adjudicator diagnosis of strokea true-positive and (2) an adjudicator diagnosis of stroke orTIA a true-positive To understand the contribution of eachsource of codes we calculated PPVs for each code sourceseparately (hospital admission vs primary care vs death recorddata) and for their combinations and explored the effect (bothon PPV and on true-positive case numbers) of including onlycodes in the primary position for hospital admission data

We also calculated PPVs for ischemic and hemorrhagic stroketypes and explored the effect (both on PPV and on true-positive case numbers) of restricting hospital data to codes inthe primary position and of treating all unspecified stroke typecodes as ischemic stroke codes

In further analyses we assessed administrative vs overall codeaccuracy and the effect on resulting PPVs as we hypothesizedthat this may explain some of the variability of the resultsamong previous validation studies Some validation studies arebased on the assumption that if the diagnostic code reflects thediagnosis mentioned somewhere in the EPR (often for ex-ample the discharge summary for a particular hospital admis-sion) then this is sufficient to confirm the accuracy of thecode4 We refer to this as the administrative accuracy ofthe code Clinicians will however appreciate that assessing theaccuracy of a diagnostic code is more nuanced A patientrsquos truediagnosis may emerge over time after encounters with severaldifferent clinicians with varying levels of expertise and withaccess to different amounts of relevant information hence theEPR can contain inconsistencies which may require specialistclinical knowledge to detect andor resolve Hence a specialist

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e699

physicianrsquos review of all relevant parts of the complete EPR willgive a more accurate picture of the true diagnosis and so of thecode accuracy which we refer to as the overall accuracy Weassessed administrative accuracy by having a primary carephysician in our team review the EPR-derived vignettes formentions of stroke diagnoses We assessed overall accuracy byhaving expert adjudicators (consultant level stroke specialists)review the EPR-derived vignettes and derive reference standarddiagnoses based on all available information and their clinicalinterpretation of this information (see Process of assigning anexpert diagnosis) We then assessed whether and by howmuch the administrative vs overall accuracy (PPV) differed forour selected stroke codes

We performed all statistical analyses in R (r-projectorg)

Further exploration of false-positive codesWe compared characteristics of participants with a true- vsa false-positive code and reviewed the vignettes of false-positive cases considering their alternative diagnoses andassessing possible reasons for them being assigned a strokecode Informed by these assessments we evaluated alternativecode combinations that we hypothesized may improve codeaccuracy

Stroke type and subtype distributionsWe further assessed the proportion of true-positive cases thatcould be assigned a pathologic stroke type (ischemic strokeICH SAH) and more detailed subtype by the stroke specialistadjudicator and their frequency distributions

Standard protocol approvals registrationsand patient consentsAll procedures performed in studies involving human partic-ipants were in accordance with the ethical standards of theinstitutional andor national research committee and with the1964 Helsinki declaration and its later amendments or com-parable ethical standards As part of the UKB recruitmentprocess informed consent was obtained from all individualparticipants included in the study

Data availabilityAnonymized summary data based on this study not publishedwithin this article can be shared (subject to approval fromUKB) on request from any qualified investigator unless lim-ited by ethical or legal restrictions Any bona fide researchercan access the UKB resource for research that aims to benefitthe publicrsquos health (see ukbiobankacuk)

ResultsStudy populationIn the subpopulation of 17249 participants 232 had a relevantstroke code occurring after recruitment to UKB Of these 225(97) had available information in their EPR to create a vi-gnette and were included in the study analyses (figure 1)

Participantsrsquo characteristicsOf the 225 stroke-coded cases 111 (49) were female Me-dian age at recruitment to UKB was 63 years (range 41ndash70

Figure 1 Selection of included UK Biobank (UKB) participants

GP = general practitioner NHS = National Health Service

e700 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

years) and at time of the stroke code was 67 years (range45ndash76 years) (table 1)

Code sources and typesOf the 225 stroke-coded cases 67 (30) received a code fromhospital admission data only 87 (39) from primary caredata only 64 (28) from both hospital admission and pri-mary care data 6 (3) from death records only and one fromboth hospital admission data and death records (table 2 andfigure e-2 dryadorg105061dryadw9ghx3fk0) For thelarger subset of all UKB participants with linkage to the rel-evant data sources these proportions were similar for Scot-land but a higher proportion of participants in England andWales had hospital-only codes (table 2)

As regards stroke types 131225 (58) of stroke-coded caseshad a stroke type-specific code (ischemic ICH or SAH)while the remaining 94 (42) had unspecified stroke codesThe proportion of cases with unspecified stroke type codeswas higher among those ascertained from primary care (105151 [70]) than from hospital admission data (38132[29]) and all death record codes were stroke-type specific(for these code-source level estimates a case would becounted ge1 if occurring in ge1 source) (table 3 and figure e-2dryadorg105061dryadw9ghx3fk0) Among the largersubset of all UKB participants with relevant linked data theproportion of cases with an unspecified stroke code wassimilar for Scotland but lower for England and Wales (19and 30 respectively) (table 3)

Table 1 Participant characteristics

All coded cases (n = 225) True-positive cases (n = 178) False-positive cases (n = 47)

Female 49 (111225) 47 (84178) 57 (2747)

Recruitment age y 63 (41ndash70) 63 (41ndash70) 64 (46ndash69)

Age at code y 67 (45ndash76) 67 (45ndash76) 67 (52ndash76)

Primary care code only 33 (75225) 26 (47178) 60 (2847)

Unspecified stroke code 42 (86225) 38 (68178) 38 (1847)

Hemorrhagic stroke code 22 (49225) 20 (35178) 30 (1447)

Code date lt2011 34 (76225) 33 (58178) 38 (1847)

Dead by end of follow-up (any cause) 8 (18225) 7 (13178) 11 (547)

Values are (n) or median (range)

Table 2 Code sources in the validation study subpopulation and in the larger subset of all UK Biobank (UKB) participantslinked to relevant health care datasets in Scotland England and Wales

UKB participants ScotlandLothian (n = 17249)

UKB participantsScotland alla (n = 31426)

UKB participantsEngland alla (n = 18494)

UKB participantsWales alla (n = 21346)

Total stroke codes 225 553 197 301

Hospital only codes 30 (67225) 27 (147553) 42 (82197) 43 (129301)

Primary care only codes 39 (87225) 44 (241553) 29 (58197) 39 (116301)

Death record only codes 3 (6225) 1 (7553) 0 (0197) 3 (10301)

Hospital and primary carecodes

28 (64225) 28 (156553) 29 (57197) 14 (43301)

Hospital and death recordcodes

0 (1225) 0 (2553) 0 (0197) 1 (2301)

Primary care and deathrecord codes

0 (0225) 0 (0553) 0 (0197) 0 (0301)

Hospital and primary careand death record codes

0 (0225) 0 (0553) 0 (0197) 0 (1301)

Values are (n)a We included UKB participants with linkage to hospital admissions data death records and primary care data (whereas around half of UKB participants arelinked to primary care data using Read V2 or V3 coding systems we included only practices using the Read V2 coding system for ease of comparison)

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e701

Interadjudicator agreementAll vignettes were independently reviewed by 2 adjudicators Forassigning a diagnosis of stroke vs not interadjudicator agreementwas 91 and Cohen kappa very good at 07 (95 CI 06ndash08)for strokeTIA vs not agreementwas 93 andCohen kappa verygood at 07 (95 CI 06ndash09) Agreement for assigning a stroketype (ischemic stroke vs ICH vs SAH vs uncertain stroke type)was 99 and Cohen kappa excellent at 098 (95 CI 09ndash1)

Code accuracyThe overall PPV of case ascertainment for all 225 stroke-coded cases from all sources combined was 79 (95 CI73ndash84) When broken down by code source PPV washighest for the 132 cases with hospital admission codes (8995 CI 82ndash94) and higher still when limiting analyses toprimary position hospital codes only (94 95 CI88ndash98) at the expense of losing a few (lt10) true-positive cases PPV for the 151 cases with primary care codeswas 80 (95 CI 72ndash86) Only 7 cases had death recordcodes with PPV 57 (95 CI 18ndash90) the wide CIsindicating limited precision of this estimate For cases withboth a hospital admission and a primary care code PPV wasvery high (97 95 CI 89ndash100) but only a third of thecases fell into this category (figure 2A)

As regards the accuracy of identifying stroke pathologic type(for these estimates a case would be counted in ge1 analysis if ithad ge1 unique stroke type code) PPV among the 88 partic-ipants with an ischemic stroke code was 83 (95 CI74ndash90) restricting to primary position codes (for thosecases from hospital admission data) increased this to 86 (95CI 76ndash93) with loss of lt10 of true-positive cases (figure2B) Given that ischemic stroke is the most common patho-logic stroke type and hence an unspecified stroke code is muchmore likely to signify an ischemic rather than a hemorrhagic

stroke case we calculated the PPV for ischemic stroke of the184 ischemic and unspecified stroke codes combined Whilethis resulted in a slightly lower PPV of 80 (95 CI73ndash85) it approximately doubled the number of true-positive ischemic stroke cases identified The proportion oftrue-positive ischemic stroke cases among all cases with anunspecified code was 77 Accuracy of hemorrhagic strokecodes was lower but the small numbers of coded cases resultedin limited precision PPV among 26 participants with an ICHcode was 42 (95 CI 23ndash63) and among 24 participantswith a SAH code was 71 (95 CI 49ndash87) Restricting toprimary position codes increased the PPV of both ICH andSAH codes without losing any true-positive cases

PPV when measuring only administrative accuracy was 93 de-creasing to 79 when measuring overall accuracy This demon-strates the importance of expert-led adjudication in validationstudies as ignoring it would lead to falsely inflated PPVs (figure 3)

Further exploration of false-positive codesThe 47 false-positive stroke-coded cases had a similar pro-portion of unspecified codes and median age (at date of re-cruitment or first stroke code) as true-positives False-positivecases were more likely to have a primary care code and to havedied by the end of follow-up and were slightly more likely tobe female to have a hemorrhagic stroke code and to have anearlier first stroke code date (table 1)

Alternative diagnoses for the 47 false-positives included TIA(1447 30) secondary intracranial bleed not due to SAH orICH (eg traumatic tumor-related bleeding into a cerebralinfarct) (1147 [23]) radiologic finding of a suspected oldvascular lesion (usually incidental) (847 [17]) and pos-sible or probable alternative diagnosis (eg migraine de-myelination seizure or another diagnosis) (1447 [30])

Table 3 Proportion of cases with unspecified vs specified stroke type codes in the validation study subpopulation and inthe larger subset of all UKBiobank (UKB) participants linked to relevant health care datasets in Scotland Englandand Wales

UKB participants ScotlandLothian (n = 17249)

UKB participantsScotland alla (n = 31426)

UKB participants Englandalla (n = 18494)

UKB participants Walesalla (n = 21346)

Total stroke codes 225 553 197 301

Unspecified codes inhospital data

29 (38132) 27 (81305) 7 (10139) 17 (30175)

Unspecified codes inprimary care

70 (105151) 64 (254397) 45 (52115) 50 (80160)

Unspecified codes indeath records

0 (07) 11 (19) na 54 (713)

Unspecified codestotalb

42 (94225) 40 (219553) 19 (37197) 30 (89301)

Values are (n)a We included UKB participants with linkage to hospital admissions data death records and primary care data (whereas around half of UKB participants arelinked to primary care data using Read V2 or V3 coding systems we included only practices using the Read V2 coding system for ease of comparison)b Stroke-coded cases with an unspecified stroke type code (note that if a case has codes from multiple sources including a specified stroke type code inhospital data and an unspecified stroke type code in primary care data or death record data they will count as having a specified code)

e702 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

Among the 15 participants with a false-positive ICH code 715had another primary stroke type plusmn secondary ICH 415intratumor bleed 315 traumatic or other intracranial bleedand 115 no obvious reason for an ICH code Among the 7participants with a false-positive SAH code 47 were traumaticintracranial bleed 27 another primary stroke type plusmn secondarySAH component and 17 asymptomatic aneurysm

Because performance for hemorrhagic stroke codes wassuboptimal having observed that the more common alter-native diagnoses were traumatic and intratumor bleeds orstroke complicated by a bleed we performed additionalanalyses to explore the effects of different code inclusioncriteria on numbers of cases numbers of true-positives andPPV for ICH and SAH For both ICH and SAH we couldincrease PPV but at the expense of failing to detect some true-positive cases However numbers of cases were too small todraw firm conclusions (figure 4)

Stroke type and subtype distributionsWith information from the EPR a stroke specialist was able toassign a pathologic stroke type to 177 of the 178 true-positivestroke cases 149 (84) were ischemic 11 (6) ICH and 17(10) SAH Depending on the subclassification system usedan ischemic subtype could be determined for 37ndash87 ofcases an ICH subtype for 27ndash100 of cases and SAHsubtype for 94 of cases (figure e-3 dryadorg105061dryadw9ghx3fk0)

DiscussionOur results suggest that stroke cases and cases of ischemicstroke type can be ascertained in UKB through linked codeddata with sufficient accuracy for use in many genetic andepidemiologic research studies without further expert vali-dation since PPVs were generally at least 80 despite

Figure 2 Positive predictive values (PPVs) of stroke codes

PPVs of stroke codes stratified by code source (A) and code type (B) Primary position includes primary care codes where no code position is specified andonly primary position hospital admission codes CI = confidence interval ICH = intracerebral hemorrhage SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e703

stringent adjudication criteria Primary care is an importantcode source with ge13 of the cases ascertained exclusivelyvia primary care data in this UK setting Code accuracyappeared slightly better for hospital admission compared toprimary care codes There were insufficient data to draw firmconclusions regarding the accuracy of death record orhemorrhagic stroke type codes Including only the primaryposition codes from hospital admission data increased PPVwithout loss of a substantial proportion of true-positivecases In the subpopulation of Scotland studied only around60 of codes were specific for a stroke type however usingall available relevant medical information from the EPR anexpert adjudicator could assign a stroke type to 99 anda more detailed stroke subtype to between a third and almost90 of cases depending on the subclassification systemused A higher proportion of participants had a specifiedstroke type code in the English and Welsh data but theaccuracy of these codes needs further investigation We alsodemonstrated that at least in this setting considering onlythe administrative accuracy of stroke diagnosis codes (ratherthan overall accuracy based on an expert review of the fullEPR) may give falsely inflated PPVs

Acceptable levels of accuracy and the relative importanceof different accuracy metrics depend on the context8 UKBis primarily used for research into the genetic and non-genetic determinants of disease3 In such analyses it isimportant to ensure that a high proportion of participantsidentified as disease cases truly do have the disease (high

PPV) to minimize bias in effect estimates while aiming tooptimize statistical power by ascertaining as many true-positive cases as possible (optimizing sensitivity but notnecessarily maximizing it as this may compromise thePPV) A high specificity (the proportion of participantswithout the disease that do not receive a stroke code) iscrucial in obtaining a high PPV but is not sufficient in andof itself In population-based prospective cohorts where theproportion of all participants who are true-positives forthe disease (in this case stroke) is generally low the pro-portion of all participants incorrectly classified as havingstroke (false-positives) will generally be low (giving highspecificity) even if the absolute number of false-positives ishigh compared to the absolute number of true-positives(low PPV)8 Providing appropriate codes are used boththe specificity and negative predictive value of routinelycollected health care data to identify disease casesin population-based studies are usually very high(96ndash100)4 For these reasons we focused on estimatingthe PPV of using routinely collected health care data toidentify stroke cases in UKB and on assessing the effects ofdifferent code and source selections on both PPV andnumber of true-positive cases

Strengths of this study include (1) the overall large number ofparticipants (2) inclusion of primary care and death recordcodes for which accuracy data have previously been limited(3) creation of vignettes using preset criteria from the EPR for97 of otherwise eligible stroke-coded cases thus avoiding

Figure 3 Assessing administrative vs overall accuracy

High clinical certainty mentions of stroke ldquostrokerdquo ldquoprobable strokerdquo ldquopresumptive strokerdquo ldquoconsistent with strokerdquo ldquocompatible with strokerdquo ldquolikely strokerdquoldquotreated as strokerdquo or equivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Medium clinicalcertainty mentions of stroke including above plus ldquopossible strokerdquo ldquosuspected strokerdquo ldquoimpression of strokerdquo ldquosuggestive of strokerdquo ldquoquery strokerdquo orequivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Low clinical certainty mentions of strokeinclude above plus ldquoTIArdquo and ldquotransient ischaemic attackrdquo with any level of certainty preceding it The hierarchy of clinical certainty was based on the ICD-10clinical coding instruction manual (isdscotlandorgProducts-and-ServicesTerminology-ServicesClinical-Coding- Guidelines April 2010 version) which isused by the coding departments in UK hospitals CI = confidence interval PPV = positive predictive value

e704 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

the bias of selecting participants with a higher prior proba-bility of having had a stroke (4) blinding of adjudicators tocodes and each otherrsquos diagnoses (5) assessment of inter-adjudicator reliability (6) further analysis of false-positivecodes and (7) assessment of both administrative and overallaccuracy demonstrating the importance of an expert-ledreference standard in validation studies

There are some limitations (1) small numbers precludedrobust conclusions about the accuracy of death record andhemorrhagic stroke codes (2) our lack of access to the fullfree text EPRs held exclusively by the primary care systemmay have led to slightly more conservative PPV estimates(although we would not expect this to be a major issue forstroke) as almost all stroke cases have acute inpatient oroutpatient management in secondary care (in keeping withthis is that we were able to find secondary care records for97 of the cases) and information provided from primaryto secondary care was available in the hospital EPR (3) thepotential difficulty in differentiating between definite false-positive and uncertain cases from retrospective review ofmedical records which may have further reduced the es-timated PPVs and (4) the restriction of the current studyto a Scottish subpopulation of the UKB participants (al-though we have no reason to suspect substantial variationin the results considering that the code generation processacross the United Kingdom follows similar pathways andgiven the broadly similar distribution of code sources in

the wider UKB population across England Scotland andWales)

The present study is limited to validating the accuracy of codesfor symptomatic stroke (as defined by the WHO) Subclinicalcerebrovascular disease (such as that detected by brain imaging)which is more common particularly in older people9 would notbe readily ascertained by the linked coded health care datasources used to follow the health of UKB participants but can bedetected through brain imaging conducted as part of the UKBmultimodal imaging study of 100000 of its participants

These results are largely consistent with those from our earliersystematic review4 where PPVs for stroke-specific ICD-10hospital admission codes ranged from 79 to 831011 andfor ischemic stroke from 86 to 88101213 Our results forICH and SAH code accuracy among a small number of par-ticipants were worse than those from the 2 previous UnitedKingdomndashbased studies of ICD-10 or Read codes1314 Thismay be in part because these studies checked only (or mainly)for administrative accuracy hence inflating the PPVs Ourresults were similar to a Danish study which also used anexpert review of medical records to check for overall accu-racy10 To our knowledge there are no previous validationstudies of stroke-specific UK primary care Read codes ordeath record codes for diagnosis of all types of stroke nor ofthe effect of combining primary care and hospital admissioncoded data

Figure 4 Exploratory analyses to improve accuracy of hemorrhagic stroke codes

Excluding other stroke code sameday excluded cases with a diagnostic code for gt1 stroke pathologic type on the sameday This was done because a patientwho has one pathologic stroke type (eg ischemic stroke) can sometimes develop a complication and subsequent brain scan appearances similar to anotherpathologic stroke type (eg a patient with ischemic stroke can have a bleed in the brain as a result of the ischemic stroke which could lead to a false diagnosisof an intracerebral hemorrhage [ICH]) CI = confidence interval PPV = positive predictive value SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e705

Further work will be required to further explore the accuracyof hemorrhagic stroke and death record codes in largernumbers of cases by expanding this work to other UKB re-cruitment locations which will also enable assessment of thegeneralizability of these results to other regions of the UnitedKingdom In our Scottish subpopulation while over one-thirdof the codes were not specific for a stroke type expert adju-dication allowed a stroke type to be assigned for almost allcases Future studies should investigate ways of automatingthis process as well as methods for determining more detailedstroke subtypes

Our results suggest that stroke and ischemic stroke cases inUKB can be identified through linked coded data withsufficient accuracy for many genetic and epidemiologicstudies while there are currently insufficient data to drawdefinite conclusions about the accuracy of hemorrhagicstroke codes

AcknowledgmentThe authors thank the UK Biobank participants UK Biobankscientific project and data management teams in OxfordStockport and Edinburgh and the UK Biobank strokeoutcomes working group (Oxford University Valerie BeralYiping Chen Zhengming Chen Jane Green Mary KrollSarah Lewington Peter Rothwell Lucy Wright EdinburghUniversity Martin Dennis Joanna Wardlaw Sarah Wild)This research was awarded the University of EdinburghArthur Fonville prize

Study fundingThis work was supported by salary from UK Biobank andHealth Data Research UK (HDRUK) fellowship (KR)salary from Dementias Platform UK (DPUK) clinical fel-lowship (KB) salary from Medical Research Council Clin-ical Research Training Fellowship (MRP0018231) (TW)salary from UK Biobank and DPUK (RF CS) salary fromUK Biobank (KW RW QZ) and salary from UK Bio-bank Scottish Funding Council DPUK and HDRUK(CLMS) (Chief Scientist for UK Biobank) Project man-agement and administration funding provided by UKBiobank

DisclosureK Rannikmae K Ngoh K Bush R Salman F Doubal RFlaig D Henshall A Hutchison J Nolan S Osborne NSamarasekera C Schnier W Whiteley T Wilkinson KWilson R Woodfield and Q Zhang report no disclosuresrelevant to the manuscript N Allen was senior epidemiologistwhen this study was conducted C Sudlow was chief scientistfor the UK Biobank study when this study was conducted Goto NeurologyorgN for full disclosures

Publication historyReceived by Neurology July 4 2019 Accepted in final formJanuary 27 2020

Appendix Authors

Name Location Contribution

KristiinaRannikmaePhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition analysis andinterpretation of data

Kenneth Ngoh UniversityofEdinburgh

Substantial contributions toacquisition analysis andinterpretation of data

Kathryn Bush UniversityofEdinburgh

Substantial contributions to theanalysis and interpretation of data

Rustam Al-Shahi SalmanPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Fergus DoubalPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Robin Flaig UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

David EHenshall

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

AidanHutchison

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

John Nolan UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Scott Osborne UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

NeshikaSamarasekeraPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

ChristianSchnier PhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Will WhiteleyPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Tim Wilkinson UniversityofEdinburgh

Substantial contributions to theconception and design of the work andinterpretation of data

Kirsty Wilson UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

RebeccaWoodfield PhD

UniversityofEdinburgh

Substantial contributions to theconception of the work andinterpretation of data

Qiuli ZhangPhD

UniversityofEdinburgh

Substantial contributions to theconception and design acquisitionanalysis and interpretation of data

Naomi AllenPhD

Universityof Oxford

Substantial contributions to theacquisition and interpretation of data

Cathie LMSudlow PhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition and interpretation of data

e706 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

References1 Lozano R Naghavi M Foreman K et al Global and regional mortality from 235

causes of death for 20 age groups in 1990 and 2010 a systematic analysis for theGlobal Burden of Disease Study 2010 Lancet 20123802095ndash2128

2 Burton PR Hansell AL Fortier I et al Size matters just how big is BIG quantifyingrealistic sample size requirements for human genome epidemiology Int J Epidemiol2009263ndash273

3 Sudlow C Gallacher J Allen N et al UK Biobank an open access resource foridentifying the causes of a wide range of complex diseases of middle and old age PLoSMed 201512e1001779

4 Woodfield R Grant I UK Biobank Stroke Outcomes Group UK Biobank Follow-Upand OutcomesWorking Group Sudlow CL Accuracy of electronic health record datafor identifying stroke cases in large-scale epidemiological studies a systematic reviewfrom the UK Biobank stroke outcomes group PLoS One 201510e0140533

5 WHO MONICA Project Investigators The World Health Organization MONICAproject (Monitoring Trends and Determinants in Cardiovascular Disease) J ClinEpidemiol 198841105ndash114

6 Ford DV Jones KH Verplancke JP et al The SAIL Databank building a nationalarchitecture for e-health research and evaluation BMC Health Serv Res 20099157

7 Viera AJ Garrett JM Understanding interobserver agreement the kappa statisticFam Med 200537360ndash363

8 Chubak J Pocobelli G Weiss NS Tradeoffs between accuracy measures for electronichealth care data algorithms J Clin Epidemiol 201265343ndash349e2

9 KirkmanMAMahattanakul W Gregson BAMendelow AD The accuracy of hospitaldischarge coding for hemorrhagic stroke Acta Neurol Belg 2009109114ndash119

10 Johnsen SP Overvad K Soslashrensen HT Tjoslashnneland A Husted SE Predictive value ofstroke and transient ischemic attack discharge diagnoses in the Danish NationalRegistry of Patients J Clin Epidemiol 200255602ndash607

11 Krarup LH Boysen G Janjua H Prescott E Truelsen T Validity of stroke diagnosesin a national register of patients Neuroepidemiology 200728150ndash154

12 Haesebaert J Termoz A Polazzi S et al Can hospital discharge databases be used tofollow ischemic stroke incidence Stroke 2013441770ndash1774

13 Wright FL Green J Canoy D et al Vascular disease in women comparison ofdiagnoses in hospital episode statistics and general practice records in England BMCMed Res Methodol 201212161

14 Gaist D Wallander MA Gonzalez-Perez A Garcia-Rodriguez L Incidence of hem-orrhagic stroke in the general population validation of data from the Health Im-provement Network Pharmacoepidemiol Drug Saf 201322176ndash182

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e707

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology

Page 3: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

Process of assigning an expert diagnosisFor each included participant with a stroke code we extractedanonymized and created a single document (hereafter referredto as a ldquovignetterdquo) from all available relevant medical in-formation from the secondary care EPR system includingoutpatient clinic letters hospital discharge summaries andformal radiologic investigation reports (figure e-1 dryadorg105061dryadw9ghx3fk0) We considered information rele-vant if it was associated with a hospital admission leading to thecode or if it was within 2 months before or after the date ofa relevant primary care code A 2-month windowwas chosen toreflect local clinical practice during the time period of the studypatients in NHS Lothian with a suspected stroke or TIA notrequiring hospital admission would have been referred to theneurovascular specialist outpatient clinic where they wouldusually be seen within 1 week in keeping with nationalguidelines (the National Institute for Health and Care Excel-lence guidance wwwniceorgukguidancecg68 and theScottish Stroke Care Audit Standards strokeauditscotnhsukQualityScottish_Stroke_Care_Standardshtml)

Following this specialist clinic visit correspondence to thepatientrsquos general practitioner would generally appear on thesecondary care EPR systemwithin days to a fewweeks providingthe clinical information required to validate primary care codes

We developed an adjudication form which included a sum-mary for adjudicators of criteria for the diagnosis of stroke itspathologic types (ischemic stroke intracerebral hemorrhage[ICH] subarachnoid hemorrhage [SAH]) and subtypes(appendix e-1 dryadorg105061dryadw9ghx3fk0)

Six consultant stroke specialists (hereafter referred to asldquoadjudicatorsrdquo) completed adjudication forms based on theinformation in the EPR-derived vignettes They did not haveaccess to review the brain imaging themselves but formal scanreports were included in the vignettes One adjudicator (KR)completed the forms for all participants and 5 other adjudi-cators (NS CS RA-SS FD WW) independentlycompleted the forms (each for a subset) to allow assessmentof interadjudicator agreement All adjudicators were blindedto which specific codes had led to ascertainment of any caseand to each otherrsquos diagnoses The first adjudicatorrsquos di-agnoses were used as the reference standard for the primaryanalyses in this study

Data analyses

Participantsrsquo characteristicsWe summarized participantsrsquo sex and median age at re-cruitment and at the time the code was assigned

Code source and type identificationWe calculated the number and proportion of stroke-codedparticipants arising from each data source We also stratifiedthe codes into those specifying a stroke type (ischemic vs ICHvs SAH) vs those indicating an unspecified type of stroke and

calculated the numbers and proportions of these arising fromeach source We compared these findings with the sameanalyses for the larger subset of all UKB participants withlinkage to the relevant data sources in England Scotland andWales

Assessment of interadjudicator agreementWe used percent agreement and Cohen kappa (ĸ)7 to mea-sure interadjudicator agreement comparing one adjudicatorrsquos(KR) diagnoses with a second (one of NS CS RA-SSFD WW) for the following stroke vs not strokeTIA vsnot and in cases where both adjudicators agreed that thediagnosis was stroke ischemic stroke vs ICH vs SAH vs un-certain stroke type

Calculating code accuracyThe main measure of accuracy assessed in this study was PPV(the proportion of all stroke-coded cases ascertained that weretrue-positive cases) While we could not directly assess sensi-tivity we were able to assess the effects on both PPV and thenumber of true-positive cases (a higher number of true-positives indicates higher sensitivity) of different code combi-nation strategies for ascertaining cases We categorized thestroke-coded cases as true-or false-positives based on adjudi-catorsrsquo diagnoses We calculated the PPV and its 95 confi-dence interval (CI) using the Clopper Pearson exact methodconsidering both (1) only an adjudicator diagnosis of strokea true-positive and (2) an adjudicator diagnosis of stroke orTIA a true-positive To understand the contribution of eachsource of codes we calculated PPVs for each code sourceseparately (hospital admission vs primary care vs death recorddata) and for their combinations and explored the effect (bothon PPV and on true-positive case numbers) of including onlycodes in the primary position for hospital admission data

We also calculated PPVs for ischemic and hemorrhagic stroketypes and explored the effect (both on PPV and on true-positive case numbers) of restricting hospital data to codes inthe primary position and of treating all unspecified stroke typecodes as ischemic stroke codes

In further analyses we assessed administrative vs overall codeaccuracy and the effect on resulting PPVs as we hypothesizedthat this may explain some of the variability of the resultsamong previous validation studies Some validation studies arebased on the assumption that if the diagnostic code reflects thediagnosis mentioned somewhere in the EPR (often for ex-ample the discharge summary for a particular hospital admis-sion) then this is sufficient to confirm the accuracy of thecode4 We refer to this as the administrative accuracy ofthe code Clinicians will however appreciate that assessing theaccuracy of a diagnostic code is more nuanced A patientrsquos truediagnosis may emerge over time after encounters with severaldifferent clinicians with varying levels of expertise and withaccess to different amounts of relevant information hence theEPR can contain inconsistencies which may require specialistclinical knowledge to detect andor resolve Hence a specialist

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e699

physicianrsquos review of all relevant parts of the complete EPR willgive a more accurate picture of the true diagnosis and so of thecode accuracy which we refer to as the overall accuracy Weassessed administrative accuracy by having a primary carephysician in our team review the EPR-derived vignettes formentions of stroke diagnoses We assessed overall accuracy byhaving expert adjudicators (consultant level stroke specialists)review the EPR-derived vignettes and derive reference standarddiagnoses based on all available information and their clinicalinterpretation of this information (see Process of assigning anexpert diagnosis) We then assessed whether and by howmuch the administrative vs overall accuracy (PPV) differed forour selected stroke codes

We performed all statistical analyses in R (r-projectorg)

Further exploration of false-positive codesWe compared characteristics of participants with a true- vsa false-positive code and reviewed the vignettes of false-positive cases considering their alternative diagnoses andassessing possible reasons for them being assigned a strokecode Informed by these assessments we evaluated alternativecode combinations that we hypothesized may improve codeaccuracy

Stroke type and subtype distributionsWe further assessed the proportion of true-positive cases thatcould be assigned a pathologic stroke type (ischemic strokeICH SAH) and more detailed subtype by the stroke specialistadjudicator and their frequency distributions

Standard protocol approvals registrationsand patient consentsAll procedures performed in studies involving human partic-ipants were in accordance with the ethical standards of theinstitutional andor national research committee and with the1964 Helsinki declaration and its later amendments or com-parable ethical standards As part of the UKB recruitmentprocess informed consent was obtained from all individualparticipants included in the study

Data availabilityAnonymized summary data based on this study not publishedwithin this article can be shared (subject to approval fromUKB) on request from any qualified investigator unless lim-ited by ethical or legal restrictions Any bona fide researchercan access the UKB resource for research that aims to benefitthe publicrsquos health (see ukbiobankacuk)

ResultsStudy populationIn the subpopulation of 17249 participants 232 had a relevantstroke code occurring after recruitment to UKB Of these 225(97) had available information in their EPR to create a vi-gnette and were included in the study analyses (figure 1)

Participantsrsquo characteristicsOf the 225 stroke-coded cases 111 (49) were female Me-dian age at recruitment to UKB was 63 years (range 41ndash70

Figure 1 Selection of included UK Biobank (UKB) participants

GP = general practitioner NHS = National Health Service

e700 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

years) and at time of the stroke code was 67 years (range45ndash76 years) (table 1)

Code sources and typesOf the 225 stroke-coded cases 67 (30) received a code fromhospital admission data only 87 (39) from primary caredata only 64 (28) from both hospital admission and pri-mary care data 6 (3) from death records only and one fromboth hospital admission data and death records (table 2 andfigure e-2 dryadorg105061dryadw9ghx3fk0) For thelarger subset of all UKB participants with linkage to the rel-evant data sources these proportions were similar for Scot-land but a higher proportion of participants in England andWales had hospital-only codes (table 2)

As regards stroke types 131225 (58) of stroke-coded caseshad a stroke type-specific code (ischemic ICH or SAH)while the remaining 94 (42) had unspecified stroke codesThe proportion of cases with unspecified stroke type codeswas higher among those ascertained from primary care (105151 [70]) than from hospital admission data (38132[29]) and all death record codes were stroke-type specific(for these code-source level estimates a case would becounted ge1 if occurring in ge1 source) (table 3 and figure e-2dryadorg105061dryadw9ghx3fk0) Among the largersubset of all UKB participants with relevant linked data theproportion of cases with an unspecified stroke code wassimilar for Scotland but lower for England and Wales (19and 30 respectively) (table 3)

Table 1 Participant characteristics

All coded cases (n = 225) True-positive cases (n = 178) False-positive cases (n = 47)

Female 49 (111225) 47 (84178) 57 (2747)

Recruitment age y 63 (41ndash70) 63 (41ndash70) 64 (46ndash69)

Age at code y 67 (45ndash76) 67 (45ndash76) 67 (52ndash76)

Primary care code only 33 (75225) 26 (47178) 60 (2847)

Unspecified stroke code 42 (86225) 38 (68178) 38 (1847)

Hemorrhagic stroke code 22 (49225) 20 (35178) 30 (1447)

Code date lt2011 34 (76225) 33 (58178) 38 (1847)

Dead by end of follow-up (any cause) 8 (18225) 7 (13178) 11 (547)

Values are (n) or median (range)

Table 2 Code sources in the validation study subpopulation and in the larger subset of all UK Biobank (UKB) participantslinked to relevant health care datasets in Scotland England and Wales

UKB participants ScotlandLothian (n = 17249)

UKB participantsScotland alla (n = 31426)

UKB participantsEngland alla (n = 18494)

UKB participantsWales alla (n = 21346)

Total stroke codes 225 553 197 301

Hospital only codes 30 (67225) 27 (147553) 42 (82197) 43 (129301)

Primary care only codes 39 (87225) 44 (241553) 29 (58197) 39 (116301)

Death record only codes 3 (6225) 1 (7553) 0 (0197) 3 (10301)

Hospital and primary carecodes

28 (64225) 28 (156553) 29 (57197) 14 (43301)

Hospital and death recordcodes

0 (1225) 0 (2553) 0 (0197) 1 (2301)

Primary care and deathrecord codes

0 (0225) 0 (0553) 0 (0197) 0 (0301)

Hospital and primary careand death record codes

0 (0225) 0 (0553) 0 (0197) 0 (1301)

Values are (n)a We included UKB participants with linkage to hospital admissions data death records and primary care data (whereas around half of UKB participants arelinked to primary care data using Read V2 or V3 coding systems we included only practices using the Read V2 coding system for ease of comparison)

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e701

Interadjudicator agreementAll vignettes were independently reviewed by 2 adjudicators Forassigning a diagnosis of stroke vs not interadjudicator agreementwas 91 and Cohen kappa very good at 07 (95 CI 06ndash08)for strokeTIA vs not agreementwas 93 andCohen kappa verygood at 07 (95 CI 06ndash09) Agreement for assigning a stroketype (ischemic stroke vs ICH vs SAH vs uncertain stroke type)was 99 and Cohen kappa excellent at 098 (95 CI 09ndash1)

Code accuracyThe overall PPV of case ascertainment for all 225 stroke-coded cases from all sources combined was 79 (95 CI73ndash84) When broken down by code source PPV washighest for the 132 cases with hospital admission codes (8995 CI 82ndash94) and higher still when limiting analyses toprimary position hospital codes only (94 95 CI88ndash98) at the expense of losing a few (lt10) true-positive cases PPV for the 151 cases with primary care codeswas 80 (95 CI 72ndash86) Only 7 cases had death recordcodes with PPV 57 (95 CI 18ndash90) the wide CIsindicating limited precision of this estimate For cases withboth a hospital admission and a primary care code PPV wasvery high (97 95 CI 89ndash100) but only a third of thecases fell into this category (figure 2A)

As regards the accuracy of identifying stroke pathologic type(for these estimates a case would be counted in ge1 analysis if ithad ge1 unique stroke type code) PPV among the 88 partic-ipants with an ischemic stroke code was 83 (95 CI74ndash90) restricting to primary position codes (for thosecases from hospital admission data) increased this to 86 (95CI 76ndash93) with loss of lt10 of true-positive cases (figure2B) Given that ischemic stroke is the most common patho-logic stroke type and hence an unspecified stroke code is muchmore likely to signify an ischemic rather than a hemorrhagic

stroke case we calculated the PPV for ischemic stroke of the184 ischemic and unspecified stroke codes combined Whilethis resulted in a slightly lower PPV of 80 (95 CI73ndash85) it approximately doubled the number of true-positive ischemic stroke cases identified The proportion oftrue-positive ischemic stroke cases among all cases with anunspecified code was 77 Accuracy of hemorrhagic strokecodes was lower but the small numbers of coded cases resultedin limited precision PPV among 26 participants with an ICHcode was 42 (95 CI 23ndash63) and among 24 participantswith a SAH code was 71 (95 CI 49ndash87) Restricting toprimary position codes increased the PPV of both ICH andSAH codes without losing any true-positive cases

PPV when measuring only administrative accuracy was 93 de-creasing to 79 when measuring overall accuracy This demon-strates the importance of expert-led adjudication in validationstudies as ignoring it would lead to falsely inflated PPVs (figure 3)

Further exploration of false-positive codesThe 47 false-positive stroke-coded cases had a similar pro-portion of unspecified codes and median age (at date of re-cruitment or first stroke code) as true-positives False-positivecases were more likely to have a primary care code and to havedied by the end of follow-up and were slightly more likely tobe female to have a hemorrhagic stroke code and to have anearlier first stroke code date (table 1)

Alternative diagnoses for the 47 false-positives included TIA(1447 30) secondary intracranial bleed not due to SAH orICH (eg traumatic tumor-related bleeding into a cerebralinfarct) (1147 [23]) radiologic finding of a suspected oldvascular lesion (usually incidental) (847 [17]) and pos-sible or probable alternative diagnosis (eg migraine de-myelination seizure or another diagnosis) (1447 [30])

Table 3 Proportion of cases with unspecified vs specified stroke type codes in the validation study subpopulation and inthe larger subset of all UKBiobank (UKB) participants linked to relevant health care datasets in Scotland Englandand Wales

UKB participants ScotlandLothian (n = 17249)

UKB participantsScotland alla (n = 31426)

UKB participants Englandalla (n = 18494)

UKB participants Walesalla (n = 21346)

Total stroke codes 225 553 197 301

Unspecified codes inhospital data

29 (38132) 27 (81305) 7 (10139) 17 (30175)

Unspecified codes inprimary care

70 (105151) 64 (254397) 45 (52115) 50 (80160)

Unspecified codes indeath records

0 (07) 11 (19) na 54 (713)

Unspecified codestotalb

42 (94225) 40 (219553) 19 (37197) 30 (89301)

Values are (n)a We included UKB participants with linkage to hospital admissions data death records and primary care data (whereas around half of UKB participants arelinked to primary care data using Read V2 or V3 coding systems we included only practices using the Read V2 coding system for ease of comparison)b Stroke-coded cases with an unspecified stroke type code (note that if a case has codes from multiple sources including a specified stroke type code inhospital data and an unspecified stroke type code in primary care data or death record data they will count as having a specified code)

e702 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

Among the 15 participants with a false-positive ICH code 715had another primary stroke type plusmn secondary ICH 415intratumor bleed 315 traumatic or other intracranial bleedand 115 no obvious reason for an ICH code Among the 7participants with a false-positive SAH code 47 were traumaticintracranial bleed 27 another primary stroke type plusmn secondarySAH component and 17 asymptomatic aneurysm

Because performance for hemorrhagic stroke codes wassuboptimal having observed that the more common alter-native diagnoses were traumatic and intratumor bleeds orstroke complicated by a bleed we performed additionalanalyses to explore the effects of different code inclusioncriteria on numbers of cases numbers of true-positives andPPV for ICH and SAH For both ICH and SAH we couldincrease PPV but at the expense of failing to detect some true-positive cases However numbers of cases were too small todraw firm conclusions (figure 4)

Stroke type and subtype distributionsWith information from the EPR a stroke specialist was able toassign a pathologic stroke type to 177 of the 178 true-positivestroke cases 149 (84) were ischemic 11 (6) ICH and 17(10) SAH Depending on the subclassification system usedan ischemic subtype could be determined for 37ndash87 ofcases an ICH subtype for 27ndash100 of cases and SAHsubtype for 94 of cases (figure e-3 dryadorg105061dryadw9ghx3fk0)

DiscussionOur results suggest that stroke cases and cases of ischemicstroke type can be ascertained in UKB through linked codeddata with sufficient accuracy for use in many genetic andepidemiologic research studies without further expert vali-dation since PPVs were generally at least 80 despite

Figure 2 Positive predictive values (PPVs) of stroke codes

PPVs of stroke codes stratified by code source (A) and code type (B) Primary position includes primary care codes where no code position is specified andonly primary position hospital admission codes CI = confidence interval ICH = intracerebral hemorrhage SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e703

stringent adjudication criteria Primary care is an importantcode source with ge13 of the cases ascertained exclusivelyvia primary care data in this UK setting Code accuracyappeared slightly better for hospital admission compared toprimary care codes There were insufficient data to draw firmconclusions regarding the accuracy of death record orhemorrhagic stroke type codes Including only the primaryposition codes from hospital admission data increased PPVwithout loss of a substantial proportion of true-positivecases In the subpopulation of Scotland studied only around60 of codes were specific for a stroke type however usingall available relevant medical information from the EPR anexpert adjudicator could assign a stroke type to 99 anda more detailed stroke subtype to between a third and almost90 of cases depending on the subclassification systemused A higher proportion of participants had a specifiedstroke type code in the English and Welsh data but theaccuracy of these codes needs further investigation We alsodemonstrated that at least in this setting considering onlythe administrative accuracy of stroke diagnosis codes (ratherthan overall accuracy based on an expert review of the fullEPR) may give falsely inflated PPVs

Acceptable levels of accuracy and the relative importanceof different accuracy metrics depend on the context8 UKBis primarily used for research into the genetic and non-genetic determinants of disease3 In such analyses it isimportant to ensure that a high proportion of participantsidentified as disease cases truly do have the disease (high

PPV) to minimize bias in effect estimates while aiming tooptimize statistical power by ascertaining as many true-positive cases as possible (optimizing sensitivity but notnecessarily maximizing it as this may compromise thePPV) A high specificity (the proportion of participantswithout the disease that do not receive a stroke code) iscrucial in obtaining a high PPV but is not sufficient in andof itself In population-based prospective cohorts where theproportion of all participants who are true-positives forthe disease (in this case stroke) is generally low the pro-portion of all participants incorrectly classified as havingstroke (false-positives) will generally be low (giving highspecificity) even if the absolute number of false-positives ishigh compared to the absolute number of true-positives(low PPV)8 Providing appropriate codes are used boththe specificity and negative predictive value of routinelycollected health care data to identify disease casesin population-based studies are usually very high(96ndash100)4 For these reasons we focused on estimatingthe PPV of using routinely collected health care data toidentify stroke cases in UKB and on assessing the effects ofdifferent code and source selections on both PPV andnumber of true-positive cases

Strengths of this study include (1) the overall large number ofparticipants (2) inclusion of primary care and death recordcodes for which accuracy data have previously been limited(3) creation of vignettes using preset criteria from the EPR for97 of otherwise eligible stroke-coded cases thus avoiding

Figure 3 Assessing administrative vs overall accuracy

High clinical certainty mentions of stroke ldquostrokerdquo ldquoprobable strokerdquo ldquopresumptive strokerdquo ldquoconsistent with strokerdquo ldquocompatible with strokerdquo ldquolikely strokerdquoldquotreated as strokerdquo or equivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Medium clinicalcertainty mentions of stroke including above plus ldquopossible strokerdquo ldquosuspected strokerdquo ldquoimpression of strokerdquo ldquosuggestive of strokerdquo ldquoquery strokerdquo orequivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Low clinical certainty mentions of strokeinclude above plus ldquoTIArdquo and ldquotransient ischaemic attackrdquo with any level of certainty preceding it The hierarchy of clinical certainty was based on the ICD-10clinical coding instruction manual (isdscotlandorgProducts-and-ServicesTerminology-ServicesClinical-Coding- Guidelines April 2010 version) which isused by the coding departments in UK hospitals CI = confidence interval PPV = positive predictive value

e704 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

the bias of selecting participants with a higher prior proba-bility of having had a stroke (4) blinding of adjudicators tocodes and each otherrsquos diagnoses (5) assessment of inter-adjudicator reliability (6) further analysis of false-positivecodes and (7) assessment of both administrative and overallaccuracy demonstrating the importance of an expert-ledreference standard in validation studies

There are some limitations (1) small numbers precludedrobust conclusions about the accuracy of death record andhemorrhagic stroke codes (2) our lack of access to the fullfree text EPRs held exclusively by the primary care systemmay have led to slightly more conservative PPV estimates(although we would not expect this to be a major issue forstroke) as almost all stroke cases have acute inpatient oroutpatient management in secondary care (in keeping withthis is that we were able to find secondary care records for97 of the cases) and information provided from primaryto secondary care was available in the hospital EPR (3) thepotential difficulty in differentiating between definite false-positive and uncertain cases from retrospective review ofmedical records which may have further reduced the es-timated PPVs and (4) the restriction of the current studyto a Scottish subpopulation of the UKB participants (al-though we have no reason to suspect substantial variationin the results considering that the code generation processacross the United Kingdom follows similar pathways andgiven the broadly similar distribution of code sources in

the wider UKB population across England Scotland andWales)

The present study is limited to validating the accuracy of codesfor symptomatic stroke (as defined by the WHO) Subclinicalcerebrovascular disease (such as that detected by brain imaging)which is more common particularly in older people9 would notbe readily ascertained by the linked coded health care datasources used to follow the health of UKB participants but can bedetected through brain imaging conducted as part of the UKBmultimodal imaging study of 100000 of its participants

These results are largely consistent with those from our earliersystematic review4 where PPVs for stroke-specific ICD-10hospital admission codes ranged from 79 to 831011 andfor ischemic stroke from 86 to 88101213 Our results forICH and SAH code accuracy among a small number of par-ticipants were worse than those from the 2 previous UnitedKingdomndashbased studies of ICD-10 or Read codes1314 Thismay be in part because these studies checked only (or mainly)for administrative accuracy hence inflating the PPVs Ourresults were similar to a Danish study which also used anexpert review of medical records to check for overall accu-racy10 To our knowledge there are no previous validationstudies of stroke-specific UK primary care Read codes ordeath record codes for diagnosis of all types of stroke nor ofthe effect of combining primary care and hospital admissioncoded data

Figure 4 Exploratory analyses to improve accuracy of hemorrhagic stroke codes

Excluding other stroke code sameday excluded cases with a diagnostic code for gt1 stroke pathologic type on the sameday This was done because a patientwho has one pathologic stroke type (eg ischemic stroke) can sometimes develop a complication and subsequent brain scan appearances similar to anotherpathologic stroke type (eg a patient with ischemic stroke can have a bleed in the brain as a result of the ischemic stroke which could lead to a false diagnosisof an intracerebral hemorrhage [ICH]) CI = confidence interval PPV = positive predictive value SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e705

Further work will be required to further explore the accuracyof hemorrhagic stroke and death record codes in largernumbers of cases by expanding this work to other UKB re-cruitment locations which will also enable assessment of thegeneralizability of these results to other regions of the UnitedKingdom In our Scottish subpopulation while over one-thirdof the codes were not specific for a stroke type expert adju-dication allowed a stroke type to be assigned for almost allcases Future studies should investigate ways of automatingthis process as well as methods for determining more detailedstroke subtypes

Our results suggest that stroke and ischemic stroke cases inUKB can be identified through linked coded data withsufficient accuracy for many genetic and epidemiologicstudies while there are currently insufficient data to drawdefinite conclusions about the accuracy of hemorrhagicstroke codes

AcknowledgmentThe authors thank the UK Biobank participants UK Biobankscientific project and data management teams in OxfordStockport and Edinburgh and the UK Biobank strokeoutcomes working group (Oxford University Valerie BeralYiping Chen Zhengming Chen Jane Green Mary KrollSarah Lewington Peter Rothwell Lucy Wright EdinburghUniversity Martin Dennis Joanna Wardlaw Sarah Wild)This research was awarded the University of EdinburghArthur Fonville prize

Study fundingThis work was supported by salary from UK Biobank andHealth Data Research UK (HDRUK) fellowship (KR)salary from Dementias Platform UK (DPUK) clinical fel-lowship (KB) salary from Medical Research Council Clin-ical Research Training Fellowship (MRP0018231) (TW)salary from UK Biobank and DPUK (RF CS) salary fromUK Biobank (KW RW QZ) and salary from UK Bio-bank Scottish Funding Council DPUK and HDRUK(CLMS) (Chief Scientist for UK Biobank) Project man-agement and administration funding provided by UKBiobank

DisclosureK Rannikmae K Ngoh K Bush R Salman F Doubal RFlaig D Henshall A Hutchison J Nolan S Osborne NSamarasekera C Schnier W Whiteley T Wilkinson KWilson R Woodfield and Q Zhang report no disclosuresrelevant to the manuscript N Allen was senior epidemiologistwhen this study was conducted C Sudlow was chief scientistfor the UK Biobank study when this study was conducted Goto NeurologyorgN for full disclosures

Publication historyReceived by Neurology July 4 2019 Accepted in final formJanuary 27 2020

Appendix Authors

Name Location Contribution

KristiinaRannikmaePhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition analysis andinterpretation of data

Kenneth Ngoh UniversityofEdinburgh

Substantial contributions toacquisition analysis andinterpretation of data

Kathryn Bush UniversityofEdinburgh

Substantial contributions to theanalysis and interpretation of data

Rustam Al-Shahi SalmanPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Fergus DoubalPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Robin Flaig UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

David EHenshall

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

AidanHutchison

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

John Nolan UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Scott Osborne UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

NeshikaSamarasekeraPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

ChristianSchnier PhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Will WhiteleyPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Tim Wilkinson UniversityofEdinburgh

Substantial contributions to theconception and design of the work andinterpretation of data

Kirsty Wilson UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

RebeccaWoodfield PhD

UniversityofEdinburgh

Substantial contributions to theconception of the work andinterpretation of data

Qiuli ZhangPhD

UniversityofEdinburgh

Substantial contributions to theconception and design acquisitionanalysis and interpretation of data

Naomi AllenPhD

Universityof Oxford

Substantial contributions to theacquisition and interpretation of data

Cathie LMSudlow PhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition and interpretation of data

e706 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

References1 Lozano R Naghavi M Foreman K et al Global and regional mortality from 235

causes of death for 20 age groups in 1990 and 2010 a systematic analysis for theGlobal Burden of Disease Study 2010 Lancet 20123802095ndash2128

2 Burton PR Hansell AL Fortier I et al Size matters just how big is BIG quantifyingrealistic sample size requirements for human genome epidemiology Int J Epidemiol2009263ndash273

3 Sudlow C Gallacher J Allen N et al UK Biobank an open access resource foridentifying the causes of a wide range of complex diseases of middle and old age PLoSMed 201512e1001779

4 Woodfield R Grant I UK Biobank Stroke Outcomes Group UK Biobank Follow-Upand OutcomesWorking Group Sudlow CL Accuracy of electronic health record datafor identifying stroke cases in large-scale epidemiological studies a systematic reviewfrom the UK Biobank stroke outcomes group PLoS One 201510e0140533

5 WHO MONICA Project Investigators The World Health Organization MONICAproject (Monitoring Trends and Determinants in Cardiovascular Disease) J ClinEpidemiol 198841105ndash114

6 Ford DV Jones KH Verplancke JP et al The SAIL Databank building a nationalarchitecture for e-health research and evaluation BMC Health Serv Res 20099157

7 Viera AJ Garrett JM Understanding interobserver agreement the kappa statisticFam Med 200537360ndash363

8 Chubak J Pocobelli G Weiss NS Tradeoffs between accuracy measures for electronichealth care data algorithms J Clin Epidemiol 201265343ndash349e2

9 KirkmanMAMahattanakul W Gregson BAMendelow AD The accuracy of hospitaldischarge coding for hemorrhagic stroke Acta Neurol Belg 2009109114ndash119

10 Johnsen SP Overvad K Soslashrensen HT Tjoslashnneland A Husted SE Predictive value ofstroke and transient ischemic attack discharge diagnoses in the Danish NationalRegistry of Patients J Clin Epidemiol 200255602ndash607

11 Krarup LH Boysen G Janjua H Prescott E Truelsen T Validity of stroke diagnosesin a national register of patients Neuroepidemiology 200728150ndash154

12 Haesebaert J Termoz A Polazzi S et al Can hospital discharge databases be used tofollow ischemic stroke incidence Stroke 2013441770ndash1774

13 Wright FL Green J Canoy D et al Vascular disease in women comparison ofdiagnoses in hospital episode statistics and general practice records in England BMCMed Res Methodol 201212161

14 Gaist D Wallander MA Gonzalez-Perez A Garcia-Rodriguez L Incidence of hem-orrhagic stroke in the general population validation of data from the Health Im-provement Network Pharmacoepidemiol Drug Saf 201322176ndash182

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e707

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology

Page 4: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

physicianrsquos review of all relevant parts of the complete EPR willgive a more accurate picture of the true diagnosis and so of thecode accuracy which we refer to as the overall accuracy Weassessed administrative accuracy by having a primary carephysician in our team review the EPR-derived vignettes formentions of stroke diagnoses We assessed overall accuracy byhaving expert adjudicators (consultant level stroke specialists)review the EPR-derived vignettes and derive reference standarddiagnoses based on all available information and their clinicalinterpretation of this information (see Process of assigning anexpert diagnosis) We then assessed whether and by howmuch the administrative vs overall accuracy (PPV) differed forour selected stroke codes

We performed all statistical analyses in R (r-projectorg)

Further exploration of false-positive codesWe compared characteristics of participants with a true- vsa false-positive code and reviewed the vignettes of false-positive cases considering their alternative diagnoses andassessing possible reasons for them being assigned a strokecode Informed by these assessments we evaluated alternativecode combinations that we hypothesized may improve codeaccuracy

Stroke type and subtype distributionsWe further assessed the proportion of true-positive cases thatcould be assigned a pathologic stroke type (ischemic strokeICH SAH) and more detailed subtype by the stroke specialistadjudicator and their frequency distributions

Standard protocol approvals registrationsand patient consentsAll procedures performed in studies involving human partic-ipants were in accordance with the ethical standards of theinstitutional andor national research committee and with the1964 Helsinki declaration and its later amendments or com-parable ethical standards As part of the UKB recruitmentprocess informed consent was obtained from all individualparticipants included in the study

Data availabilityAnonymized summary data based on this study not publishedwithin this article can be shared (subject to approval fromUKB) on request from any qualified investigator unless lim-ited by ethical or legal restrictions Any bona fide researchercan access the UKB resource for research that aims to benefitthe publicrsquos health (see ukbiobankacuk)

ResultsStudy populationIn the subpopulation of 17249 participants 232 had a relevantstroke code occurring after recruitment to UKB Of these 225(97) had available information in their EPR to create a vi-gnette and were included in the study analyses (figure 1)

Participantsrsquo characteristicsOf the 225 stroke-coded cases 111 (49) were female Me-dian age at recruitment to UKB was 63 years (range 41ndash70

Figure 1 Selection of included UK Biobank (UKB) participants

GP = general practitioner NHS = National Health Service

e700 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

years) and at time of the stroke code was 67 years (range45ndash76 years) (table 1)

Code sources and typesOf the 225 stroke-coded cases 67 (30) received a code fromhospital admission data only 87 (39) from primary caredata only 64 (28) from both hospital admission and pri-mary care data 6 (3) from death records only and one fromboth hospital admission data and death records (table 2 andfigure e-2 dryadorg105061dryadw9ghx3fk0) For thelarger subset of all UKB participants with linkage to the rel-evant data sources these proportions were similar for Scot-land but a higher proportion of participants in England andWales had hospital-only codes (table 2)

As regards stroke types 131225 (58) of stroke-coded caseshad a stroke type-specific code (ischemic ICH or SAH)while the remaining 94 (42) had unspecified stroke codesThe proportion of cases with unspecified stroke type codeswas higher among those ascertained from primary care (105151 [70]) than from hospital admission data (38132[29]) and all death record codes were stroke-type specific(for these code-source level estimates a case would becounted ge1 if occurring in ge1 source) (table 3 and figure e-2dryadorg105061dryadw9ghx3fk0) Among the largersubset of all UKB participants with relevant linked data theproportion of cases with an unspecified stroke code wassimilar for Scotland but lower for England and Wales (19and 30 respectively) (table 3)

Table 1 Participant characteristics

All coded cases (n = 225) True-positive cases (n = 178) False-positive cases (n = 47)

Female 49 (111225) 47 (84178) 57 (2747)

Recruitment age y 63 (41ndash70) 63 (41ndash70) 64 (46ndash69)

Age at code y 67 (45ndash76) 67 (45ndash76) 67 (52ndash76)

Primary care code only 33 (75225) 26 (47178) 60 (2847)

Unspecified stroke code 42 (86225) 38 (68178) 38 (1847)

Hemorrhagic stroke code 22 (49225) 20 (35178) 30 (1447)

Code date lt2011 34 (76225) 33 (58178) 38 (1847)

Dead by end of follow-up (any cause) 8 (18225) 7 (13178) 11 (547)

Values are (n) or median (range)

Table 2 Code sources in the validation study subpopulation and in the larger subset of all UK Biobank (UKB) participantslinked to relevant health care datasets in Scotland England and Wales

UKB participants ScotlandLothian (n = 17249)

UKB participantsScotland alla (n = 31426)

UKB participantsEngland alla (n = 18494)

UKB participantsWales alla (n = 21346)

Total stroke codes 225 553 197 301

Hospital only codes 30 (67225) 27 (147553) 42 (82197) 43 (129301)

Primary care only codes 39 (87225) 44 (241553) 29 (58197) 39 (116301)

Death record only codes 3 (6225) 1 (7553) 0 (0197) 3 (10301)

Hospital and primary carecodes

28 (64225) 28 (156553) 29 (57197) 14 (43301)

Hospital and death recordcodes

0 (1225) 0 (2553) 0 (0197) 1 (2301)

Primary care and deathrecord codes

0 (0225) 0 (0553) 0 (0197) 0 (0301)

Hospital and primary careand death record codes

0 (0225) 0 (0553) 0 (0197) 0 (1301)

Values are (n)a We included UKB participants with linkage to hospital admissions data death records and primary care data (whereas around half of UKB participants arelinked to primary care data using Read V2 or V3 coding systems we included only practices using the Read V2 coding system for ease of comparison)

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e701

Interadjudicator agreementAll vignettes were independently reviewed by 2 adjudicators Forassigning a diagnosis of stroke vs not interadjudicator agreementwas 91 and Cohen kappa very good at 07 (95 CI 06ndash08)for strokeTIA vs not agreementwas 93 andCohen kappa verygood at 07 (95 CI 06ndash09) Agreement for assigning a stroketype (ischemic stroke vs ICH vs SAH vs uncertain stroke type)was 99 and Cohen kappa excellent at 098 (95 CI 09ndash1)

Code accuracyThe overall PPV of case ascertainment for all 225 stroke-coded cases from all sources combined was 79 (95 CI73ndash84) When broken down by code source PPV washighest for the 132 cases with hospital admission codes (8995 CI 82ndash94) and higher still when limiting analyses toprimary position hospital codes only (94 95 CI88ndash98) at the expense of losing a few (lt10) true-positive cases PPV for the 151 cases with primary care codeswas 80 (95 CI 72ndash86) Only 7 cases had death recordcodes with PPV 57 (95 CI 18ndash90) the wide CIsindicating limited precision of this estimate For cases withboth a hospital admission and a primary care code PPV wasvery high (97 95 CI 89ndash100) but only a third of thecases fell into this category (figure 2A)

As regards the accuracy of identifying stroke pathologic type(for these estimates a case would be counted in ge1 analysis if ithad ge1 unique stroke type code) PPV among the 88 partic-ipants with an ischemic stroke code was 83 (95 CI74ndash90) restricting to primary position codes (for thosecases from hospital admission data) increased this to 86 (95CI 76ndash93) with loss of lt10 of true-positive cases (figure2B) Given that ischemic stroke is the most common patho-logic stroke type and hence an unspecified stroke code is muchmore likely to signify an ischemic rather than a hemorrhagic

stroke case we calculated the PPV for ischemic stroke of the184 ischemic and unspecified stroke codes combined Whilethis resulted in a slightly lower PPV of 80 (95 CI73ndash85) it approximately doubled the number of true-positive ischemic stroke cases identified The proportion oftrue-positive ischemic stroke cases among all cases with anunspecified code was 77 Accuracy of hemorrhagic strokecodes was lower but the small numbers of coded cases resultedin limited precision PPV among 26 participants with an ICHcode was 42 (95 CI 23ndash63) and among 24 participantswith a SAH code was 71 (95 CI 49ndash87) Restricting toprimary position codes increased the PPV of both ICH andSAH codes without losing any true-positive cases

PPV when measuring only administrative accuracy was 93 de-creasing to 79 when measuring overall accuracy This demon-strates the importance of expert-led adjudication in validationstudies as ignoring it would lead to falsely inflated PPVs (figure 3)

Further exploration of false-positive codesThe 47 false-positive stroke-coded cases had a similar pro-portion of unspecified codes and median age (at date of re-cruitment or first stroke code) as true-positives False-positivecases were more likely to have a primary care code and to havedied by the end of follow-up and were slightly more likely tobe female to have a hemorrhagic stroke code and to have anearlier first stroke code date (table 1)

Alternative diagnoses for the 47 false-positives included TIA(1447 30) secondary intracranial bleed not due to SAH orICH (eg traumatic tumor-related bleeding into a cerebralinfarct) (1147 [23]) radiologic finding of a suspected oldvascular lesion (usually incidental) (847 [17]) and pos-sible or probable alternative diagnosis (eg migraine de-myelination seizure or another diagnosis) (1447 [30])

Table 3 Proportion of cases with unspecified vs specified stroke type codes in the validation study subpopulation and inthe larger subset of all UKBiobank (UKB) participants linked to relevant health care datasets in Scotland Englandand Wales

UKB participants ScotlandLothian (n = 17249)

UKB participantsScotland alla (n = 31426)

UKB participants Englandalla (n = 18494)

UKB participants Walesalla (n = 21346)

Total stroke codes 225 553 197 301

Unspecified codes inhospital data

29 (38132) 27 (81305) 7 (10139) 17 (30175)

Unspecified codes inprimary care

70 (105151) 64 (254397) 45 (52115) 50 (80160)

Unspecified codes indeath records

0 (07) 11 (19) na 54 (713)

Unspecified codestotalb

42 (94225) 40 (219553) 19 (37197) 30 (89301)

Values are (n)a We included UKB participants with linkage to hospital admissions data death records and primary care data (whereas around half of UKB participants arelinked to primary care data using Read V2 or V3 coding systems we included only practices using the Read V2 coding system for ease of comparison)b Stroke-coded cases with an unspecified stroke type code (note that if a case has codes from multiple sources including a specified stroke type code inhospital data and an unspecified stroke type code in primary care data or death record data they will count as having a specified code)

e702 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

Among the 15 participants with a false-positive ICH code 715had another primary stroke type plusmn secondary ICH 415intratumor bleed 315 traumatic or other intracranial bleedand 115 no obvious reason for an ICH code Among the 7participants with a false-positive SAH code 47 were traumaticintracranial bleed 27 another primary stroke type plusmn secondarySAH component and 17 asymptomatic aneurysm

Because performance for hemorrhagic stroke codes wassuboptimal having observed that the more common alter-native diagnoses were traumatic and intratumor bleeds orstroke complicated by a bleed we performed additionalanalyses to explore the effects of different code inclusioncriteria on numbers of cases numbers of true-positives andPPV for ICH and SAH For both ICH and SAH we couldincrease PPV but at the expense of failing to detect some true-positive cases However numbers of cases were too small todraw firm conclusions (figure 4)

Stroke type and subtype distributionsWith information from the EPR a stroke specialist was able toassign a pathologic stroke type to 177 of the 178 true-positivestroke cases 149 (84) were ischemic 11 (6) ICH and 17(10) SAH Depending on the subclassification system usedan ischemic subtype could be determined for 37ndash87 ofcases an ICH subtype for 27ndash100 of cases and SAHsubtype for 94 of cases (figure e-3 dryadorg105061dryadw9ghx3fk0)

DiscussionOur results suggest that stroke cases and cases of ischemicstroke type can be ascertained in UKB through linked codeddata with sufficient accuracy for use in many genetic andepidemiologic research studies without further expert vali-dation since PPVs were generally at least 80 despite

Figure 2 Positive predictive values (PPVs) of stroke codes

PPVs of stroke codes stratified by code source (A) and code type (B) Primary position includes primary care codes where no code position is specified andonly primary position hospital admission codes CI = confidence interval ICH = intracerebral hemorrhage SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e703

stringent adjudication criteria Primary care is an importantcode source with ge13 of the cases ascertained exclusivelyvia primary care data in this UK setting Code accuracyappeared slightly better for hospital admission compared toprimary care codes There were insufficient data to draw firmconclusions regarding the accuracy of death record orhemorrhagic stroke type codes Including only the primaryposition codes from hospital admission data increased PPVwithout loss of a substantial proportion of true-positivecases In the subpopulation of Scotland studied only around60 of codes were specific for a stroke type however usingall available relevant medical information from the EPR anexpert adjudicator could assign a stroke type to 99 anda more detailed stroke subtype to between a third and almost90 of cases depending on the subclassification systemused A higher proportion of participants had a specifiedstroke type code in the English and Welsh data but theaccuracy of these codes needs further investigation We alsodemonstrated that at least in this setting considering onlythe administrative accuracy of stroke diagnosis codes (ratherthan overall accuracy based on an expert review of the fullEPR) may give falsely inflated PPVs

Acceptable levels of accuracy and the relative importanceof different accuracy metrics depend on the context8 UKBis primarily used for research into the genetic and non-genetic determinants of disease3 In such analyses it isimportant to ensure that a high proportion of participantsidentified as disease cases truly do have the disease (high

PPV) to minimize bias in effect estimates while aiming tooptimize statistical power by ascertaining as many true-positive cases as possible (optimizing sensitivity but notnecessarily maximizing it as this may compromise thePPV) A high specificity (the proportion of participantswithout the disease that do not receive a stroke code) iscrucial in obtaining a high PPV but is not sufficient in andof itself In population-based prospective cohorts where theproportion of all participants who are true-positives forthe disease (in this case stroke) is generally low the pro-portion of all participants incorrectly classified as havingstroke (false-positives) will generally be low (giving highspecificity) even if the absolute number of false-positives ishigh compared to the absolute number of true-positives(low PPV)8 Providing appropriate codes are used boththe specificity and negative predictive value of routinelycollected health care data to identify disease casesin population-based studies are usually very high(96ndash100)4 For these reasons we focused on estimatingthe PPV of using routinely collected health care data toidentify stroke cases in UKB and on assessing the effects ofdifferent code and source selections on both PPV andnumber of true-positive cases

Strengths of this study include (1) the overall large number ofparticipants (2) inclusion of primary care and death recordcodes for which accuracy data have previously been limited(3) creation of vignettes using preset criteria from the EPR for97 of otherwise eligible stroke-coded cases thus avoiding

Figure 3 Assessing administrative vs overall accuracy

High clinical certainty mentions of stroke ldquostrokerdquo ldquoprobable strokerdquo ldquopresumptive strokerdquo ldquoconsistent with strokerdquo ldquocompatible with strokerdquo ldquolikely strokerdquoldquotreated as strokerdquo or equivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Medium clinicalcertainty mentions of stroke including above plus ldquopossible strokerdquo ldquosuspected strokerdquo ldquoimpression of strokerdquo ldquosuggestive of strokerdquo ldquoquery strokerdquo orequivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Low clinical certainty mentions of strokeinclude above plus ldquoTIArdquo and ldquotransient ischaemic attackrdquo with any level of certainty preceding it The hierarchy of clinical certainty was based on the ICD-10clinical coding instruction manual (isdscotlandorgProducts-and-ServicesTerminology-ServicesClinical-Coding- Guidelines April 2010 version) which isused by the coding departments in UK hospitals CI = confidence interval PPV = positive predictive value

e704 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

the bias of selecting participants with a higher prior proba-bility of having had a stroke (4) blinding of adjudicators tocodes and each otherrsquos diagnoses (5) assessment of inter-adjudicator reliability (6) further analysis of false-positivecodes and (7) assessment of both administrative and overallaccuracy demonstrating the importance of an expert-ledreference standard in validation studies

There are some limitations (1) small numbers precludedrobust conclusions about the accuracy of death record andhemorrhagic stroke codes (2) our lack of access to the fullfree text EPRs held exclusively by the primary care systemmay have led to slightly more conservative PPV estimates(although we would not expect this to be a major issue forstroke) as almost all stroke cases have acute inpatient oroutpatient management in secondary care (in keeping withthis is that we were able to find secondary care records for97 of the cases) and information provided from primaryto secondary care was available in the hospital EPR (3) thepotential difficulty in differentiating between definite false-positive and uncertain cases from retrospective review ofmedical records which may have further reduced the es-timated PPVs and (4) the restriction of the current studyto a Scottish subpopulation of the UKB participants (al-though we have no reason to suspect substantial variationin the results considering that the code generation processacross the United Kingdom follows similar pathways andgiven the broadly similar distribution of code sources in

the wider UKB population across England Scotland andWales)

The present study is limited to validating the accuracy of codesfor symptomatic stroke (as defined by the WHO) Subclinicalcerebrovascular disease (such as that detected by brain imaging)which is more common particularly in older people9 would notbe readily ascertained by the linked coded health care datasources used to follow the health of UKB participants but can bedetected through brain imaging conducted as part of the UKBmultimodal imaging study of 100000 of its participants

These results are largely consistent with those from our earliersystematic review4 where PPVs for stroke-specific ICD-10hospital admission codes ranged from 79 to 831011 andfor ischemic stroke from 86 to 88101213 Our results forICH and SAH code accuracy among a small number of par-ticipants were worse than those from the 2 previous UnitedKingdomndashbased studies of ICD-10 or Read codes1314 Thismay be in part because these studies checked only (or mainly)for administrative accuracy hence inflating the PPVs Ourresults were similar to a Danish study which also used anexpert review of medical records to check for overall accu-racy10 To our knowledge there are no previous validationstudies of stroke-specific UK primary care Read codes ordeath record codes for diagnosis of all types of stroke nor ofthe effect of combining primary care and hospital admissioncoded data

Figure 4 Exploratory analyses to improve accuracy of hemorrhagic stroke codes

Excluding other stroke code sameday excluded cases with a diagnostic code for gt1 stroke pathologic type on the sameday This was done because a patientwho has one pathologic stroke type (eg ischemic stroke) can sometimes develop a complication and subsequent brain scan appearances similar to anotherpathologic stroke type (eg a patient with ischemic stroke can have a bleed in the brain as a result of the ischemic stroke which could lead to a false diagnosisof an intracerebral hemorrhage [ICH]) CI = confidence interval PPV = positive predictive value SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e705

Further work will be required to further explore the accuracyof hemorrhagic stroke and death record codes in largernumbers of cases by expanding this work to other UKB re-cruitment locations which will also enable assessment of thegeneralizability of these results to other regions of the UnitedKingdom In our Scottish subpopulation while over one-thirdof the codes were not specific for a stroke type expert adju-dication allowed a stroke type to be assigned for almost allcases Future studies should investigate ways of automatingthis process as well as methods for determining more detailedstroke subtypes

Our results suggest that stroke and ischemic stroke cases inUKB can be identified through linked coded data withsufficient accuracy for many genetic and epidemiologicstudies while there are currently insufficient data to drawdefinite conclusions about the accuracy of hemorrhagicstroke codes

AcknowledgmentThe authors thank the UK Biobank participants UK Biobankscientific project and data management teams in OxfordStockport and Edinburgh and the UK Biobank strokeoutcomes working group (Oxford University Valerie BeralYiping Chen Zhengming Chen Jane Green Mary KrollSarah Lewington Peter Rothwell Lucy Wright EdinburghUniversity Martin Dennis Joanna Wardlaw Sarah Wild)This research was awarded the University of EdinburghArthur Fonville prize

Study fundingThis work was supported by salary from UK Biobank andHealth Data Research UK (HDRUK) fellowship (KR)salary from Dementias Platform UK (DPUK) clinical fel-lowship (KB) salary from Medical Research Council Clin-ical Research Training Fellowship (MRP0018231) (TW)salary from UK Biobank and DPUK (RF CS) salary fromUK Biobank (KW RW QZ) and salary from UK Bio-bank Scottish Funding Council DPUK and HDRUK(CLMS) (Chief Scientist for UK Biobank) Project man-agement and administration funding provided by UKBiobank

DisclosureK Rannikmae K Ngoh K Bush R Salman F Doubal RFlaig D Henshall A Hutchison J Nolan S Osborne NSamarasekera C Schnier W Whiteley T Wilkinson KWilson R Woodfield and Q Zhang report no disclosuresrelevant to the manuscript N Allen was senior epidemiologistwhen this study was conducted C Sudlow was chief scientistfor the UK Biobank study when this study was conducted Goto NeurologyorgN for full disclosures

Publication historyReceived by Neurology July 4 2019 Accepted in final formJanuary 27 2020

Appendix Authors

Name Location Contribution

KristiinaRannikmaePhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition analysis andinterpretation of data

Kenneth Ngoh UniversityofEdinburgh

Substantial contributions toacquisition analysis andinterpretation of data

Kathryn Bush UniversityofEdinburgh

Substantial contributions to theanalysis and interpretation of data

Rustam Al-Shahi SalmanPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Fergus DoubalPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Robin Flaig UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

David EHenshall

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

AidanHutchison

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

John Nolan UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Scott Osborne UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

NeshikaSamarasekeraPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

ChristianSchnier PhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Will WhiteleyPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Tim Wilkinson UniversityofEdinburgh

Substantial contributions to theconception and design of the work andinterpretation of data

Kirsty Wilson UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

RebeccaWoodfield PhD

UniversityofEdinburgh

Substantial contributions to theconception of the work andinterpretation of data

Qiuli ZhangPhD

UniversityofEdinburgh

Substantial contributions to theconception and design acquisitionanalysis and interpretation of data

Naomi AllenPhD

Universityof Oxford

Substantial contributions to theacquisition and interpretation of data

Cathie LMSudlow PhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition and interpretation of data

e706 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

References1 Lozano R Naghavi M Foreman K et al Global and regional mortality from 235

causes of death for 20 age groups in 1990 and 2010 a systematic analysis for theGlobal Burden of Disease Study 2010 Lancet 20123802095ndash2128

2 Burton PR Hansell AL Fortier I et al Size matters just how big is BIG quantifyingrealistic sample size requirements for human genome epidemiology Int J Epidemiol2009263ndash273

3 Sudlow C Gallacher J Allen N et al UK Biobank an open access resource foridentifying the causes of a wide range of complex diseases of middle and old age PLoSMed 201512e1001779

4 Woodfield R Grant I UK Biobank Stroke Outcomes Group UK Biobank Follow-Upand OutcomesWorking Group Sudlow CL Accuracy of electronic health record datafor identifying stroke cases in large-scale epidemiological studies a systematic reviewfrom the UK Biobank stroke outcomes group PLoS One 201510e0140533

5 WHO MONICA Project Investigators The World Health Organization MONICAproject (Monitoring Trends and Determinants in Cardiovascular Disease) J ClinEpidemiol 198841105ndash114

6 Ford DV Jones KH Verplancke JP et al The SAIL Databank building a nationalarchitecture for e-health research and evaluation BMC Health Serv Res 20099157

7 Viera AJ Garrett JM Understanding interobserver agreement the kappa statisticFam Med 200537360ndash363

8 Chubak J Pocobelli G Weiss NS Tradeoffs between accuracy measures for electronichealth care data algorithms J Clin Epidemiol 201265343ndash349e2

9 KirkmanMAMahattanakul W Gregson BAMendelow AD The accuracy of hospitaldischarge coding for hemorrhagic stroke Acta Neurol Belg 2009109114ndash119

10 Johnsen SP Overvad K Soslashrensen HT Tjoslashnneland A Husted SE Predictive value ofstroke and transient ischemic attack discharge diagnoses in the Danish NationalRegistry of Patients J Clin Epidemiol 200255602ndash607

11 Krarup LH Boysen G Janjua H Prescott E Truelsen T Validity of stroke diagnosesin a national register of patients Neuroepidemiology 200728150ndash154

12 Haesebaert J Termoz A Polazzi S et al Can hospital discharge databases be used tofollow ischemic stroke incidence Stroke 2013441770ndash1774

13 Wright FL Green J Canoy D et al Vascular disease in women comparison ofdiagnoses in hospital episode statistics and general practice records in England BMCMed Res Methodol 201212161

14 Gaist D Wallander MA Gonzalez-Perez A Garcia-Rodriguez L Incidence of hem-orrhagic stroke in the general population validation of data from the Health Im-provement Network Pharmacoepidemiol Drug Saf 201322176ndash182

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e707

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology

Page 5: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

years) and at time of the stroke code was 67 years (range45ndash76 years) (table 1)

Code sources and typesOf the 225 stroke-coded cases 67 (30) received a code fromhospital admission data only 87 (39) from primary caredata only 64 (28) from both hospital admission and pri-mary care data 6 (3) from death records only and one fromboth hospital admission data and death records (table 2 andfigure e-2 dryadorg105061dryadw9ghx3fk0) For thelarger subset of all UKB participants with linkage to the rel-evant data sources these proportions were similar for Scot-land but a higher proportion of participants in England andWales had hospital-only codes (table 2)

As regards stroke types 131225 (58) of stroke-coded caseshad a stroke type-specific code (ischemic ICH or SAH)while the remaining 94 (42) had unspecified stroke codesThe proportion of cases with unspecified stroke type codeswas higher among those ascertained from primary care (105151 [70]) than from hospital admission data (38132[29]) and all death record codes were stroke-type specific(for these code-source level estimates a case would becounted ge1 if occurring in ge1 source) (table 3 and figure e-2dryadorg105061dryadw9ghx3fk0) Among the largersubset of all UKB participants with relevant linked data theproportion of cases with an unspecified stroke code wassimilar for Scotland but lower for England and Wales (19and 30 respectively) (table 3)

Table 1 Participant characteristics

All coded cases (n = 225) True-positive cases (n = 178) False-positive cases (n = 47)

Female 49 (111225) 47 (84178) 57 (2747)

Recruitment age y 63 (41ndash70) 63 (41ndash70) 64 (46ndash69)

Age at code y 67 (45ndash76) 67 (45ndash76) 67 (52ndash76)

Primary care code only 33 (75225) 26 (47178) 60 (2847)

Unspecified stroke code 42 (86225) 38 (68178) 38 (1847)

Hemorrhagic stroke code 22 (49225) 20 (35178) 30 (1447)

Code date lt2011 34 (76225) 33 (58178) 38 (1847)

Dead by end of follow-up (any cause) 8 (18225) 7 (13178) 11 (547)

Values are (n) or median (range)

Table 2 Code sources in the validation study subpopulation and in the larger subset of all UK Biobank (UKB) participantslinked to relevant health care datasets in Scotland England and Wales

UKB participants ScotlandLothian (n = 17249)

UKB participantsScotland alla (n = 31426)

UKB participantsEngland alla (n = 18494)

UKB participantsWales alla (n = 21346)

Total stroke codes 225 553 197 301

Hospital only codes 30 (67225) 27 (147553) 42 (82197) 43 (129301)

Primary care only codes 39 (87225) 44 (241553) 29 (58197) 39 (116301)

Death record only codes 3 (6225) 1 (7553) 0 (0197) 3 (10301)

Hospital and primary carecodes

28 (64225) 28 (156553) 29 (57197) 14 (43301)

Hospital and death recordcodes

0 (1225) 0 (2553) 0 (0197) 1 (2301)

Primary care and deathrecord codes

0 (0225) 0 (0553) 0 (0197) 0 (0301)

Hospital and primary careand death record codes

0 (0225) 0 (0553) 0 (0197) 0 (1301)

Values are (n)a We included UKB participants with linkage to hospital admissions data death records and primary care data (whereas around half of UKB participants arelinked to primary care data using Read V2 or V3 coding systems we included only practices using the Read V2 coding system for ease of comparison)

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e701

Interadjudicator agreementAll vignettes were independently reviewed by 2 adjudicators Forassigning a diagnosis of stroke vs not interadjudicator agreementwas 91 and Cohen kappa very good at 07 (95 CI 06ndash08)for strokeTIA vs not agreementwas 93 andCohen kappa verygood at 07 (95 CI 06ndash09) Agreement for assigning a stroketype (ischemic stroke vs ICH vs SAH vs uncertain stroke type)was 99 and Cohen kappa excellent at 098 (95 CI 09ndash1)

Code accuracyThe overall PPV of case ascertainment for all 225 stroke-coded cases from all sources combined was 79 (95 CI73ndash84) When broken down by code source PPV washighest for the 132 cases with hospital admission codes (8995 CI 82ndash94) and higher still when limiting analyses toprimary position hospital codes only (94 95 CI88ndash98) at the expense of losing a few (lt10) true-positive cases PPV for the 151 cases with primary care codeswas 80 (95 CI 72ndash86) Only 7 cases had death recordcodes with PPV 57 (95 CI 18ndash90) the wide CIsindicating limited precision of this estimate For cases withboth a hospital admission and a primary care code PPV wasvery high (97 95 CI 89ndash100) but only a third of thecases fell into this category (figure 2A)

As regards the accuracy of identifying stroke pathologic type(for these estimates a case would be counted in ge1 analysis if ithad ge1 unique stroke type code) PPV among the 88 partic-ipants with an ischemic stroke code was 83 (95 CI74ndash90) restricting to primary position codes (for thosecases from hospital admission data) increased this to 86 (95CI 76ndash93) with loss of lt10 of true-positive cases (figure2B) Given that ischemic stroke is the most common patho-logic stroke type and hence an unspecified stroke code is muchmore likely to signify an ischemic rather than a hemorrhagic

stroke case we calculated the PPV for ischemic stroke of the184 ischemic and unspecified stroke codes combined Whilethis resulted in a slightly lower PPV of 80 (95 CI73ndash85) it approximately doubled the number of true-positive ischemic stroke cases identified The proportion oftrue-positive ischemic stroke cases among all cases with anunspecified code was 77 Accuracy of hemorrhagic strokecodes was lower but the small numbers of coded cases resultedin limited precision PPV among 26 participants with an ICHcode was 42 (95 CI 23ndash63) and among 24 participantswith a SAH code was 71 (95 CI 49ndash87) Restricting toprimary position codes increased the PPV of both ICH andSAH codes without losing any true-positive cases

PPV when measuring only administrative accuracy was 93 de-creasing to 79 when measuring overall accuracy This demon-strates the importance of expert-led adjudication in validationstudies as ignoring it would lead to falsely inflated PPVs (figure 3)

Further exploration of false-positive codesThe 47 false-positive stroke-coded cases had a similar pro-portion of unspecified codes and median age (at date of re-cruitment or first stroke code) as true-positives False-positivecases were more likely to have a primary care code and to havedied by the end of follow-up and were slightly more likely tobe female to have a hemorrhagic stroke code and to have anearlier first stroke code date (table 1)

Alternative diagnoses for the 47 false-positives included TIA(1447 30) secondary intracranial bleed not due to SAH orICH (eg traumatic tumor-related bleeding into a cerebralinfarct) (1147 [23]) radiologic finding of a suspected oldvascular lesion (usually incidental) (847 [17]) and pos-sible or probable alternative diagnosis (eg migraine de-myelination seizure or another diagnosis) (1447 [30])

Table 3 Proportion of cases with unspecified vs specified stroke type codes in the validation study subpopulation and inthe larger subset of all UKBiobank (UKB) participants linked to relevant health care datasets in Scotland Englandand Wales

UKB participants ScotlandLothian (n = 17249)

UKB participantsScotland alla (n = 31426)

UKB participants Englandalla (n = 18494)

UKB participants Walesalla (n = 21346)

Total stroke codes 225 553 197 301

Unspecified codes inhospital data

29 (38132) 27 (81305) 7 (10139) 17 (30175)

Unspecified codes inprimary care

70 (105151) 64 (254397) 45 (52115) 50 (80160)

Unspecified codes indeath records

0 (07) 11 (19) na 54 (713)

Unspecified codestotalb

42 (94225) 40 (219553) 19 (37197) 30 (89301)

Values are (n)a We included UKB participants with linkage to hospital admissions data death records and primary care data (whereas around half of UKB participants arelinked to primary care data using Read V2 or V3 coding systems we included only practices using the Read V2 coding system for ease of comparison)b Stroke-coded cases with an unspecified stroke type code (note that if a case has codes from multiple sources including a specified stroke type code inhospital data and an unspecified stroke type code in primary care data or death record data they will count as having a specified code)

e702 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

Among the 15 participants with a false-positive ICH code 715had another primary stroke type plusmn secondary ICH 415intratumor bleed 315 traumatic or other intracranial bleedand 115 no obvious reason for an ICH code Among the 7participants with a false-positive SAH code 47 were traumaticintracranial bleed 27 another primary stroke type plusmn secondarySAH component and 17 asymptomatic aneurysm

Because performance for hemorrhagic stroke codes wassuboptimal having observed that the more common alter-native diagnoses were traumatic and intratumor bleeds orstroke complicated by a bleed we performed additionalanalyses to explore the effects of different code inclusioncriteria on numbers of cases numbers of true-positives andPPV for ICH and SAH For both ICH and SAH we couldincrease PPV but at the expense of failing to detect some true-positive cases However numbers of cases were too small todraw firm conclusions (figure 4)

Stroke type and subtype distributionsWith information from the EPR a stroke specialist was able toassign a pathologic stroke type to 177 of the 178 true-positivestroke cases 149 (84) were ischemic 11 (6) ICH and 17(10) SAH Depending on the subclassification system usedan ischemic subtype could be determined for 37ndash87 ofcases an ICH subtype for 27ndash100 of cases and SAHsubtype for 94 of cases (figure e-3 dryadorg105061dryadw9ghx3fk0)

DiscussionOur results suggest that stroke cases and cases of ischemicstroke type can be ascertained in UKB through linked codeddata with sufficient accuracy for use in many genetic andepidemiologic research studies without further expert vali-dation since PPVs were generally at least 80 despite

Figure 2 Positive predictive values (PPVs) of stroke codes

PPVs of stroke codes stratified by code source (A) and code type (B) Primary position includes primary care codes where no code position is specified andonly primary position hospital admission codes CI = confidence interval ICH = intracerebral hemorrhage SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e703

stringent adjudication criteria Primary care is an importantcode source with ge13 of the cases ascertained exclusivelyvia primary care data in this UK setting Code accuracyappeared slightly better for hospital admission compared toprimary care codes There were insufficient data to draw firmconclusions regarding the accuracy of death record orhemorrhagic stroke type codes Including only the primaryposition codes from hospital admission data increased PPVwithout loss of a substantial proportion of true-positivecases In the subpopulation of Scotland studied only around60 of codes were specific for a stroke type however usingall available relevant medical information from the EPR anexpert adjudicator could assign a stroke type to 99 anda more detailed stroke subtype to between a third and almost90 of cases depending on the subclassification systemused A higher proportion of participants had a specifiedstroke type code in the English and Welsh data but theaccuracy of these codes needs further investigation We alsodemonstrated that at least in this setting considering onlythe administrative accuracy of stroke diagnosis codes (ratherthan overall accuracy based on an expert review of the fullEPR) may give falsely inflated PPVs

Acceptable levels of accuracy and the relative importanceof different accuracy metrics depend on the context8 UKBis primarily used for research into the genetic and non-genetic determinants of disease3 In such analyses it isimportant to ensure that a high proportion of participantsidentified as disease cases truly do have the disease (high

PPV) to minimize bias in effect estimates while aiming tooptimize statistical power by ascertaining as many true-positive cases as possible (optimizing sensitivity but notnecessarily maximizing it as this may compromise thePPV) A high specificity (the proportion of participantswithout the disease that do not receive a stroke code) iscrucial in obtaining a high PPV but is not sufficient in andof itself In population-based prospective cohorts where theproportion of all participants who are true-positives forthe disease (in this case stroke) is generally low the pro-portion of all participants incorrectly classified as havingstroke (false-positives) will generally be low (giving highspecificity) even if the absolute number of false-positives ishigh compared to the absolute number of true-positives(low PPV)8 Providing appropriate codes are used boththe specificity and negative predictive value of routinelycollected health care data to identify disease casesin population-based studies are usually very high(96ndash100)4 For these reasons we focused on estimatingthe PPV of using routinely collected health care data toidentify stroke cases in UKB and on assessing the effects ofdifferent code and source selections on both PPV andnumber of true-positive cases

Strengths of this study include (1) the overall large number ofparticipants (2) inclusion of primary care and death recordcodes for which accuracy data have previously been limited(3) creation of vignettes using preset criteria from the EPR for97 of otherwise eligible stroke-coded cases thus avoiding

Figure 3 Assessing administrative vs overall accuracy

High clinical certainty mentions of stroke ldquostrokerdquo ldquoprobable strokerdquo ldquopresumptive strokerdquo ldquoconsistent with strokerdquo ldquocompatible with strokerdquo ldquolikely strokerdquoldquotreated as strokerdquo or equivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Medium clinicalcertainty mentions of stroke including above plus ldquopossible strokerdquo ldquosuspected strokerdquo ldquoimpression of strokerdquo ldquosuggestive of strokerdquo ldquoquery strokerdquo orequivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Low clinical certainty mentions of strokeinclude above plus ldquoTIArdquo and ldquotransient ischaemic attackrdquo with any level of certainty preceding it The hierarchy of clinical certainty was based on the ICD-10clinical coding instruction manual (isdscotlandorgProducts-and-ServicesTerminology-ServicesClinical-Coding- Guidelines April 2010 version) which isused by the coding departments in UK hospitals CI = confidence interval PPV = positive predictive value

e704 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

the bias of selecting participants with a higher prior proba-bility of having had a stroke (4) blinding of adjudicators tocodes and each otherrsquos diagnoses (5) assessment of inter-adjudicator reliability (6) further analysis of false-positivecodes and (7) assessment of both administrative and overallaccuracy demonstrating the importance of an expert-ledreference standard in validation studies

There are some limitations (1) small numbers precludedrobust conclusions about the accuracy of death record andhemorrhagic stroke codes (2) our lack of access to the fullfree text EPRs held exclusively by the primary care systemmay have led to slightly more conservative PPV estimates(although we would not expect this to be a major issue forstroke) as almost all stroke cases have acute inpatient oroutpatient management in secondary care (in keeping withthis is that we were able to find secondary care records for97 of the cases) and information provided from primaryto secondary care was available in the hospital EPR (3) thepotential difficulty in differentiating between definite false-positive and uncertain cases from retrospective review ofmedical records which may have further reduced the es-timated PPVs and (4) the restriction of the current studyto a Scottish subpopulation of the UKB participants (al-though we have no reason to suspect substantial variationin the results considering that the code generation processacross the United Kingdom follows similar pathways andgiven the broadly similar distribution of code sources in

the wider UKB population across England Scotland andWales)

The present study is limited to validating the accuracy of codesfor symptomatic stroke (as defined by the WHO) Subclinicalcerebrovascular disease (such as that detected by brain imaging)which is more common particularly in older people9 would notbe readily ascertained by the linked coded health care datasources used to follow the health of UKB participants but can bedetected through brain imaging conducted as part of the UKBmultimodal imaging study of 100000 of its participants

These results are largely consistent with those from our earliersystematic review4 where PPVs for stroke-specific ICD-10hospital admission codes ranged from 79 to 831011 andfor ischemic stroke from 86 to 88101213 Our results forICH and SAH code accuracy among a small number of par-ticipants were worse than those from the 2 previous UnitedKingdomndashbased studies of ICD-10 or Read codes1314 Thismay be in part because these studies checked only (or mainly)for administrative accuracy hence inflating the PPVs Ourresults were similar to a Danish study which also used anexpert review of medical records to check for overall accu-racy10 To our knowledge there are no previous validationstudies of stroke-specific UK primary care Read codes ordeath record codes for diagnosis of all types of stroke nor ofthe effect of combining primary care and hospital admissioncoded data

Figure 4 Exploratory analyses to improve accuracy of hemorrhagic stroke codes

Excluding other stroke code sameday excluded cases with a diagnostic code for gt1 stroke pathologic type on the sameday This was done because a patientwho has one pathologic stroke type (eg ischemic stroke) can sometimes develop a complication and subsequent brain scan appearances similar to anotherpathologic stroke type (eg a patient with ischemic stroke can have a bleed in the brain as a result of the ischemic stroke which could lead to a false diagnosisof an intracerebral hemorrhage [ICH]) CI = confidence interval PPV = positive predictive value SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e705

Further work will be required to further explore the accuracyof hemorrhagic stroke and death record codes in largernumbers of cases by expanding this work to other UKB re-cruitment locations which will also enable assessment of thegeneralizability of these results to other regions of the UnitedKingdom In our Scottish subpopulation while over one-thirdof the codes were not specific for a stroke type expert adju-dication allowed a stroke type to be assigned for almost allcases Future studies should investigate ways of automatingthis process as well as methods for determining more detailedstroke subtypes

Our results suggest that stroke and ischemic stroke cases inUKB can be identified through linked coded data withsufficient accuracy for many genetic and epidemiologicstudies while there are currently insufficient data to drawdefinite conclusions about the accuracy of hemorrhagicstroke codes

AcknowledgmentThe authors thank the UK Biobank participants UK Biobankscientific project and data management teams in OxfordStockport and Edinburgh and the UK Biobank strokeoutcomes working group (Oxford University Valerie BeralYiping Chen Zhengming Chen Jane Green Mary KrollSarah Lewington Peter Rothwell Lucy Wright EdinburghUniversity Martin Dennis Joanna Wardlaw Sarah Wild)This research was awarded the University of EdinburghArthur Fonville prize

Study fundingThis work was supported by salary from UK Biobank andHealth Data Research UK (HDRUK) fellowship (KR)salary from Dementias Platform UK (DPUK) clinical fel-lowship (KB) salary from Medical Research Council Clin-ical Research Training Fellowship (MRP0018231) (TW)salary from UK Biobank and DPUK (RF CS) salary fromUK Biobank (KW RW QZ) and salary from UK Bio-bank Scottish Funding Council DPUK and HDRUK(CLMS) (Chief Scientist for UK Biobank) Project man-agement and administration funding provided by UKBiobank

DisclosureK Rannikmae K Ngoh K Bush R Salman F Doubal RFlaig D Henshall A Hutchison J Nolan S Osborne NSamarasekera C Schnier W Whiteley T Wilkinson KWilson R Woodfield and Q Zhang report no disclosuresrelevant to the manuscript N Allen was senior epidemiologistwhen this study was conducted C Sudlow was chief scientistfor the UK Biobank study when this study was conducted Goto NeurologyorgN for full disclosures

Publication historyReceived by Neurology July 4 2019 Accepted in final formJanuary 27 2020

Appendix Authors

Name Location Contribution

KristiinaRannikmaePhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition analysis andinterpretation of data

Kenneth Ngoh UniversityofEdinburgh

Substantial contributions toacquisition analysis andinterpretation of data

Kathryn Bush UniversityofEdinburgh

Substantial contributions to theanalysis and interpretation of data

Rustam Al-Shahi SalmanPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Fergus DoubalPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Robin Flaig UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

David EHenshall

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

AidanHutchison

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

John Nolan UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Scott Osborne UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

NeshikaSamarasekeraPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

ChristianSchnier PhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Will WhiteleyPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Tim Wilkinson UniversityofEdinburgh

Substantial contributions to theconception and design of the work andinterpretation of data

Kirsty Wilson UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

RebeccaWoodfield PhD

UniversityofEdinburgh

Substantial contributions to theconception of the work andinterpretation of data

Qiuli ZhangPhD

UniversityofEdinburgh

Substantial contributions to theconception and design acquisitionanalysis and interpretation of data

Naomi AllenPhD

Universityof Oxford

Substantial contributions to theacquisition and interpretation of data

Cathie LMSudlow PhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition and interpretation of data

e706 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

References1 Lozano R Naghavi M Foreman K et al Global and regional mortality from 235

causes of death for 20 age groups in 1990 and 2010 a systematic analysis for theGlobal Burden of Disease Study 2010 Lancet 20123802095ndash2128

2 Burton PR Hansell AL Fortier I et al Size matters just how big is BIG quantifyingrealistic sample size requirements for human genome epidemiology Int J Epidemiol2009263ndash273

3 Sudlow C Gallacher J Allen N et al UK Biobank an open access resource foridentifying the causes of a wide range of complex diseases of middle and old age PLoSMed 201512e1001779

4 Woodfield R Grant I UK Biobank Stroke Outcomes Group UK Biobank Follow-Upand OutcomesWorking Group Sudlow CL Accuracy of electronic health record datafor identifying stroke cases in large-scale epidemiological studies a systematic reviewfrom the UK Biobank stroke outcomes group PLoS One 201510e0140533

5 WHO MONICA Project Investigators The World Health Organization MONICAproject (Monitoring Trends and Determinants in Cardiovascular Disease) J ClinEpidemiol 198841105ndash114

6 Ford DV Jones KH Verplancke JP et al The SAIL Databank building a nationalarchitecture for e-health research and evaluation BMC Health Serv Res 20099157

7 Viera AJ Garrett JM Understanding interobserver agreement the kappa statisticFam Med 200537360ndash363

8 Chubak J Pocobelli G Weiss NS Tradeoffs between accuracy measures for electronichealth care data algorithms J Clin Epidemiol 201265343ndash349e2

9 KirkmanMAMahattanakul W Gregson BAMendelow AD The accuracy of hospitaldischarge coding for hemorrhagic stroke Acta Neurol Belg 2009109114ndash119

10 Johnsen SP Overvad K Soslashrensen HT Tjoslashnneland A Husted SE Predictive value ofstroke and transient ischemic attack discharge diagnoses in the Danish NationalRegistry of Patients J Clin Epidemiol 200255602ndash607

11 Krarup LH Boysen G Janjua H Prescott E Truelsen T Validity of stroke diagnosesin a national register of patients Neuroepidemiology 200728150ndash154

12 Haesebaert J Termoz A Polazzi S et al Can hospital discharge databases be used tofollow ischemic stroke incidence Stroke 2013441770ndash1774

13 Wright FL Green J Canoy D et al Vascular disease in women comparison ofdiagnoses in hospital episode statistics and general practice records in England BMCMed Res Methodol 201212161

14 Gaist D Wallander MA Gonzalez-Perez A Garcia-Rodriguez L Incidence of hem-orrhagic stroke in the general population validation of data from the Health Im-provement Network Pharmacoepidemiol Drug Saf 201322176ndash182

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e707

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology

Page 6: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

Interadjudicator agreementAll vignettes were independently reviewed by 2 adjudicators Forassigning a diagnosis of stroke vs not interadjudicator agreementwas 91 and Cohen kappa very good at 07 (95 CI 06ndash08)for strokeTIA vs not agreementwas 93 andCohen kappa verygood at 07 (95 CI 06ndash09) Agreement for assigning a stroketype (ischemic stroke vs ICH vs SAH vs uncertain stroke type)was 99 and Cohen kappa excellent at 098 (95 CI 09ndash1)

Code accuracyThe overall PPV of case ascertainment for all 225 stroke-coded cases from all sources combined was 79 (95 CI73ndash84) When broken down by code source PPV washighest for the 132 cases with hospital admission codes (8995 CI 82ndash94) and higher still when limiting analyses toprimary position hospital codes only (94 95 CI88ndash98) at the expense of losing a few (lt10) true-positive cases PPV for the 151 cases with primary care codeswas 80 (95 CI 72ndash86) Only 7 cases had death recordcodes with PPV 57 (95 CI 18ndash90) the wide CIsindicating limited precision of this estimate For cases withboth a hospital admission and a primary care code PPV wasvery high (97 95 CI 89ndash100) but only a third of thecases fell into this category (figure 2A)

As regards the accuracy of identifying stroke pathologic type(for these estimates a case would be counted in ge1 analysis if ithad ge1 unique stroke type code) PPV among the 88 partic-ipants with an ischemic stroke code was 83 (95 CI74ndash90) restricting to primary position codes (for thosecases from hospital admission data) increased this to 86 (95CI 76ndash93) with loss of lt10 of true-positive cases (figure2B) Given that ischemic stroke is the most common patho-logic stroke type and hence an unspecified stroke code is muchmore likely to signify an ischemic rather than a hemorrhagic

stroke case we calculated the PPV for ischemic stroke of the184 ischemic and unspecified stroke codes combined Whilethis resulted in a slightly lower PPV of 80 (95 CI73ndash85) it approximately doubled the number of true-positive ischemic stroke cases identified The proportion oftrue-positive ischemic stroke cases among all cases with anunspecified code was 77 Accuracy of hemorrhagic strokecodes was lower but the small numbers of coded cases resultedin limited precision PPV among 26 participants with an ICHcode was 42 (95 CI 23ndash63) and among 24 participantswith a SAH code was 71 (95 CI 49ndash87) Restricting toprimary position codes increased the PPV of both ICH andSAH codes without losing any true-positive cases

PPV when measuring only administrative accuracy was 93 de-creasing to 79 when measuring overall accuracy This demon-strates the importance of expert-led adjudication in validationstudies as ignoring it would lead to falsely inflated PPVs (figure 3)

Further exploration of false-positive codesThe 47 false-positive stroke-coded cases had a similar pro-portion of unspecified codes and median age (at date of re-cruitment or first stroke code) as true-positives False-positivecases were more likely to have a primary care code and to havedied by the end of follow-up and were slightly more likely tobe female to have a hemorrhagic stroke code and to have anearlier first stroke code date (table 1)

Alternative diagnoses for the 47 false-positives included TIA(1447 30) secondary intracranial bleed not due to SAH orICH (eg traumatic tumor-related bleeding into a cerebralinfarct) (1147 [23]) radiologic finding of a suspected oldvascular lesion (usually incidental) (847 [17]) and pos-sible or probable alternative diagnosis (eg migraine de-myelination seizure or another diagnosis) (1447 [30])

Table 3 Proportion of cases with unspecified vs specified stroke type codes in the validation study subpopulation and inthe larger subset of all UKBiobank (UKB) participants linked to relevant health care datasets in Scotland Englandand Wales

UKB participants ScotlandLothian (n = 17249)

UKB participantsScotland alla (n = 31426)

UKB participants Englandalla (n = 18494)

UKB participants Walesalla (n = 21346)

Total stroke codes 225 553 197 301

Unspecified codes inhospital data

29 (38132) 27 (81305) 7 (10139) 17 (30175)

Unspecified codes inprimary care

70 (105151) 64 (254397) 45 (52115) 50 (80160)

Unspecified codes indeath records

0 (07) 11 (19) na 54 (713)

Unspecified codestotalb

42 (94225) 40 (219553) 19 (37197) 30 (89301)

Values are (n)a We included UKB participants with linkage to hospital admissions data death records and primary care data (whereas around half of UKB participants arelinked to primary care data using Read V2 or V3 coding systems we included only practices using the Read V2 coding system for ease of comparison)b Stroke-coded cases with an unspecified stroke type code (note that if a case has codes from multiple sources including a specified stroke type code inhospital data and an unspecified stroke type code in primary care data or death record data they will count as having a specified code)

e702 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

Among the 15 participants with a false-positive ICH code 715had another primary stroke type plusmn secondary ICH 415intratumor bleed 315 traumatic or other intracranial bleedand 115 no obvious reason for an ICH code Among the 7participants with a false-positive SAH code 47 were traumaticintracranial bleed 27 another primary stroke type plusmn secondarySAH component and 17 asymptomatic aneurysm

Because performance for hemorrhagic stroke codes wassuboptimal having observed that the more common alter-native diagnoses were traumatic and intratumor bleeds orstroke complicated by a bleed we performed additionalanalyses to explore the effects of different code inclusioncriteria on numbers of cases numbers of true-positives andPPV for ICH and SAH For both ICH and SAH we couldincrease PPV but at the expense of failing to detect some true-positive cases However numbers of cases were too small todraw firm conclusions (figure 4)

Stroke type and subtype distributionsWith information from the EPR a stroke specialist was able toassign a pathologic stroke type to 177 of the 178 true-positivestroke cases 149 (84) were ischemic 11 (6) ICH and 17(10) SAH Depending on the subclassification system usedan ischemic subtype could be determined for 37ndash87 ofcases an ICH subtype for 27ndash100 of cases and SAHsubtype for 94 of cases (figure e-3 dryadorg105061dryadw9ghx3fk0)

DiscussionOur results suggest that stroke cases and cases of ischemicstroke type can be ascertained in UKB through linked codeddata with sufficient accuracy for use in many genetic andepidemiologic research studies without further expert vali-dation since PPVs were generally at least 80 despite

Figure 2 Positive predictive values (PPVs) of stroke codes

PPVs of stroke codes stratified by code source (A) and code type (B) Primary position includes primary care codes where no code position is specified andonly primary position hospital admission codes CI = confidence interval ICH = intracerebral hemorrhage SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e703

stringent adjudication criteria Primary care is an importantcode source with ge13 of the cases ascertained exclusivelyvia primary care data in this UK setting Code accuracyappeared slightly better for hospital admission compared toprimary care codes There were insufficient data to draw firmconclusions regarding the accuracy of death record orhemorrhagic stroke type codes Including only the primaryposition codes from hospital admission data increased PPVwithout loss of a substantial proportion of true-positivecases In the subpopulation of Scotland studied only around60 of codes were specific for a stroke type however usingall available relevant medical information from the EPR anexpert adjudicator could assign a stroke type to 99 anda more detailed stroke subtype to between a third and almost90 of cases depending on the subclassification systemused A higher proportion of participants had a specifiedstroke type code in the English and Welsh data but theaccuracy of these codes needs further investigation We alsodemonstrated that at least in this setting considering onlythe administrative accuracy of stroke diagnosis codes (ratherthan overall accuracy based on an expert review of the fullEPR) may give falsely inflated PPVs

Acceptable levels of accuracy and the relative importanceof different accuracy metrics depend on the context8 UKBis primarily used for research into the genetic and non-genetic determinants of disease3 In such analyses it isimportant to ensure that a high proportion of participantsidentified as disease cases truly do have the disease (high

PPV) to minimize bias in effect estimates while aiming tooptimize statistical power by ascertaining as many true-positive cases as possible (optimizing sensitivity but notnecessarily maximizing it as this may compromise thePPV) A high specificity (the proportion of participantswithout the disease that do not receive a stroke code) iscrucial in obtaining a high PPV but is not sufficient in andof itself In population-based prospective cohorts where theproportion of all participants who are true-positives forthe disease (in this case stroke) is generally low the pro-portion of all participants incorrectly classified as havingstroke (false-positives) will generally be low (giving highspecificity) even if the absolute number of false-positives ishigh compared to the absolute number of true-positives(low PPV)8 Providing appropriate codes are used boththe specificity and negative predictive value of routinelycollected health care data to identify disease casesin population-based studies are usually very high(96ndash100)4 For these reasons we focused on estimatingthe PPV of using routinely collected health care data toidentify stroke cases in UKB and on assessing the effects ofdifferent code and source selections on both PPV andnumber of true-positive cases

Strengths of this study include (1) the overall large number ofparticipants (2) inclusion of primary care and death recordcodes for which accuracy data have previously been limited(3) creation of vignettes using preset criteria from the EPR for97 of otherwise eligible stroke-coded cases thus avoiding

Figure 3 Assessing administrative vs overall accuracy

High clinical certainty mentions of stroke ldquostrokerdquo ldquoprobable strokerdquo ldquopresumptive strokerdquo ldquoconsistent with strokerdquo ldquocompatible with strokerdquo ldquolikely strokerdquoldquotreated as strokerdquo or equivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Medium clinicalcertainty mentions of stroke including above plus ldquopossible strokerdquo ldquosuspected strokerdquo ldquoimpression of strokerdquo ldquosuggestive of strokerdquo ldquoquery strokerdquo orequivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Low clinical certainty mentions of strokeinclude above plus ldquoTIArdquo and ldquotransient ischaemic attackrdquo with any level of certainty preceding it The hierarchy of clinical certainty was based on the ICD-10clinical coding instruction manual (isdscotlandorgProducts-and-ServicesTerminology-ServicesClinical-Coding- Guidelines April 2010 version) which isused by the coding departments in UK hospitals CI = confidence interval PPV = positive predictive value

e704 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

the bias of selecting participants with a higher prior proba-bility of having had a stroke (4) blinding of adjudicators tocodes and each otherrsquos diagnoses (5) assessment of inter-adjudicator reliability (6) further analysis of false-positivecodes and (7) assessment of both administrative and overallaccuracy demonstrating the importance of an expert-ledreference standard in validation studies

There are some limitations (1) small numbers precludedrobust conclusions about the accuracy of death record andhemorrhagic stroke codes (2) our lack of access to the fullfree text EPRs held exclusively by the primary care systemmay have led to slightly more conservative PPV estimates(although we would not expect this to be a major issue forstroke) as almost all stroke cases have acute inpatient oroutpatient management in secondary care (in keeping withthis is that we were able to find secondary care records for97 of the cases) and information provided from primaryto secondary care was available in the hospital EPR (3) thepotential difficulty in differentiating between definite false-positive and uncertain cases from retrospective review ofmedical records which may have further reduced the es-timated PPVs and (4) the restriction of the current studyto a Scottish subpopulation of the UKB participants (al-though we have no reason to suspect substantial variationin the results considering that the code generation processacross the United Kingdom follows similar pathways andgiven the broadly similar distribution of code sources in

the wider UKB population across England Scotland andWales)

The present study is limited to validating the accuracy of codesfor symptomatic stroke (as defined by the WHO) Subclinicalcerebrovascular disease (such as that detected by brain imaging)which is more common particularly in older people9 would notbe readily ascertained by the linked coded health care datasources used to follow the health of UKB participants but can bedetected through brain imaging conducted as part of the UKBmultimodal imaging study of 100000 of its participants

These results are largely consistent with those from our earliersystematic review4 where PPVs for stroke-specific ICD-10hospital admission codes ranged from 79 to 831011 andfor ischemic stroke from 86 to 88101213 Our results forICH and SAH code accuracy among a small number of par-ticipants were worse than those from the 2 previous UnitedKingdomndashbased studies of ICD-10 or Read codes1314 Thismay be in part because these studies checked only (or mainly)for administrative accuracy hence inflating the PPVs Ourresults were similar to a Danish study which also used anexpert review of medical records to check for overall accu-racy10 To our knowledge there are no previous validationstudies of stroke-specific UK primary care Read codes ordeath record codes for diagnosis of all types of stroke nor ofthe effect of combining primary care and hospital admissioncoded data

Figure 4 Exploratory analyses to improve accuracy of hemorrhagic stroke codes

Excluding other stroke code sameday excluded cases with a diagnostic code for gt1 stroke pathologic type on the sameday This was done because a patientwho has one pathologic stroke type (eg ischemic stroke) can sometimes develop a complication and subsequent brain scan appearances similar to anotherpathologic stroke type (eg a patient with ischemic stroke can have a bleed in the brain as a result of the ischemic stroke which could lead to a false diagnosisof an intracerebral hemorrhage [ICH]) CI = confidence interval PPV = positive predictive value SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e705

Further work will be required to further explore the accuracyof hemorrhagic stroke and death record codes in largernumbers of cases by expanding this work to other UKB re-cruitment locations which will also enable assessment of thegeneralizability of these results to other regions of the UnitedKingdom In our Scottish subpopulation while over one-thirdof the codes were not specific for a stroke type expert adju-dication allowed a stroke type to be assigned for almost allcases Future studies should investigate ways of automatingthis process as well as methods for determining more detailedstroke subtypes

Our results suggest that stroke and ischemic stroke cases inUKB can be identified through linked coded data withsufficient accuracy for many genetic and epidemiologicstudies while there are currently insufficient data to drawdefinite conclusions about the accuracy of hemorrhagicstroke codes

AcknowledgmentThe authors thank the UK Biobank participants UK Biobankscientific project and data management teams in OxfordStockport and Edinburgh and the UK Biobank strokeoutcomes working group (Oxford University Valerie BeralYiping Chen Zhengming Chen Jane Green Mary KrollSarah Lewington Peter Rothwell Lucy Wright EdinburghUniversity Martin Dennis Joanna Wardlaw Sarah Wild)This research was awarded the University of EdinburghArthur Fonville prize

Study fundingThis work was supported by salary from UK Biobank andHealth Data Research UK (HDRUK) fellowship (KR)salary from Dementias Platform UK (DPUK) clinical fel-lowship (KB) salary from Medical Research Council Clin-ical Research Training Fellowship (MRP0018231) (TW)salary from UK Biobank and DPUK (RF CS) salary fromUK Biobank (KW RW QZ) and salary from UK Bio-bank Scottish Funding Council DPUK and HDRUK(CLMS) (Chief Scientist for UK Biobank) Project man-agement and administration funding provided by UKBiobank

DisclosureK Rannikmae K Ngoh K Bush R Salman F Doubal RFlaig D Henshall A Hutchison J Nolan S Osborne NSamarasekera C Schnier W Whiteley T Wilkinson KWilson R Woodfield and Q Zhang report no disclosuresrelevant to the manuscript N Allen was senior epidemiologistwhen this study was conducted C Sudlow was chief scientistfor the UK Biobank study when this study was conducted Goto NeurologyorgN for full disclosures

Publication historyReceived by Neurology July 4 2019 Accepted in final formJanuary 27 2020

Appendix Authors

Name Location Contribution

KristiinaRannikmaePhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition analysis andinterpretation of data

Kenneth Ngoh UniversityofEdinburgh

Substantial contributions toacquisition analysis andinterpretation of data

Kathryn Bush UniversityofEdinburgh

Substantial contributions to theanalysis and interpretation of data

Rustam Al-Shahi SalmanPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Fergus DoubalPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Robin Flaig UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

David EHenshall

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

AidanHutchison

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

John Nolan UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Scott Osborne UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

NeshikaSamarasekeraPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

ChristianSchnier PhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Will WhiteleyPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Tim Wilkinson UniversityofEdinburgh

Substantial contributions to theconception and design of the work andinterpretation of data

Kirsty Wilson UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

RebeccaWoodfield PhD

UniversityofEdinburgh

Substantial contributions to theconception of the work andinterpretation of data

Qiuli ZhangPhD

UniversityofEdinburgh

Substantial contributions to theconception and design acquisitionanalysis and interpretation of data

Naomi AllenPhD

Universityof Oxford

Substantial contributions to theacquisition and interpretation of data

Cathie LMSudlow PhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition and interpretation of data

e706 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

References1 Lozano R Naghavi M Foreman K et al Global and regional mortality from 235

causes of death for 20 age groups in 1990 and 2010 a systematic analysis for theGlobal Burden of Disease Study 2010 Lancet 20123802095ndash2128

2 Burton PR Hansell AL Fortier I et al Size matters just how big is BIG quantifyingrealistic sample size requirements for human genome epidemiology Int J Epidemiol2009263ndash273

3 Sudlow C Gallacher J Allen N et al UK Biobank an open access resource foridentifying the causes of a wide range of complex diseases of middle and old age PLoSMed 201512e1001779

4 Woodfield R Grant I UK Biobank Stroke Outcomes Group UK Biobank Follow-Upand OutcomesWorking Group Sudlow CL Accuracy of electronic health record datafor identifying stroke cases in large-scale epidemiological studies a systematic reviewfrom the UK Biobank stroke outcomes group PLoS One 201510e0140533

5 WHO MONICA Project Investigators The World Health Organization MONICAproject (Monitoring Trends and Determinants in Cardiovascular Disease) J ClinEpidemiol 198841105ndash114

6 Ford DV Jones KH Verplancke JP et al The SAIL Databank building a nationalarchitecture for e-health research and evaluation BMC Health Serv Res 20099157

7 Viera AJ Garrett JM Understanding interobserver agreement the kappa statisticFam Med 200537360ndash363

8 Chubak J Pocobelli G Weiss NS Tradeoffs between accuracy measures for electronichealth care data algorithms J Clin Epidemiol 201265343ndash349e2

9 KirkmanMAMahattanakul W Gregson BAMendelow AD The accuracy of hospitaldischarge coding for hemorrhagic stroke Acta Neurol Belg 2009109114ndash119

10 Johnsen SP Overvad K Soslashrensen HT Tjoslashnneland A Husted SE Predictive value ofstroke and transient ischemic attack discharge diagnoses in the Danish NationalRegistry of Patients J Clin Epidemiol 200255602ndash607

11 Krarup LH Boysen G Janjua H Prescott E Truelsen T Validity of stroke diagnosesin a national register of patients Neuroepidemiology 200728150ndash154

12 Haesebaert J Termoz A Polazzi S et al Can hospital discharge databases be used tofollow ischemic stroke incidence Stroke 2013441770ndash1774

13 Wright FL Green J Canoy D et al Vascular disease in women comparison ofdiagnoses in hospital episode statistics and general practice records in England BMCMed Res Methodol 201212161

14 Gaist D Wallander MA Gonzalez-Perez A Garcia-Rodriguez L Incidence of hem-orrhagic stroke in the general population validation of data from the Health Im-provement Network Pharmacoepidemiol Drug Saf 201322176ndash182

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e707

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology

Page 7: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

Among the 15 participants with a false-positive ICH code 715had another primary stroke type plusmn secondary ICH 415intratumor bleed 315 traumatic or other intracranial bleedand 115 no obvious reason for an ICH code Among the 7participants with a false-positive SAH code 47 were traumaticintracranial bleed 27 another primary stroke type plusmn secondarySAH component and 17 asymptomatic aneurysm

Because performance for hemorrhagic stroke codes wassuboptimal having observed that the more common alter-native diagnoses were traumatic and intratumor bleeds orstroke complicated by a bleed we performed additionalanalyses to explore the effects of different code inclusioncriteria on numbers of cases numbers of true-positives andPPV for ICH and SAH For both ICH and SAH we couldincrease PPV but at the expense of failing to detect some true-positive cases However numbers of cases were too small todraw firm conclusions (figure 4)

Stroke type and subtype distributionsWith information from the EPR a stroke specialist was able toassign a pathologic stroke type to 177 of the 178 true-positivestroke cases 149 (84) were ischemic 11 (6) ICH and 17(10) SAH Depending on the subclassification system usedan ischemic subtype could be determined for 37ndash87 ofcases an ICH subtype for 27ndash100 of cases and SAHsubtype for 94 of cases (figure e-3 dryadorg105061dryadw9ghx3fk0)

DiscussionOur results suggest that stroke cases and cases of ischemicstroke type can be ascertained in UKB through linked codeddata with sufficient accuracy for use in many genetic andepidemiologic research studies without further expert vali-dation since PPVs were generally at least 80 despite

Figure 2 Positive predictive values (PPVs) of stroke codes

PPVs of stroke codes stratified by code source (A) and code type (B) Primary position includes primary care codes where no code position is specified andonly primary position hospital admission codes CI = confidence interval ICH = intracerebral hemorrhage SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e703

stringent adjudication criteria Primary care is an importantcode source with ge13 of the cases ascertained exclusivelyvia primary care data in this UK setting Code accuracyappeared slightly better for hospital admission compared toprimary care codes There were insufficient data to draw firmconclusions regarding the accuracy of death record orhemorrhagic stroke type codes Including only the primaryposition codes from hospital admission data increased PPVwithout loss of a substantial proportion of true-positivecases In the subpopulation of Scotland studied only around60 of codes were specific for a stroke type however usingall available relevant medical information from the EPR anexpert adjudicator could assign a stroke type to 99 anda more detailed stroke subtype to between a third and almost90 of cases depending on the subclassification systemused A higher proportion of participants had a specifiedstroke type code in the English and Welsh data but theaccuracy of these codes needs further investigation We alsodemonstrated that at least in this setting considering onlythe administrative accuracy of stroke diagnosis codes (ratherthan overall accuracy based on an expert review of the fullEPR) may give falsely inflated PPVs

Acceptable levels of accuracy and the relative importanceof different accuracy metrics depend on the context8 UKBis primarily used for research into the genetic and non-genetic determinants of disease3 In such analyses it isimportant to ensure that a high proportion of participantsidentified as disease cases truly do have the disease (high

PPV) to minimize bias in effect estimates while aiming tooptimize statistical power by ascertaining as many true-positive cases as possible (optimizing sensitivity but notnecessarily maximizing it as this may compromise thePPV) A high specificity (the proportion of participantswithout the disease that do not receive a stroke code) iscrucial in obtaining a high PPV but is not sufficient in andof itself In population-based prospective cohorts where theproportion of all participants who are true-positives forthe disease (in this case stroke) is generally low the pro-portion of all participants incorrectly classified as havingstroke (false-positives) will generally be low (giving highspecificity) even if the absolute number of false-positives ishigh compared to the absolute number of true-positives(low PPV)8 Providing appropriate codes are used boththe specificity and negative predictive value of routinelycollected health care data to identify disease casesin population-based studies are usually very high(96ndash100)4 For these reasons we focused on estimatingthe PPV of using routinely collected health care data toidentify stroke cases in UKB and on assessing the effects ofdifferent code and source selections on both PPV andnumber of true-positive cases

Strengths of this study include (1) the overall large number ofparticipants (2) inclusion of primary care and death recordcodes for which accuracy data have previously been limited(3) creation of vignettes using preset criteria from the EPR for97 of otherwise eligible stroke-coded cases thus avoiding

Figure 3 Assessing administrative vs overall accuracy

High clinical certainty mentions of stroke ldquostrokerdquo ldquoprobable strokerdquo ldquopresumptive strokerdquo ldquoconsistent with strokerdquo ldquocompatible with strokerdquo ldquolikely strokerdquoldquotreated as strokerdquo or equivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Medium clinicalcertainty mentions of stroke including above plus ldquopossible strokerdquo ldquosuspected strokerdquo ldquoimpression of strokerdquo ldquosuggestive of strokerdquo ldquoquery strokerdquo orequivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Low clinical certainty mentions of strokeinclude above plus ldquoTIArdquo and ldquotransient ischaemic attackrdquo with any level of certainty preceding it The hierarchy of clinical certainty was based on the ICD-10clinical coding instruction manual (isdscotlandorgProducts-and-ServicesTerminology-ServicesClinical-Coding- Guidelines April 2010 version) which isused by the coding departments in UK hospitals CI = confidence interval PPV = positive predictive value

e704 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

the bias of selecting participants with a higher prior proba-bility of having had a stroke (4) blinding of adjudicators tocodes and each otherrsquos diagnoses (5) assessment of inter-adjudicator reliability (6) further analysis of false-positivecodes and (7) assessment of both administrative and overallaccuracy demonstrating the importance of an expert-ledreference standard in validation studies

There are some limitations (1) small numbers precludedrobust conclusions about the accuracy of death record andhemorrhagic stroke codes (2) our lack of access to the fullfree text EPRs held exclusively by the primary care systemmay have led to slightly more conservative PPV estimates(although we would not expect this to be a major issue forstroke) as almost all stroke cases have acute inpatient oroutpatient management in secondary care (in keeping withthis is that we were able to find secondary care records for97 of the cases) and information provided from primaryto secondary care was available in the hospital EPR (3) thepotential difficulty in differentiating between definite false-positive and uncertain cases from retrospective review ofmedical records which may have further reduced the es-timated PPVs and (4) the restriction of the current studyto a Scottish subpopulation of the UKB participants (al-though we have no reason to suspect substantial variationin the results considering that the code generation processacross the United Kingdom follows similar pathways andgiven the broadly similar distribution of code sources in

the wider UKB population across England Scotland andWales)

The present study is limited to validating the accuracy of codesfor symptomatic stroke (as defined by the WHO) Subclinicalcerebrovascular disease (such as that detected by brain imaging)which is more common particularly in older people9 would notbe readily ascertained by the linked coded health care datasources used to follow the health of UKB participants but can bedetected through brain imaging conducted as part of the UKBmultimodal imaging study of 100000 of its participants

These results are largely consistent with those from our earliersystematic review4 where PPVs for stroke-specific ICD-10hospital admission codes ranged from 79 to 831011 andfor ischemic stroke from 86 to 88101213 Our results forICH and SAH code accuracy among a small number of par-ticipants were worse than those from the 2 previous UnitedKingdomndashbased studies of ICD-10 or Read codes1314 Thismay be in part because these studies checked only (or mainly)for administrative accuracy hence inflating the PPVs Ourresults were similar to a Danish study which also used anexpert review of medical records to check for overall accu-racy10 To our knowledge there are no previous validationstudies of stroke-specific UK primary care Read codes ordeath record codes for diagnosis of all types of stroke nor ofthe effect of combining primary care and hospital admissioncoded data

Figure 4 Exploratory analyses to improve accuracy of hemorrhagic stroke codes

Excluding other stroke code sameday excluded cases with a diagnostic code for gt1 stroke pathologic type on the sameday This was done because a patientwho has one pathologic stroke type (eg ischemic stroke) can sometimes develop a complication and subsequent brain scan appearances similar to anotherpathologic stroke type (eg a patient with ischemic stroke can have a bleed in the brain as a result of the ischemic stroke which could lead to a false diagnosisof an intracerebral hemorrhage [ICH]) CI = confidence interval PPV = positive predictive value SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e705

Further work will be required to further explore the accuracyof hemorrhagic stroke and death record codes in largernumbers of cases by expanding this work to other UKB re-cruitment locations which will also enable assessment of thegeneralizability of these results to other regions of the UnitedKingdom In our Scottish subpopulation while over one-thirdof the codes were not specific for a stroke type expert adju-dication allowed a stroke type to be assigned for almost allcases Future studies should investigate ways of automatingthis process as well as methods for determining more detailedstroke subtypes

Our results suggest that stroke and ischemic stroke cases inUKB can be identified through linked coded data withsufficient accuracy for many genetic and epidemiologicstudies while there are currently insufficient data to drawdefinite conclusions about the accuracy of hemorrhagicstroke codes

AcknowledgmentThe authors thank the UK Biobank participants UK Biobankscientific project and data management teams in OxfordStockport and Edinburgh and the UK Biobank strokeoutcomes working group (Oxford University Valerie BeralYiping Chen Zhengming Chen Jane Green Mary KrollSarah Lewington Peter Rothwell Lucy Wright EdinburghUniversity Martin Dennis Joanna Wardlaw Sarah Wild)This research was awarded the University of EdinburghArthur Fonville prize

Study fundingThis work was supported by salary from UK Biobank andHealth Data Research UK (HDRUK) fellowship (KR)salary from Dementias Platform UK (DPUK) clinical fel-lowship (KB) salary from Medical Research Council Clin-ical Research Training Fellowship (MRP0018231) (TW)salary from UK Biobank and DPUK (RF CS) salary fromUK Biobank (KW RW QZ) and salary from UK Bio-bank Scottish Funding Council DPUK and HDRUK(CLMS) (Chief Scientist for UK Biobank) Project man-agement and administration funding provided by UKBiobank

DisclosureK Rannikmae K Ngoh K Bush R Salman F Doubal RFlaig D Henshall A Hutchison J Nolan S Osborne NSamarasekera C Schnier W Whiteley T Wilkinson KWilson R Woodfield and Q Zhang report no disclosuresrelevant to the manuscript N Allen was senior epidemiologistwhen this study was conducted C Sudlow was chief scientistfor the UK Biobank study when this study was conducted Goto NeurologyorgN for full disclosures

Publication historyReceived by Neurology July 4 2019 Accepted in final formJanuary 27 2020

Appendix Authors

Name Location Contribution

KristiinaRannikmaePhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition analysis andinterpretation of data

Kenneth Ngoh UniversityofEdinburgh

Substantial contributions toacquisition analysis andinterpretation of data

Kathryn Bush UniversityofEdinburgh

Substantial contributions to theanalysis and interpretation of data

Rustam Al-Shahi SalmanPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Fergus DoubalPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Robin Flaig UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

David EHenshall

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

AidanHutchison

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

John Nolan UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Scott Osborne UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

NeshikaSamarasekeraPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

ChristianSchnier PhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Will WhiteleyPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Tim Wilkinson UniversityofEdinburgh

Substantial contributions to theconception and design of the work andinterpretation of data

Kirsty Wilson UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

RebeccaWoodfield PhD

UniversityofEdinburgh

Substantial contributions to theconception of the work andinterpretation of data

Qiuli ZhangPhD

UniversityofEdinburgh

Substantial contributions to theconception and design acquisitionanalysis and interpretation of data

Naomi AllenPhD

Universityof Oxford

Substantial contributions to theacquisition and interpretation of data

Cathie LMSudlow PhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition and interpretation of data

e706 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

References1 Lozano R Naghavi M Foreman K et al Global and regional mortality from 235

causes of death for 20 age groups in 1990 and 2010 a systematic analysis for theGlobal Burden of Disease Study 2010 Lancet 20123802095ndash2128

2 Burton PR Hansell AL Fortier I et al Size matters just how big is BIG quantifyingrealistic sample size requirements for human genome epidemiology Int J Epidemiol2009263ndash273

3 Sudlow C Gallacher J Allen N et al UK Biobank an open access resource foridentifying the causes of a wide range of complex diseases of middle and old age PLoSMed 201512e1001779

4 Woodfield R Grant I UK Biobank Stroke Outcomes Group UK Biobank Follow-Upand OutcomesWorking Group Sudlow CL Accuracy of electronic health record datafor identifying stroke cases in large-scale epidemiological studies a systematic reviewfrom the UK Biobank stroke outcomes group PLoS One 201510e0140533

5 WHO MONICA Project Investigators The World Health Organization MONICAproject (Monitoring Trends and Determinants in Cardiovascular Disease) J ClinEpidemiol 198841105ndash114

6 Ford DV Jones KH Verplancke JP et al The SAIL Databank building a nationalarchitecture for e-health research and evaluation BMC Health Serv Res 20099157

7 Viera AJ Garrett JM Understanding interobserver agreement the kappa statisticFam Med 200537360ndash363

8 Chubak J Pocobelli G Weiss NS Tradeoffs between accuracy measures for electronichealth care data algorithms J Clin Epidemiol 201265343ndash349e2

9 KirkmanMAMahattanakul W Gregson BAMendelow AD The accuracy of hospitaldischarge coding for hemorrhagic stroke Acta Neurol Belg 2009109114ndash119

10 Johnsen SP Overvad K Soslashrensen HT Tjoslashnneland A Husted SE Predictive value ofstroke and transient ischemic attack discharge diagnoses in the Danish NationalRegistry of Patients J Clin Epidemiol 200255602ndash607

11 Krarup LH Boysen G Janjua H Prescott E Truelsen T Validity of stroke diagnosesin a national register of patients Neuroepidemiology 200728150ndash154

12 Haesebaert J Termoz A Polazzi S et al Can hospital discharge databases be used tofollow ischemic stroke incidence Stroke 2013441770ndash1774

13 Wright FL Green J Canoy D et al Vascular disease in women comparison ofdiagnoses in hospital episode statistics and general practice records in England BMCMed Res Methodol 201212161

14 Gaist D Wallander MA Gonzalez-Perez A Garcia-Rodriguez L Incidence of hem-orrhagic stroke in the general population validation of data from the Health Im-provement Network Pharmacoepidemiol Drug Saf 201322176ndash182

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e707

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology

Page 8: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

stringent adjudication criteria Primary care is an importantcode source with ge13 of the cases ascertained exclusivelyvia primary care data in this UK setting Code accuracyappeared slightly better for hospital admission compared toprimary care codes There were insufficient data to draw firmconclusions regarding the accuracy of death record orhemorrhagic stroke type codes Including only the primaryposition codes from hospital admission data increased PPVwithout loss of a substantial proportion of true-positivecases In the subpopulation of Scotland studied only around60 of codes were specific for a stroke type however usingall available relevant medical information from the EPR anexpert adjudicator could assign a stroke type to 99 anda more detailed stroke subtype to between a third and almost90 of cases depending on the subclassification systemused A higher proportion of participants had a specifiedstroke type code in the English and Welsh data but theaccuracy of these codes needs further investigation We alsodemonstrated that at least in this setting considering onlythe administrative accuracy of stroke diagnosis codes (ratherthan overall accuracy based on an expert review of the fullEPR) may give falsely inflated PPVs

Acceptable levels of accuracy and the relative importanceof different accuracy metrics depend on the context8 UKBis primarily used for research into the genetic and non-genetic determinants of disease3 In such analyses it isimportant to ensure that a high proportion of participantsidentified as disease cases truly do have the disease (high

PPV) to minimize bias in effect estimates while aiming tooptimize statistical power by ascertaining as many true-positive cases as possible (optimizing sensitivity but notnecessarily maximizing it as this may compromise thePPV) A high specificity (the proportion of participantswithout the disease that do not receive a stroke code) iscrucial in obtaining a high PPV but is not sufficient in andof itself In population-based prospective cohorts where theproportion of all participants who are true-positives forthe disease (in this case stroke) is generally low the pro-portion of all participants incorrectly classified as havingstroke (false-positives) will generally be low (giving highspecificity) even if the absolute number of false-positives ishigh compared to the absolute number of true-positives(low PPV)8 Providing appropriate codes are used boththe specificity and negative predictive value of routinelycollected health care data to identify disease casesin population-based studies are usually very high(96ndash100)4 For these reasons we focused on estimatingthe PPV of using routinely collected health care data toidentify stroke cases in UKB and on assessing the effects ofdifferent code and source selections on both PPV andnumber of true-positive cases

Strengths of this study include (1) the overall large number ofparticipants (2) inclusion of primary care and death recordcodes for which accuracy data have previously been limited(3) creation of vignettes using preset criteria from the EPR for97 of otherwise eligible stroke-coded cases thus avoiding

Figure 3 Assessing administrative vs overall accuracy

High clinical certainty mentions of stroke ldquostrokerdquo ldquoprobable strokerdquo ldquopresumptive strokerdquo ldquoconsistent with strokerdquo ldquocompatible with strokerdquo ldquolikely strokerdquoldquotreated as strokerdquo or equivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Medium clinicalcertainty mentions of stroke including above plus ldquopossible strokerdquo ldquosuspected strokerdquo ldquoimpression of strokerdquo ldquosuggestive of strokerdquo ldquoquery strokerdquo orequivalent stroke terms (ICH SAH intracerebral hemorrhage subarachnoid hemorrhage ischemic stroke infarct) Low clinical certainty mentions of strokeinclude above plus ldquoTIArdquo and ldquotransient ischaemic attackrdquo with any level of certainty preceding it The hierarchy of clinical certainty was based on the ICD-10clinical coding instruction manual (isdscotlandorgProducts-and-ServicesTerminology-ServicesClinical-Coding- Guidelines April 2010 version) which isused by the coding departments in UK hospitals CI = confidence interval PPV = positive predictive value

e704 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

the bias of selecting participants with a higher prior proba-bility of having had a stroke (4) blinding of adjudicators tocodes and each otherrsquos diagnoses (5) assessment of inter-adjudicator reliability (6) further analysis of false-positivecodes and (7) assessment of both administrative and overallaccuracy demonstrating the importance of an expert-ledreference standard in validation studies

There are some limitations (1) small numbers precludedrobust conclusions about the accuracy of death record andhemorrhagic stroke codes (2) our lack of access to the fullfree text EPRs held exclusively by the primary care systemmay have led to slightly more conservative PPV estimates(although we would not expect this to be a major issue forstroke) as almost all stroke cases have acute inpatient oroutpatient management in secondary care (in keeping withthis is that we were able to find secondary care records for97 of the cases) and information provided from primaryto secondary care was available in the hospital EPR (3) thepotential difficulty in differentiating between definite false-positive and uncertain cases from retrospective review ofmedical records which may have further reduced the es-timated PPVs and (4) the restriction of the current studyto a Scottish subpopulation of the UKB participants (al-though we have no reason to suspect substantial variationin the results considering that the code generation processacross the United Kingdom follows similar pathways andgiven the broadly similar distribution of code sources in

the wider UKB population across England Scotland andWales)

The present study is limited to validating the accuracy of codesfor symptomatic stroke (as defined by the WHO) Subclinicalcerebrovascular disease (such as that detected by brain imaging)which is more common particularly in older people9 would notbe readily ascertained by the linked coded health care datasources used to follow the health of UKB participants but can bedetected through brain imaging conducted as part of the UKBmultimodal imaging study of 100000 of its participants

These results are largely consistent with those from our earliersystematic review4 where PPVs for stroke-specific ICD-10hospital admission codes ranged from 79 to 831011 andfor ischemic stroke from 86 to 88101213 Our results forICH and SAH code accuracy among a small number of par-ticipants were worse than those from the 2 previous UnitedKingdomndashbased studies of ICD-10 or Read codes1314 Thismay be in part because these studies checked only (or mainly)for administrative accuracy hence inflating the PPVs Ourresults were similar to a Danish study which also used anexpert review of medical records to check for overall accu-racy10 To our knowledge there are no previous validationstudies of stroke-specific UK primary care Read codes ordeath record codes for diagnosis of all types of stroke nor ofthe effect of combining primary care and hospital admissioncoded data

Figure 4 Exploratory analyses to improve accuracy of hemorrhagic stroke codes

Excluding other stroke code sameday excluded cases with a diagnostic code for gt1 stroke pathologic type on the sameday This was done because a patientwho has one pathologic stroke type (eg ischemic stroke) can sometimes develop a complication and subsequent brain scan appearances similar to anotherpathologic stroke type (eg a patient with ischemic stroke can have a bleed in the brain as a result of the ischemic stroke which could lead to a false diagnosisof an intracerebral hemorrhage [ICH]) CI = confidence interval PPV = positive predictive value SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e705

Further work will be required to further explore the accuracyof hemorrhagic stroke and death record codes in largernumbers of cases by expanding this work to other UKB re-cruitment locations which will also enable assessment of thegeneralizability of these results to other regions of the UnitedKingdom In our Scottish subpopulation while over one-thirdof the codes were not specific for a stroke type expert adju-dication allowed a stroke type to be assigned for almost allcases Future studies should investigate ways of automatingthis process as well as methods for determining more detailedstroke subtypes

Our results suggest that stroke and ischemic stroke cases inUKB can be identified through linked coded data withsufficient accuracy for many genetic and epidemiologicstudies while there are currently insufficient data to drawdefinite conclusions about the accuracy of hemorrhagicstroke codes

AcknowledgmentThe authors thank the UK Biobank participants UK Biobankscientific project and data management teams in OxfordStockport and Edinburgh and the UK Biobank strokeoutcomes working group (Oxford University Valerie BeralYiping Chen Zhengming Chen Jane Green Mary KrollSarah Lewington Peter Rothwell Lucy Wright EdinburghUniversity Martin Dennis Joanna Wardlaw Sarah Wild)This research was awarded the University of EdinburghArthur Fonville prize

Study fundingThis work was supported by salary from UK Biobank andHealth Data Research UK (HDRUK) fellowship (KR)salary from Dementias Platform UK (DPUK) clinical fel-lowship (KB) salary from Medical Research Council Clin-ical Research Training Fellowship (MRP0018231) (TW)salary from UK Biobank and DPUK (RF CS) salary fromUK Biobank (KW RW QZ) and salary from UK Bio-bank Scottish Funding Council DPUK and HDRUK(CLMS) (Chief Scientist for UK Biobank) Project man-agement and administration funding provided by UKBiobank

DisclosureK Rannikmae K Ngoh K Bush R Salman F Doubal RFlaig D Henshall A Hutchison J Nolan S Osborne NSamarasekera C Schnier W Whiteley T Wilkinson KWilson R Woodfield and Q Zhang report no disclosuresrelevant to the manuscript N Allen was senior epidemiologistwhen this study was conducted C Sudlow was chief scientistfor the UK Biobank study when this study was conducted Goto NeurologyorgN for full disclosures

Publication historyReceived by Neurology July 4 2019 Accepted in final formJanuary 27 2020

Appendix Authors

Name Location Contribution

KristiinaRannikmaePhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition analysis andinterpretation of data

Kenneth Ngoh UniversityofEdinburgh

Substantial contributions toacquisition analysis andinterpretation of data

Kathryn Bush UniversityofEdinburgh

Substantial contributions to theanalysis and interpretation of data

Rustam Al-Shahi SalmanPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Fergus DoubalPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Robin Flaig UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

David EHenshall

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

AidanHutchison

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

John Nolan UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Scott Osborne UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

NeshikaSamarasekeraPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

ChristianSchnier PhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Will WhiteleyPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Tim Wilkinson UniversityofEdinburgh

Substantial contributions to theconception and design of the work andinterpretation of data

Kirsty Wilson UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

RebeccaWoodfield PhD

UniversityofEdinburgh

Substantial contributions to theconception of the work andinterpretation of data

Qiuli ZhangPhD

UniversityofEdinburgh

Substantial contributions to theconception and design acquisitionanalysis and interpretation of data

Naomi AllenPhD

Universityof Oxford

Substantial contributions to theacquisition and interpretation of data

Cathie LMSudlow PhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition and interpretation of data

e706 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

References1 Lozano R Naghavi M Foreman K et al Global and regional mortality from 235

causes of death for 20 age groups in 1990 and 2010 a systematic analysis for theGlobal Burden of Disease Study 2010 Lancet 20123802095ndash2128

2 Burton PR Hansell AL Fortier I et al Size matters just how big is BIG quantifyingrealistic sample size requirements for human genome epidemiology Int J Epidemiol2009263ndash273

3 Sudlow C Gallacher J Allen N et al UK Biobank an open access resource foridentifying the causes of a wide range of complex diseases of middle and old age PLoSMed 201512e1001779

4 Woodfield R Grant I UK Biobank Stroke Outcomes Group UK Biobank Follow-Upand OutcomesWorking Group Sudlow CL Accuracy of electronic health record datafor identifying stroke cases in large-scale epidemiological studies a systematic reviewfrom the UK Biobank stroke outcomes group PLoS One 201510e0140533

5 WHO MONICA Project Investigators The World Health Organization MONICAproject (Monitoring Trends and Determinants in Cardiovascular Disease) J ClinEpidemiol 198841105ndash114

6 Ford DV Jones KH Verplancke JP et al The SAIL Databank building a nationalarchitecture for e-health research and evaluation BMC Health Serv Res 20099157

7 Viera AJ Garrett JM Understanding interobserver agreement the kappa statisticFam Med 200537360ndash363

8 Chubak J Pocobelli G Weiss NS Tradeoffs between accuracy measures for electronichealth care data algorithms J Clin Epidemiol 201265343ndash349e2

9 KirkmanMAMahattanakul W Gregson BAMendelow AD The accuracy of hospitaldischarge coding for hemorrhagic stroke Acta Neurol Belg 2009109114ndash119

10 Johnsen SP Overvad K Soslashrensen HT Tjoslashnneland A Husted SE Predictive value ofstroke and transient ischemic attack discharge diagnoses in the Danish NationalRegistry of Patients J Clin Epidemiol 200255602ndash607

11 Krarup LH Boysen G Janjua H Prescott E Truelsen T Validity of stroke diagnosesin a national register of patients Neuroepidemiology 200728150ndash154

12 Haesebaert J Termoz A Polazzi S et al Can hospital discharge databases be used tofollow ischemic stroke incidence Stroke 2013441770ndash1774

13 Wright FL Green J Canoy D et al Vascular disease in women comparison ofdiagnoses in hospital episode statistics and general practice records in England BMCMed Res Methodol 201212161

14 Gaist D Wallander MA Gonzalez-Perez A Garcia-Rodriguez L Incidence of hem-orrhagic stroke in the general population validation of data from the Health Im-provement Network Pharmacoepidemiol Drug Saf 201322176ndash182

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e707

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology

Page 9: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

the bias of selecting participants with a higher prior proba-bility of having had a stroke (4) blinding of adjudicators tocodes and each otherrsquos diagnoses (5) assessment of inter-adjudicator reliability (6) further analysis of false-positivecodes and (7) assessment of both administrative and overallaccuracy demonstrating the importance of an expert-ledreference standard in validation studies

There are some limitations (1) small numbers precludedrobust conclusions about the accuracy of death record andhemorrhagic stroke codes (2) our lack of access to the fullfree text EPRs held exclusively by the primary care systemmay have led to slightly more conservative PPV estimates(although we would not expect this to be a major issue forstroke) as almost all stroke cases have acute inpatient oroutpatient management in secondary care (in keeping withthis is that we were able to find secondary care records for97 of the cases) and information provided from primaryto secondary care was available in the hospital EPR (3) thepotential difficulty in differentiating between definite false-positive and uncertain cases from retrospective review ofmedical records which may have further reduced the es-timated PPVs and (4) the restriction of the current studyto a Scottish subpopulation of the UKB participants (al-though we have no reason to suspect substantial variationin the results considering that the code generation processacross the United Kingdom follows similar pathways andgiven the broadly similar distribution of code sources in

the wider UKB population across England Scotland andWales)

The present study is limited to validating the accuracy of codesfor symptomatic stroke (as defined by the WHO) Subclinicalcerebrovascular disease (such as that detected by brain imaging)which is more common particularly in older people9 would notbe readily ascertained by the linked coded health care datasources used to follow the health of UKB participants but can bedetected through brain imaging conducted as part of the UKBmultimodal imaging study of 100000 of its participants

These results are largely consistent with those from our earliersystematic review4 where PPVs for stroke-specific ICD-10hospital admission codes ranged from 79 to 831011 andfor ischemic stroke from 86 to 88101213 Our results forICH and SAH code accuracy among a small number of par-ticipants were worse than those from the 2 previous UnitedKingdomndashbased studies of ICD-10 or Read codes1314 Thismay be in part because these studies checked only (or mainly)for administrative accuracy hence inflating the PPVs Ourresults were similar to a Danish study which also used anexpert review of medical records to check for overall accu-racy10 To our knowledge there are no previous validationstudies of stroke-specific UK primary care Read codes ordeath record codes for diagnosis of all types of stroke nor ofthe effect of combining primary care and hospital admissioncoded data

Figure 4 Exploratory analyses to improve accuracy of hemorrhagic stroke codes

Excluding other stroke code sameday excluded cases with a diagnostic code for gt1 stroke pathologic type on the sameday This was done because a patientwho has one pathologic stroke type (eg ischemic stroke) can sometimes develop a complication and subsequent brain scan appearances similar to anotherpathologic stroke type (eg a patient with ischemic stroke can have a bleed in the brain as a result of the ischemic stroke which could lead to a false diagnosisof an intracerebral hemorrhage [ICH]) CI = confidence interval PPV = positive predictive value SAH = subarachnoid hemorrhage

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e705

Further work will be required to further explore the accuracyof hemorrhagic stroke and death record codes in largernumbers of cases by expanding this work to other UKB re-cruitment locations which will also enable assessment of thegeneralizability of these results to other regions of the UnitedKingdom In our Scottish subpopulation while over one-thirdof the codes were not specific for a stroke type expert adju-dication allowed a stroke type to be assigned for almost allcases Future studies should investigate ways of automatingthis process as well as methods for determining more detailedstroke subtypes

Our results suggest that stroke and ischemic stroke cases inUKB can be identified through linked coded data withsufficient accuracy for many genetic and epidemiologicstudies while there are currently insufficient data to drawdefinite conclusions about the accuracy of hemorrhagicstroke codes

AcknowledgmentThe authors thank the UK Biobank participants UK Biobankscientific project and data management teams in OxfordStockport and Edinburgh and the UK Biobank strokeoutcomes working group (Oxford University Valerie BeralYiping Chen Zhengming Chen Jane Green Mary KrollSarah Lewington Peter Rothwell Lucy Wright EdinburghUniversity Martin Dennis Joanna Wardlaw Sarah Wild)This research was awarded the University of EdinburghArthur Fonville prize

Study fundingThis work was supported by salary from UK Biobank andHealth Data Research UK (HDRUK) fellowship (KR)salary from Dementias Platform UK (DPUK) clinical fel-lowship (KB) salary from Medical Research Council Clin-ical Research Training Fellowship (MRP0018231) (TW)salary from UK Biobank and DPUK (RF CS) salary fromUK Biobank (KW RW QZ) and salary from UK Bio-bank Scottish Funding Council DPUK and HDRUK(CLMS) (Chief Scientist for UK Biobank) Project man-agement and administration funding provided by UKBiobank

DisclosureK Rannikmae K Ngoh K Bush R Salman F Doubal RFlaig D Henshall A Hutchison J Nolan S Osborne NSamarasekera C Schnier W Whiteley T Wilkinson KWilson R Woodfield and Q Zhang report no disclosuresrelevant to the manuscript N Allen was senior epidemiologistwhen this study was conducted C Sudlow was chief scientistfor the UK Biobank study when this study was conducted Goto NeurologyorgN for full disclosures

Publication historyReceived by Neurology July 4 2019 Accepted in final formJanuary 27 2020

Appendix Authors

Name Location Contribution

KristiinaRannikmaePhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition analysis andinterpretation of data

Kenneth Ngoh UniversityofEdinburgh

Substantial contributions toacquisition analysis andinterpretation of data

Kathryn Bush UniversityofEdinburgh

Substantial contributions to theanalysis and interpretation of data

Rustam Al-Shahi SalmanPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Fergus DoubalPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Robin Flaig UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

David EHenshall

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

AidanHutchison

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

John Nolan UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Scott Osborne UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

NeshikaSamarasekeraPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

ChristianSchnier PhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Will WhiteleyPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Tim Wilkinson UniversityofEdinburgh

Substantial contributions to theconception and design of the work andinterpretation of data

Kirsty Wilson UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

RebeccaWoodfield PhD

UniversityofEdinburgh

Substantial contributions to theconception of the work andinterpretation of data

Qiuli ZhangPhD

UniversityofEdinburgh

Substantial contributions to theconception and design acquisitionanalysis and interpretation of data

Naomi AllenPhD

Universityof Oxford

Substantial contributions to theacquisition and interpretation of data

Cathie LMSudlow PhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition and interpretation of data

e706 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

References1 Lozano R Naghavi M Foreman K et al Global and regional mortality from 235

causes of death for 20 age groups in 1990 and 2010 a systematic analysis for theGlobal Burden of Disease Study 2010 Lancet 20123802095ndash2128

2 Burton PR Hansell AL Fortier I et al Size matters just how big is BIG quantifyingrealistic sample size requirements for human genome epidemiology Int J Epidemiol2009263ndash273

3 Sudlow C Gallacher J Allen N et al UK Biobank an open access resource foridentifying the causes of a wide range of complex diseases of middle and old age PLoSMed 201512e1001779

4 Woodfield R Grant I UK Biobank Stroke Outcomes Group UK Biobank Follow-Upand OutcomesWorking Group Sudlow CL Accuracy of electronic health record datafor identifying stroke cases in large-scale epidemiological studies a systematic reviewfrom the UK Biobank stroke outcomes group PLoS One 201510e0140533

5 WHO MONICA Project Investigators The World Health Organization MONICAproject (Monitoring Trends and Determinants in Cardiovascular Disease) J ClinEpidemiol 198841105ndash114

6 Ford DV Jones KH Verplancke JP et al The SAIL Databank building a nationalarchitecture for e-health research and evaluation BMC Health Serv Res 20099157

7 Viera AJ Garrett JM Understanding interobserver agreement the kappa statisticFam Med 200537360ndash363

8 Chubak J Pocobelli G Weiss NS Tradeoffs between accuracy measures for electronichealth care data algorithms J Clin Epidemiol 201265343ndash349e2

9 KirkmanMAMahattanakul W Gregson BAMendelow AD The accuracy of hospitaldischarge coding for hemorrhagic stroke Acta Neurol Belg 2009109114ndash119

10 Johnsen SP Overvad K Soslashrensen HT Tjoslashnneland A Husted SE Predictive value ofstroke and transient ischemic attack discharge diagnoses in the Danish NationalRegistry of Patients J Clin Epidemiol 200255602ndash607

11 Krarup LH Boysen G Janjua H Prescott E Truelsen T Validity of stroke diagnosesin a national register of patients Neuroepidemiology 200728150ndash154

12 Haesebaert J Termoz A Polazzi S et al Can hospital discharge databases be used tofollow ischemic stroke incidence Stroke 2013441770ndash1774

13 Wright FL Green J Canoy D et al Vascular disease in women comparison ofdiagnoses in hospital episode statistics and general practice records in England BMCMed Res Methodol 201212161

14 Gaist D Wallander MA Gonzalez-Perez A Garcia-Rodriguez L Incidence of hem-orrhagic stroke in the general population validation of data from the Health Im-provement Network Pharmacoepidemiol Drug Saf 201322176ndash182

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e707

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology

Page 10: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

Further work will be required to further explore the accuracyof hemorrhagic stroke and death record codes in largernumbers of cases by expanding this work to other UKB re-cruitment locations which will also enable assessment of thegeneralizability of these results to other regions of the UnitedKingdom In our Scottish subpopulation while over one-thirdof the codes were not specific for a stroke type expert adju-dication allowed a stroke type to be assigned for almost allcases Future studies should investigate ways of automatingthis process as well as methods for determining more detailedstroke subtypes

Our results suggest that stroke and ischemic stroke cases inUKB can be identified through linked coded data withsufficient accuracy for many genetic and epidemiologicstudies while there are currently insufficient data to drawdefinite conclusions about the accuracy of hemorrhagicstroke codes

AcknowledgmentThe authors thank the UK Biobank participants UK Biobankscientific project and data management teams in OxfordStockport and Edinburgh and the UK Biobank strokeoutcomes working group (Oxford University Valerie BeralYiping Chen Zhengming Chen Jane Green Mary KrollSarah Lewington Peter Rothwell Lucy Wright EdinburghUniversity Martin Dennis Joanna Wardlaw Sarah Wild)This research was awarded the University of EdinburghArthur Fonville prize

Study fundingThis work was supported by salary from UK Biobank andHealth Data Research UK (HDRUK) fellowship (KR)salary from Dementias Platform UK (DPUK) clinical fel-lowship (KB) salary from Medical Research Council Clin-ical Research Training Fellowship (MRP0018231) (TW)salary from UK Biobank and DPUK (RF CS) salary fromUK Biobank (KW RW QZ) and salary from UK Bio-bank Scottish Funding Council DPUK and HDRUK(CLMS) (Chief Scientist for UK Biobank) Project man-agement and administration funding provided by UKBiobank

DisclosureK Rannikmae K Ngoh K Bush R Salman F Doubal RFlaig D Henshall A Hutchison J Nolan S Osborne NSamarasekera C Schnier W Whiteley T Wilkinson KWilson R Woodfield and Q Zhang report no disclosuresrelevant to the manuscript N Allen was senior epidemiologistwhen this study was conducted C Sudlow was chief scientistfor the UK Biobank study when this study was conducted Goto NeurologyorgN for full disclosures

Publication historyReceived by Neurology July 4 2019 Accepted in final formJanuary 27 2020

Appendix Authors

Name Location Contribution

KristiinaRannikmaePhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition analysis andinterpretation of data

Kenneth Ngoh UniversityofEdinburgh

Substantial contributions toacquisition analysis andinterpretation of data

Kathryn Bush UniversityofEdinburgh

Substantial contributions to theanalysis and interpretation of data

Rustam Al-Shahi SalmanPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Fergus DoubalPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Robin Flaig UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

David EHenshall

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

AidanHutchison

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

John Nolan UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Scott Osborne UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

NeshikaSamarasekeraPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

ChristianSchnier PhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Will WhiteleyPhD

UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

Tim Wilkinson UniversityofEdinburgh

Substantial contributions to theconception and design of the work andinterpretation of data

Kirsty Wilson UniversityofEdinburgh

Substantial contributions to theacquisition and interpretation of data

RebeccaWoodfield PhD

UniversityofEdinburgh

Substantial contributions to theconception of the work andinterpretation of data

Qiuli ZhangPhD

UniversityofEdinburgh

Substantial contributions to theconception and design acquisitionanalysis and interpretation of data

Naomi AllenPhD

Universityof Oxford

Substantial contributions to theacquisition and interpretation of data

Cathie LMSudlow PhD

UniversityofEdinburgh

Substantial contributions to theconception and design of the workacquisition and interpretation of data

e706 Neurology | Volume 95 Number 6 | August 11 2020 NeurologyorgN

References1 Lozano R Naghavi M Foreman K et al Global and regional mortality from 235

causes of death for 20 age groups in 1990 and 2010 a systematic analysis for theGlobal Burden of Disease Study 2010 Lancet 20123802095ndash2128

2 Burton PR Hansell AL Fortier I et al Size matters just how big is BIG quantifyingrealistic sample size requirements for human genome epidemiology Int J Epidemiol2009263ndash273

3 Sudlow C Gallacher J Allen N et al UK Biobank an open access resource foridentifying the causes of a wide range of complex diseases of middle and old age PLoSMed 201512e1001779

4 Woodfield R Grant I UK Biobank Stroke Outcomes Group UK Biobank Follow-Upand OutcomesWorking Group Sudlow CL Accuracy of electronic health record datafor identifying stroke cases in large-scale epidemiological studies a systematic reviewfrom the UK Biobank stroke outcomes group PLoS One 201510e0140533

5 WHO MONICA Project Investigators The World Health Organization MONICAproject (Monitoring Trends and Determinants in Cardiovascular Disease) J ClinEpidemiol 198841105ndash114

6 Ford DV Jones KH Verplancke JP et al The SAIL Databank building a nationalarchitecture for e-health research and evaluation BMC Health Serv Res 20099157

7 Viera AJ Garrett JM Understanding interobserver agreement the kappa statisticFam Med 200537360ndash363

8 Chubak J Pocobelli G Weiss NS Tradeoffs between accuracy measures for electronichealth care data algorithms J Clin Epidemiol 201265343ndash349e2

9 KirkmanMAMahattanakul W Gregson BAMendelow AD The accuracy of hospitaldischarge coding for hemorrhagic stroke Acta Neurol Belg 2009109114ndash119

10 Johnsen SP Overvad K Soslashrensen HT Tjoslashnneland A Husted SE Predictive value ofstroke and transient ischemic attack discharge diagnoses in the Danish NationalRegistry of Patients J Clin Epidemiol 200255602ndash607

11 Krarup LH Boysen G Janjua H Prescott E Truelsen T Validity of stroke diagnosesin a national register of patients Neuroepidemiology 200728150ndash154

12 Haesebaert J Termoz A Polazzi S et al Can hospital discharge databases be used tofollow ischemic stroke incidence Stroke 2013441770ndash1774

13 Wright FL Green J Canoy D et al Vascular disease in women comparison ofdiagnoses in hospital episode statistics and general practice records in England BMCMed Res Methodol 201212161

14 Gaist D Wallander MA Gonzalez-Perez A Garcia-Rodriguez L Incidence of hem-orrhagic stroke in the general population validation of data from the Health Im-provement Network Pharmacoepidemiol Drug Saf 201322176ndash182

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e707

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology

Page 11: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

References1 Lozano R Naghavi M Foreman K et al Global and regional mortality from 235

causes of death for 20 age groups in 1990 and 2010 a systematic analysis for theGlobal Burden of Disease Study 2010 Lancet 20123802095ndash2128

2 Burton PR Hansell AL Fortier I et al Size matters just how big is BIG quantifyingrealistic sample size requirements for human genome epidemiology Int J Epidemiol2009263ndash273

3 Sudlow C Gallacher J Allen N et al UK Biobank an open access resource foridentifying the causes of a wide range of complex diseases of middle and old age PLoSMed 201512e1001779

4 Woodfield R Grant I UK Biobank Stroke Outcomes Group UK Biobank Follow-Upand OutcomesWorking Group Sudlow CL Accuracy of electronic health record datafor identifying stroke cases in large-scale epidemiological studies a systematic reviewfrom the UK Biobank stroke outcomes group PLoS One 201510e0140533

5 WHO MONICA Project Investigators The World Health Organization MONICAproject (Monitoring Trends and Determinants in Cardiovascular Disease) J ClinEpidemiol 198841105ndash114

6 Ford DV Jones KH Verplancke JP et al The SAIL Databank building a nationalarchitecture for e-health research and evaluation BMC Health Serv Res 20099157

7 Viera AJ Garrett JM Understanding interobserver agreement the kappa statisticFam Med 200537360ndash363

8 Chubak J Pocobelli G Weiss NS Tradeoffs between accuracy measures for electronichealth care data algorithms J Clin Epidemiol 201265343ndash349e2

9 KirkmanMAMahattanakul W Gregson BAMendelow AD The accuracy of hospitaldischarge coding for hemorrhagic stroke Acta Neurol Belg 2009109114ndash119

10 Johnsen SP Overvad K Soslashrensen HT Tjoslashnneland A Husted SE Predictive value ofstroke and transient ischemic attack discharge diagnoses in the Danish NationalRegistry of Patients J Clin Epidemiol 200255602ndash607

11 Krarup LH Boysen G Janjua H Prescott E Truelsen T Validity of stroke diagnosesin a national register of patients Neuroepidemiology 200728150ndash154

12 Haesebaert J Termoz A Polazzi S et al Can hospital discharge databases be used tofollow ischemic stroke incidence Stroke 2013441770ndash1774

13 Wright FL Green J Canoy D et al Vascular disease in women comparison ofdiagnoses in hospital episode statistics and general practice records in England BMCMed Res Methodol 201212161

14 Gaist D Wallander MA Gonzalez-Perez A Garcia-Rodriguez L Incidence of hem-orrhagic stroke in the general population validation of data from the Health Im-provement Network Pharmacoepidemiol Drug Saf 201322176ndash182

NeurologyorgN Neurology | Volume 95 Number 6 | August 11 2020 e707

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology

Page 12: Accuracyofidentifyingincidentstrokecasesfrom linked health ... · UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information

DOI 101212WNL0000000000009924202095e697-e707 Published Online before print July 2 2020Neurology

Kristiina Rannikmaumle Kenneth Ngoh Kathryn Bush et al Biobank

Accuracy of identifying incident stroke cases from linked health care data in UK

This information is current as of July 2 2020

ServicesUpdated Information amp

httpnneurologyorgcontent956e697fullincluding high resolution figures can be found at

References httpnneurologyorgcontent956e697fullref-list-1

This article cites 13 articles 1 of which you can access for free at

Citations httpnneurologyorgcontent956e697fullotherarticles

This article has been cited by 1 HighWire-hosted articles

Subspecialty Collections

httpnneurologyorgcgicollectionsubarachnoid_hemorrhageSubarachnoid hemorrhage

httpnneurologyorgcgicollectionintracerebral_hemorrhageIntracerebral hemorrhage

httpnneurologyorgcgicollectioninfarctionInfarction

httpnneurologyorgcgicollectioncohort_studiesCohort studies e

httpnneurologyorgcgicollectionall_cerebrovascular_disease_strokAll Cerebrovascular diseaseStrokefollowing collection(s) This article along with others on similar topics appears in the

Permissions amp Licensing

httpwwwneurologyorgaboutabout_the_journalpermissionsits entirety can be found online atInformation about reproducing this article in parts (figurestables) or in

Reprints

httpnneurologyorgsubscribersadvertiseInformation about ordering reprints can be found online

ISSN 0028-3878 Online ISSN 1526-632XWolters Kluwer Health Inc on behalf of the American Academy of Neurology All rights reserved Print1951 it is now a weekly with 48 issues per year Copyright Copyright copy 2020 The Author(s) Published by

reg is the official journal of the American Academy of Neurology Published continuously sinceNeurology