QIPP Digital Technology
© Crown Copyright 2012

Technical Guidance on Selecting and Implementing Predictive Modelling Solutions

Author: Mike Kelly
Date: 1/03/2012
Version: 1.0



Amendment History:

Version | Date      | Amendment History
0.1     |           | First draft for comment
0.2     |           | Second draft for internal review
0.3     |           | Third draft for external review
1.0     | 1/03/2012 | Final version


CONTENTS

1. PURPOSE 5

1.1. Scope 5
1.1.1. Out of Scope 6

1.2. Intended Audience 6

1.3. Document Conventions 6

2. BACKGROUND 7

2.1. Long Term Conditions 7

2.2. QIPP Digital Technology Team 7

2.3. Approach for the Development of this Guidance 8

3. SUMMARY 9

4. A SHORT INTRODUCTION TO PREDICTIVE MODELLING 10

5. A PREDICTIVE MODELLING SOLUTION FRAMEWORK 13

5.1. Conceptual Solution Architecture 15

5.2. Logical Solution Architecture 16

6. MODEL BUILDING OR SELECTION 17

6.1. Data Sources 17

6.2. Data Quality 22

6.3. Data Sharing 24

6.4. Data Linkage 26

6.5. Data Storage 28

6.6. Model Creation 30

6.7. Model Internal Validity 32

6.8. Model External Validity 33

6.9. Model Representation 34

6.10. Model Maintenance 41

7. PREDICTION TOOLING 44

7.1. Model Choice 44


7.2. Marshalling Input Data 45

7.3. Tool Integration 48

7.4. Tool Function and Use 50

7.5. Implementation 54
7.5.1. Development/selection 54
7.5.2. Integration Testing 55
7.5.3. Hosting 55
7.5.4. Operational Management 56
7.5.5. Medical Device Status 56

8. PREDICTION APPLICATION 60

8.1. Prediction Representation 60

8.2. Prediction Patient Record Status 62

8.3. Storage 62

8.4. Downstream Integration 63

9. PREDICTION USER ADOPTION 65

10. REFERENCES 66

APPENDIX A – CPM PMML 67

APPENDIX B – CPM TEST DATA 73


1. Purpose

The QIPP Long Term Conditions (LTC) workstream has identified the need for technical guidance for local teams implementing or extending the use of risk stratification. Such guidance should increase the efficiency and effectiveness of local delivery of risk stratification and predictive modelling solutions more generally. QIPP LTC has sponsored the QIPP Digital Technology (QDT) team to provide this technical guidance for distribution to local QIPP LTC teams and the wider NHS. This document presents the QDT technical guidance for selecting and implementing predictive modelling solutions within the NHS.

1.1. Scope

For any technology solution implemented within a healthcare setting, a wide range of areas needs to be considered. This document is not intended to cover every aspect of the delivery of a solution. The diagram below gives a general overview of some of the areas you may need to consider; the areas that are addressed (at least in part) in this document are shown in green:

[Diagram: areas to consider when delivering a technology solution – Service Management, Change Management, Clinical Safety, Benefits Realisation, Procurement / Contract Management, Business Scenarios, Information Governance, Infrastructure Security, Accreditation / Assurance, ISB Standards, Clinical Coding / Terminology, ITK Specifications]

This guide provides:

• An overview of the basic concepts of predictive modelling
• A predictive modelling solution framework that provides a simple way of understanding the fundamental stages in implementing predictive modelling
• Technical considerations and recommendations for each stage of the solution framework


1.1.1. Out of Scope

This guide does not provide:

• A statistical discussion of the strengths and weaknesses of different types of predictive models
• A statistical discussion on validating and evaluating predictive models
• Consideration of the costs and benefits of implementing predictive modelling

Useful discussion of these topics can already be found in [Ref: 3] and [Ref: 4].

1.2. Intended Audience

This guide is intended for technical staff in health and social care provider and commissioner organisations who are engaged in the planning, design, implementation and operation of predictive modelling solutions.

1.3. Document Conventions

To aid clarity, a number of conventions are followed in this document:

• Where additional sources of information are referenced in the text, a reference number links to the appropriate entry in the References section at the end of this document, e.g. [Ref: 1].


2. Background

Quality, Innovation, Productivity and Prevention (QIPP) is a large scale transformational programme within the NHS, involving all NHS staff, clinicians, patients and the voluntary sector [Ref: 1]. It will improve the quality of care the NHS delivers while making up to £20 billion of efficiency savings by 2014-15, which will be reinvested in frontline care. At a regional and local level, Strategic Health Authorities have been developing integrated QIPP plans, supported by national QIPP workstreams that produce tools and programmes to help local change leaders implement successfully.

2.1. Long Term Conditions

The Long Term Conditions (LTC) workstream, led by Sir John Oldham, seeks to improve clinical outcomes and experience for patients with long term conditions in England by improving the quality and productivity of services for these patients and their carers, so that they can access higher quality, local, comprehensive community and primary care. This will, in turn, slow disease progression and reduce the need for unscheduled acute admissions by supporting people to understand and manage their own conditions. The workstream seeks to reduce unscheduled hospital admissions by 20%, reduce length of stay by 25% and maximise the number of people controlling their own health through the use of supported care planning, and it aims to replicate this performance nationally by 2013/14.

One of the three key principles of the QIPP LTC workstream is the adoption of risk stratification (a form of predictive modelling) to ensure that commissioners understand the needs of their population and manage those at risk. This will assist in preventing disease progression and allow interventions to be targeted and prioritised.

2.2. QIPP Digital Technology Team

QIPP Digital Technology (QDT) has been established as a function under the QIPP programme to assist QIPP national workstreams and local teams in exploiting digital technology to accelerate delivery of their QIPP priorities [Ref: 2]. The function focuses on helping to overcome digital challenges and barriers, accelerate delivery, spread initiatives and maximise the potential value from technology-enabled healthcare delivery.


A core principle of this operating model is to ensure that any work conducted, or national enablers provided, has direct traceability back to key business drivers, and that work is only undertaken where there is a local 'pull' for national assistance.

2.3. Approach for the Development of this Guidance

Within health and social care, predictive modelling is often discussed in terms of risk stratification and risk profiling. It is important to understand that these are simply specific applications of the more general approach of predictive modelling to these domains, and that the field of predictive modelling has its own more generic concepts and terminology. Predictive modelling itself is normally viewed as a part of data mining. This guide therefore frames many of its considerations in a top-down, data mining to predictive modelling fashion.


3. Summary

This guide provides a set of detailed technical considerations and recommendations for predictive modelling solution designers and implementers, based on a three stage framework for predictive modelling (Model Building or Selection – Prediction Tooling – Prediction Application). It includes technical guidance on:

• Model building or selection – covering: data sources, data quality, data sharing, data linkage, data storage, model creation, model internal validity, model external validity, model representation (including a full Predictive Model Markup Language representation of the Combined Predictive Model) and model maintenance.

• Prediction tooling – covering: model choice, marshalling input data, tool integration and implementation.

• Prediction application – covering: prediction representation, prediction patient record status, storage and downstream integration.

Consideration is also given to best practice for prediction user adoption.


4. A Short Introduction to Predictive Modelling

Predictive modelling can be used to estimate the probability of a range of future events happening for individuals. These events can relate to an individual's health or social need. For example:

• Jack has a 1.2% chance of having an emergency admission in the next 12 months
• Jill has a 0.5% chance of developing diabetes

Predictive modelling can also be used to estimate the probability of a range of future events happening within populations. Individuals with similar probabilities within a population can then be grouped together into strata (as is done in the well known Kaiser Pyramid in health care). For example:

• 3.2% of people in Leeds have a 60% chance of having an emergency admission in the next 12 months

Many of these events have a direct impact on the need, form and timing of health and social care delivery. Therefore, by being able to predict future events, interventions can be planned and executed to:

• Optimise individual health and social outcomes – ensure an individual has the best treatment and care in a proactive rather than reactive manner
• Optimise population health and social outcomes – help a care commissioner balance their population's needs with resources, so they can drive down health inequalities and target interventions to the individuals who are most at risk
• Increase efficiency and effectiveness of health and social service delivery – care providers and commissioners can reduce ineffective delivery by matching delivery to predicted need at an individual level
• Reduce cost of health and social service delivery – care providers and commissioners can reduce cost by cutting inappropriate and ineffective service delivery to individuals who will not benefit, while minimising future costs through early identification and intervention for individuals who will benefit

Predictive modelling is therefore applicable in both commissioner and provider contexts, potentially across all health and social care settings and all health and social care professionals.

Technically, predictive modelling generates a "probability" of an event happening in the future. A probability ranges from 0.0 (will never happen) to 1.0 (is absolutely guaranteed to happen).


Any probability in between is an estimate, not a guarantee. The subject of probability is a complex one and, surprisingly, there is no agreement among statisticians and mathematicians as to what probability actually means or how it should be interpreted1. A probability value might therefore best be viewed as the likelihood2 of something happening in the future. A value close to 1.0 is very likely to occur, but there is no guarantee it will. A value close to 0.0 is very unlikely to occur, but again there is no guarantee it won't. For a value close to 0.5 it is very difficult to decide whether it is likely or unlikely to occur.

For example, the probability of matching all six numbers in the UK National Lottery is approximately 0.00000007, about 1 in 14 million (but only if you actually buy a lottery ticket), which looks very unlikely to occur. However, the probability of flipping heads in a coin toss is 0.5, which makes it difficult to decide whether it is likely or unlikely. Probabilities can also be expressed as a percentage; for example, the probability of flipping heads in a coin toss is 50%. Sometimes the word "chance" is used instead of probability.

The term "risk" is often used when referring to the probability of an adverse or unwanted event happening in the future. In health care, diseases or symptoms of disease would be regarded by most people as unwanted events in their lives, and the probability or chance of them happening would be seen as a risk. A consequence of morbidity is usually treatment and intervention from health organisations. Such treatments or interventions can be viewed as proxies or indicators for the causal morbidity, and thus are often viewed as unwanted events in their own right with associated risk. For example, most people and health professionals would regard emergency admission to a hospital, whatever the underlying cause, to be an unwanted event.

1 For example the Frequentist and Bayesian schools of probability.
2 Likelihood is used here in the common vernacular sense. Statistics has its own specific definition of likelihood.

Probability and risk can be presented as an absolute value or a relative value. For example:

• Jack has a 1.2% absolute risk of having an emergency admission in the next 12 months

• Jack has a 0.001% relative risk of having an emergency admission in the next 12 months compared to the average risk of emergency admission in the next 12 months for a specific CCG/PCT population
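Relative measures can be expressed in more than one way; the sketch below uses the common ratio interpretation (individual risk divided by the population average), with purely illustrative figures that are not taken from any real population:

```python
# Illustrative figures only -- not from any real population or model.
jack_risk = 0.012            # absolute risk: a 1.2% chance of emergency
                             # admission in the next 12 months
population_avg_risk = 0.008  # average risk across a hypothetical
                             # CCG/PCT population

# Ratio interpretation of relative risk: how many times the
# population-average risk this individual carries.
relative_risk = jack_risk / population_avg_risk

print(f"Absolute risk: {jack_risk:.1%}")                       # 1.2%
print(f"Relative risk: {relative_risk:.2f}x population average")  # 1.50x
```

Whichever presentation is chosen, the reference population must be stated alongside the figure, as the bullet points below emphasise.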

When making any statement of probability or risk it is important to be clear about the following:

• The subject the probability applies to – in the examples above it is Jack.
• The event the probability is predicting – in the first example above it is emergency admission in the next 12 months. Most events have some time constraint placed on them; the example is constrained to the next 12 months. Some events have an implicit time constraint; for example, developing diabetes is implicitly constrained to the mortality of the individual. For explicit time constraints it is often important to know when the time constraint starts – this is often when the prediction was made.
• The type of value – a pure probability or a percentage, for example.
• The value of the probability/risk.
• If the probability/risk is a relative measure, what it is relative to, as illustrated in the second example above.

There are many ways to make a prediction in health or social care. Many health professionals make intuitive predictions based on their own skills and experience. There are well known issues with the reliability, accuracy, repeatability and scalability of such intuitive predictions, which makes predictions based on a formal model or algorithm preferable.

Formal predictive models are created by analysing a large set of historical data that contains many different data items that could act as input variables (sometimes called predictors), together with the actual values of the event being predicted (the output variable), for a large number of individuals. The analysis selects the subset of input variables that best predict the output variable values and creates an algorithm or model that describes how to calculate a predicted outcome value from the input variable values. Once a predictive model has been created it can be implemented (coded) into a prediction tool. Input variable values can then be fed in for an individual, and the tool will calculate and output the risk of the event happening for that individual.
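A prediction tool of this kind can be sketched as a simple scoring function. The input variables and coefficients below are illustrative assumptions, not those of any published model; a real model's coefficients come from the data mining analysis of a training dataset:

```python
import math

# Hypothetical coefficients for illustration only -- a real model's
# coefficients come from analysing a large historical training dataset.
INTERCEPT = -5.0
COEFFICIENTS = {
    "age": 0.03,                      # years
    "prior_emergency_admissions": 0.9,
    "ltc_count": 0.5,                 # number of long term conditions
}

def predict_risk(individual):
    """Return the probability (0.0-1.0) of the outcome event,
    e.g. emergency admission in the next 12 months."""
    score = INTERCEPT + sum(
        COEFFICIENTS[name] * individual.get(name, 0.0)
        for name in COEFFICIENTS
    )
    # Logistic function maps the raw score onto a 0.0-1.0 probability.
    return 1.0 / (1.0 + math.exp(-score))

jack = {"age": 45, "prior_emergency_admissions": 0, "ltc_count": 1}
print(f"Jack's risk: {predict_risk(jack):.1%}")
```

The essential point is that the tool is a deterministic calculation: the same input variable values always produce the same risk, which is what gives model-based prediction its repeatability advantage over intuitive prediction.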


5. A Predictive Modelling Solution Framework

To understand how predictive modelling solutions can be developed and implemented, a simple three stage framework is presented. This framework will be used to contextualise the technical considerations and recommendations that follow.

[Diagram: three stage framework – data sources feed Stage 1 (model building and maintenance), which produces a model; patient data feeds Stage 2 (prediction tooling), which produces predictions; in Stage 3 (prediction application) a clinician acts on predictions, which may result in interventions; all stages are underpinned by clinical leadership and adoption]

Stage 1 – Model Building or Selection

Some organisations may decide to build their own predictive model because: it will reflect the specific characteristics of their population; there is no available model for the event of interest; and/or they want control over model maintenance. To build a model they will need access to appropriate data sources and a method of analysing the data to produce a predictive model and test its accuracy.

Models are created from historical population data, and the characteristics of populations change over time, so a model once created will not stay valid indefinitely. Any model will therefore need to be recalculated or maintained over time. There is in fact a clear feedback loop between the effective use of the model and its validity: if the aim of the model is to predict and intervene to reduce the outcome event in the population, then the more successfully it is used, the greater the impact on the population, which in turn will reduce the accuracy of the model going forward.

Some organisations may not have the resources, skills or desire to build their own predictive model; instead they will want to select an existing model.

Stage 2 – Prediction Tooling

Having built or selected a predictive model, it must then be operationalised in the form of a prediction tool that can input the appropriate patient data, calculate the probability of the outcome and output it. As with model building, some organisations may want to develop or commission their own bespoke tooling, while others may decide to use commercial or third party tools.

Stage 3 – Prediction Application

Once predictions have been made they need to be stored, managed and acted on by a clinician. This may result in interventions for a patient where appropriate. Key to the success of any predictive modelling solution is user adoption, and in health care this is often driven by clinical leadership.


5.1. Conceptual Solution Architecture

Conceptually, an end to end predictive modelling solution consists of:

• Data sources that feed into a prediction tool
• A prediction tool implementing a prediction model to calculate and output outcome predictions
• Business applications, such as risk stratification and case finding, that process the outcome predictions

A prediction model is created by analysing data sources. Either an existing published prediction model, such as the Combined Predictive Model (CPM), is used or a new bespoke prediction model is created.

[Diagram: data sources are analysed by data mining to produce a prediction model; the prediction tool implements the model, takes data sources as input and feeds business applications such as risk stratification and case finding]


5.2. Logical Solution Architecture

An example of the logical architecture of an end to end predictive modelling solution is:

• A Database (DB) or Data Warehouse (DW) platform to store and manage data feeds, predictions and business application processing of predictions.
• Extract Transform Load (ETL) DB/DW services to get and prepare source data ready for prediction processing.
• A bespoke prediction tool interfacing to the DB/DW.
• Bespoke business applications interfacing to the DB/DW and also using a Business Intelligence (BI) platform to provide any required analytical functionality.
• A portal front end to the prediction tool and business applications, which provides security and access control.

[Diagram: data sources flow through Extract Transform Load services into the Database / Data Warehouse; the prediction tool, the risk stratification and case management applications and the Business Intelligence platform all interface to the DB/DW, fronted by a portal]

In many organisations physical services for DB/DW, ETL, BI and Portal will already exist.
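The logical flow above (ETL into a DB/DW, a prediction tool writing predictions back, business applications querying them) can be sketched end to end. The schema, table names and toy model below are hypothetical illustrations, not drawn from any NHS system:

```python
import sqlite3

# In-memory database stands in for the DB/DW platform.
# Table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_feed (person_id TEXT, age INTEGER)")
conn.execute("CREATE TABLE predictions (person_id TEXT, risk REAL)")

# ETL step: load prepared source data ready for prediction processing.
conn.executemany(
    "INSERT INTO source_feed VALUES (?, ?)",
    [("P001", 72), ("P002", 35)],
)

def toy_model(age):
    """Stand-in for the real prediction tool: risk grows with age."""
    return min(age / 100.0, 1.0)

# Prediction tool: read input rows from the DB/DW, calculate outcome
# predictions and store them back for downstream applications.
for person_id, age in conn.execute("SELECT person_id, age FROM source_feed"):
    conn.execute(
        "INSERT INTO predictions VALUES (?, ?)", (person_id, toy_model(age))
    )

# Business application (e.g. risk stratification): query the stored
# predictions, highest risk first.
for row in conn.execute(
    "SELECT person_id, risk FROM predictions ORDER BY risk DESC"
):
    print(row)
```

The design point is the decoupling: the prediction tool and the business applications never talk to each other directly, only through the shared DB/DW, which is why existing DB/DW, ETL, BI and portal services can usually be reused.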


6. Model Building or Selection

This section presents issues to consider when building or selecting a predictive model, and provides appropriate technical guidance.

6.1. Data Sources

Any predictive model must be built from a sufficiently large corpus of data. In data mining terms this data consists of examples from which a data mining analysis can extract patterns – a model. An example represents the entity that is of interest, the entity of which you wish to ask questions; in health and social care the example normally represents an individual. Associated with an example is a collection of attributes, where an attribute measures a specific aspect of an example. In health care, for example, the age of an individual and their gender are attributes commonly used in most models.

The collection of examples is normally called the training dataset, as the data mining analysis (also sometimes referred to as machine learning) uses the dataset for training the model. Choosing which attributes to include in an example, and which examples to collect into a training dataset, is outside the scope of this guidance; in practical terms such decisions will be constrained by the available data sources. The attributes selected must include the attribute(s) that represent the desired outcome(s), for example emergency admission in the next 12 months. Other attributes represent mining fields, from which the data mining algorithm will select a subset for inclusion in the model as predictor attributes. Additional supplementary attributes will also normally be included that are used to identify or annotate individual examples but are not used by the data mining algorithm.

In health and social care the scope of the training dataset is normally set by geographical boundaries or commissioner responsibility. This usually defines a baseline population of individuals for which appropriate attributes need to be sourced. For geographical boundaries, census or electoral roll data sources could be considered, as well as the DH Personal Demographics Service (PDS).
For any of these sources some individuals may be missing, for example because they have not registered on their local electoral roll. The Electoral Commission [Ref: 5] estimates, based on Census records, that the completeness of the electoral registers was 91–92% in 2000. However, completeness varied across geographical areas and social groups; for example, in their study areas under-registration was notably higher than average among 17–24 year olds (56% not registered) and black and minority ethnic British residents (31%). For health and social care, under-representation of these social groups in the training dataset could well lead to significant biases in the predictive model if causes of morbidity and mortality are related to social, environmental, economic or ethnic dispositions.

For more information on the Census see: http://www.ons.gov.uk/ons/guide-method/census/2011/index.html
For more information on Electoral Rolls see: http://www.electoralcommission.org.uk/elections/voter-registration
For more information on PDS see: http://www.connectingforhealth.nhs.uk/systemsandservices/demographics

For commissioner responsibilities, existing provider registration data sources could be considered. For health commissioners such as Primary Care Trusts (PCTs) and the new Clinical Commissioning Groups (CCGs), patients registered with their constituent GP practices can be used. This data can be sourced directly from individual GP systems, centrally from NHAIS systems that hold patient registration data for PCT GP practices, or from PDS, which as the national electronic database of NHS patient demographic details holds data such as name, address, date of birth and NHS Number.

Patient registration data from GP systems can be accessed using each system's proprietary interface. As this differs across vendors' systems, where a population needs to be defined based on data from several different types of GP system the system integration effort to retrieve and collate data increases. NHAIS has the advantage of providing the same interfaces to all GP patient registration data; specifically, NHAIS provides the Organisation Links interface and the Open Exeter Web Service interface, which allow both pushing and pulling of registration data. PDS also offers a single interface to all national GP patient registration data.
For more information on NHAIS see: http://www.connectingforhealth.nhs.uk/systemsandservices/ssd/prodserv/index_html#NHAIScore
For more information on Organisation Links see: https://nww.openexeter.nhs.uk/nhsia/genhelp/links.jsp
For more information on Open Exeter see: http://www.connectingforhealth.nhs.uk/systemsandservices/ssd/prodserv/vaprodopenexe/

Within health care modelling, mining fields will often be sourced from many different delivery areas such as:

• In Patient (IP) encounters
• Out Patient (OP) encounters
• Accident and Emergency (AE) encounters


• General Practice (GP) consultations
• Prescription events

Data relating to these different types of encounter, for the population to be included as examples in the training dataset, will be held in multiple health care provider systems. The existence of a relationship between a member of the population and any provider is difficult to predetermine, except for their GP practice. It is therefore usually impractical to source data directly from providers, as this would entail exhaustively polling every possible provider system to see if it held any relevant data, with the consequent technical issues of system discovery and integration. A better approach is to use centralised data sources which have collected data from individual providers.

For IP, OP and AE data the Secondary Uses Service (SUS) provides such a centralised data source. For GP data there is currently no centralised data source, although the General Practice Extraction Service (GPES) may be able to provide this in the near future. GP data therefore needs to be sourced from individual GP systems, either through GP system proprietary interfaces or by using MIQUEST. MIQUEST provides a single interface to all GP systems that allows you to formulate data queries in a Health Query Language (HQL) to retrieve specific patient medical data.

For prescription data, health provider systems such as GP practice and hospital clinical systems will contain prescription information within the medical records of individuals. For social care and community care data there is no centralised data source, so the information systems of individual providers will need to be used. Currently there are no operational predictive models that use social care data. A standard community information data set is expected to be defined soon by the ISB.
For more information on SUS see:
http://www.connectingforhealth.nhs.uk/systemsandservices/sus
For more information on GPES see:
http://www.ic.nhs.uk/gpes
For more information on MIQUEST see:
http://www.connectingforhealth.nhs.uk/systemsandservices/ssd/prodserv/vaprodmiquest/
For more information on the proposed community care data set see:
http://www.ic.nhs.uk/services/in-development/community-information-programme/community-information-data-set-cids
http://www.isb.nhs.uk/documents/isb-1510/amd-25-2010/index_html


Irrespective of which data sources are used it is recommended that a comprehensive data dictionary is constructed for each individual data source. This will be needed for subsequent processing of data in the model building process. A data dictionary will consist of a list of data dictionary items. Each item corresponds to an attribute and has the following properties:

• Name – this is often just the name or identifier of the data item or field in the data source; it should be unique within the data dictionary for the data source.

• Description – attribute names are sometimes cryptic, therefore it is a good idea to have a textual description of what the attribute represents.

• Type – this defines the data type, for example numeric or alphanumeric.

• Size – most data sources will set a maximum size for an attribute, for example a character string of up to 50 characters.

• Missing Value – some data sources may be able to distinguish when a value for an attribute is missing, for whatever reason, and record this fact. This is often done by representing the missing value with a specific sentinel value that the attribute could not normally take.

• Unknown Value – some data sources may be able to record the fact that an attribute value is unknown. For example, an attribute might represent "age of onset of condition", which in a source system is elicited directly from a patient; the patient may not know, and the source system might record it as unknown. Representing an unknown value is often done by using a specific value that the attribute could not normally take. However, where an attribute is coded (see the next property) the code set often includes a code for unknown.

• Codes – some attributes represent code values. The associated code set should be documented; this normally consists of a description for every code value. However, some code sets may be very large, for example SNOMED, in which case an explicit reference to the code set will be sufficient.

Note that separate data dictionaries are not normally maintained for each data source; instead they are collapsed into one data dictionary with an additional Data Source property. Data dictionaries can be collated using a range of software tools, from specialised data modelling software to a simple Excel spreadsheet.
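The single-dictionary-with-a-Data-Source-property approach can be kept in any tool; a minimal machine-readable sketch in Python is shown below. The source names, attribute names and values are purely illustrative, not a mandated format.

```python
import csv
import io

# One data dictionary covering all sources, with a "source" column as
# suggested above. Each row documents a single attribute.
DICTIONARY_CSV = """\
source,name,description,type,size,missing_value,codes
SUS,EPIORDER,Episode order within spell,numeric,2,99,
GP,SEX,Patient gender,alphanumeric,1,,"M=Male;F=Female;U=Unknown"
"""

def load_dictionary(text):
    """Parse the CSV into a dict keyed by (source, attribute name)."""
    rows = csv.DictReader(io.StringIO(text))
    return {(r["source"], r["name"]): r for r in rows}

dictionary = load_dictionary(DICTIONARY_CSV)
```

Keying on (source, name) preserves the uniqueness rule for attribute names within each data source while still allowing the same attribute name to appear in more than one source.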


Data from a data source will have a data structure or data model. It is recommended that this is documented, as it will also be needed for subsequent processing of data in the model building process. In most cases it can be described in terms of the familiar Entity Relationship (ER) representation, which defines the data entities with their attributes and the relationships, roles and cardinalities between them. ER representations can also be recorded using a range of software tools, from specialised data modelling software to a simple Visio diagram as shown below.

Patient
    PK: NHS Number
    Forename, Surname, DOB, Gender

Care Episode
    PK: NHS Number (FK to Patient), Episode Identifier
    Start Date, End Date, Care Setting

Diagnoses
    PK: NHS Number, Episode Identifier (FK to Care Episode)
    Diagnosis Code

Treatments
    PK: NHS Number, Episode Identifier (FK to Care Episode)
    Treatment Code, Treatment Setting
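An ER model such as the example above can be realised directly as relational staging tables. The sketch below uses Python's built-in SQLite; the column types and snake_case names are illustrative choices, not a mandated schema.

```python
import sqlite3

# Create the four entities from the example data model. The composite
# primary key and the foreign keys mirror the PK/FK markings above.
DDL = """
CREATE TABLE patient (
    nhs_number TEXT PRIMARY KEY,
    forename TEXT, surname TEXT, dob TEXT, gender TEXT
);
CREATE TABLE care_episode (
    nhs_number TEXT REFERENCES patient(nhs_number),
    episode_identifier TEXT,
    start_date TEXT, end_date TEXT, care_setting TEXT,
    PRIMARY KEY (nhs_number, episode_identifier)
);
CREATE TABLE diagnoses (
    nhs_number TEXT, episode_identifier TEXT, diagnosis_code TEXT,
    FOREIGN KEY (nhs_number, episode_identifier)
        REFERENCES care_episode(nhs_number, episode_identifier)
);
CREATE TABLE treatments (
    nhs_number TEXT, episode_identifier TEXT,
    treatment_code TEXT, treatment_setting TEXT,
    FOREIGN KEY (nhs_number, episode_identifier)
        REFERENCES care_episode(nhs_number, episode_identifier)
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

Recording the model as executable DDL has the side benefit that the documentation and the staging database cannot drift apart.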

Data received from a data source will have a physical format. It is recommended that this is documented, as you need to understand the format to be able to physically load the data into your data storage. This format could be for example:

• XML file or message
• CSV text file
• EDI message

Data received from a data source will be transported using a specific mechanism. It is recommended that this is documented, as you need to understand how to physically move data from a data source system to your own data storage. Transport mechanisms could be for example:


• FTP
• Email
• Web services
• Physical media (CD/DVD)

Consideration should be given to access authentication and transport security for all mechanisms.
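Whatever the format and transport, received extracts are typically loaded into staging tables before quality checking and linkage. A sketch using Python's standard library is shown below; the CSV layout and field names are invented for illustration.

```python
import csv
import io
import sqlite3

# An illustrative CSV extract as it might arrive from a data source.
RAW = """nhs_number,admission_date,diagnosis_code
001,2011-04-02,J44
002,2011-05-17,E11
"""

# Load the rows into a staging table for later checking and linkage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_ip "
             "(nhs_number TEXT, admission_date TEXT, diagnosis_code TEXT)")
reader = csv.reader(io.StringIO(RAW))
next(reader)  # skip the header row
conn.executemany("INSERT INTO staging_ip VALUES (?, ?, ?)", reader)
loaded = conn.execute("SELECT COUNT(*) FROM staging_ip").fetchone()[0]
```

In practice the `io.StringIO` wrapper would be replaced by opening the transferred file, with the documented data dictionary driving the column definitions.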

Summary of Technical Guidance for Data Sources

When building a predictive model:

1. Create and maintain a comprehensive data dictionary for all data sources.

2. Create an ER data model for all data sources.
3. Document data formats for all data sources.
4. Document data transport methods for all data sources.

When selecting a predictive model:

1. Ask which types of data (IP, OP, GP etc.) were used within the training dataset.

2. Ask which data sources were used (e.g. SUS) for the training dataset.
3. Ask when the data was extracted for the training dataset.
4. Ask the scope of the data in the training dataset, specifically the historical scope (for example 4 years of data) and boundary scope (for example all GP registered patients in CCG X).

6.2. Data Quality

The accuracy of a predictive model will be affected by the quality of the data taken from the data sources. Data quality is determined by the amount of missing values and the amount of inaccurate or wrong values in the training dataset. Most data mining analysis methods do not deal well with missing values; however, there are several common approaches used to eliminate missing values from the training dataset:

• Ignore examples containing missing values
• Fill the missing values manually
• Substitute the missing values with a global constant or the mean of the examples
• Estimate the most probable value to fill in the missing values – normally called imputation

It is recommended that a quantitative assessment of the amount of missing data is carried out and some form of imputation is used if required.


The issue of inaccurate or wrong attribute values is much more difficult to determine. You should not assume that any data extracted from a data source is clean and error free (whatever the reassurances given by the source organisation). You should carry out your own due diligence to check the quality of the data you are going to use to create the training dataset.

All data items within a data source will have a set of associated validation and credibility checks. A validation rule defines constraints that values for a data item should meet, including constraints between values for different data items. Examples are:

• Value for data item X should be within a specific numeric range
• Value for data item Y should be in a set of specific values (code set)
• IF value for data item X is defined THEN it must be greater than the value for data item Z

Where the same data item value is present in more than one data source, you can also carry out cross validation across data sources. This acts as a measure of data accuracy.

Whereas a validation rule is either passed or failed, a credibility check indicates where a data item value might be in error and prompts further investigation. For example:

• Value for data item X for most individuals will normally be within a specific range

Validation rules and credibility checks are often implicit; it is therefore recommended that they are explicitly documented.

Validation rules and credibility checks can be implemented in a variety of ways. The most common approach is to load the raw data from the sources into staging tables within a database and then implement the validation rules and credibility checks in bespoke code. The code traverses the database tables, running each record through the appropriate rules and checks. The detailed results, as well as aggregate counts, are written to some form of data quality report. The code can be implemented either within the database itself, for example as stored procedures, or in an external application, for example a Java or .NET application.

Any identified errors in the data can be addressed in a similar manner to missing values, as outlined above. This process is often called data cleaning. As with data quality checking, it is most often implemented as bespoke code running against staging tables in a database.


Summary of Technical Guidance for Data Quality

When building a predictive model:

1. Measure the quantity of missing values in all data sources.
2. If required, deal with missing values using imputation.
3. Document validation rules and credibility checks for all data sources.
4. Implement a data quality checker based on the validation rules and credibility checks to produce a detailed data quality report.
5. If required, implement data cleaning.

When selecting a predictive model:

1. Ask about the quantity of missing values in the data sources used.
2. Ask what method was used to handle missing values.
3. Ask what quality checking was carried out on the data sources and ask to see the data quality report.
4. Ask if the data sources were cleaned and the method used.

6.3. Data Sharing

Using data sources to create a training dataset will involve sharing data between organisations within and across health and social care. Such data sharing will require careful consideration of Information Governance (IG) obligations, constraints and controls. Key to these issues is the classification of data as patient identifiable data (PID) or anonymous patient-based data, as the former requires considerably tighter IG. Detailed consideration of data sharing is outside the scope of the technical guidance offered by this document. However the QDT is carrying out an investigation of data sharing within risk stratification which will produce a report outlining what existing IG guidance should be followed. A list of informative IG references is provided below:

HM Government (2008). Information sharing: Guidance for practitioners and managers.
https://www.education.gov.uk/publications/standard/publicationdetail/page1/DCSF-00807-2008
HM Government (2009). Information Sharing: Further guidance on legal issues.
https://www.education.gov.uk/publications/eOrderingDownload/Info-Sharing_legal-issues.pdf
Information Commissioner's Office. Data Sharing Code of Practice.
http://www.ico.gov.uk/for_organisations/data_protection/topic_guides/data_sharing.aspx
Ministry of Justice (2003). Public Sector Data Sharing: Guidance on the Law.
http://www.justice.gov.uk/downloads/guidance/freedom-and-rights/data-sharing/annex-h-data-sharing.pdf
NHS Information Governance Toolkit Knowledge Base.
https://nww.igt.connectingforhealth.nhs.uk/KnowledgeBase.aspx?tk=409164355982355&lnv=8&cb=1f3405ee-2033-4b4c-8bd2-a7af931aab4e&sViewOrgType=0&sDesc=View+Entire+Knowledge+Base
Richard Thomas and Mark Walport (2008). Data Sharing Review Report.
http://www.justice.gov.uk/reviews/docs/data-sharing-review-report.pdf
http://www.justice.gov.uk/reviews/docs/data-sharing-review-annexes.pdf

GP systems use the Read coding system as a standard terminology for describing the care and treatment of patients. There are two variations of the Read coding system, READ2 and CTV3, and most GP systems support local codes in addition to the standard codes. The Read coding system includes specific codes to record a patient's consent to share data in different contexts. These include:

Read code   Description
9Ndl        Implied consent for core Summary Care Record dataset upload.
9Ndm        Express consent for core Summary Care Record dataset upload.
9Ndn        Express consent for core and additional Summary Care Record dataset upload.
9Ndo        Express dissent for Summary Care Record dataset upload.
93C0        Consent given for upload to local shared electronic record.
93C1        Refused consent for upload to local shared electronic record.
9Nd7        Consent given for electronic record sharing.
9NdG        Consent given to share patient data with specified third party.
9NdH        Declined consent to share patient data with specified third party.
9NdJ        Consent withdrawn to share patient data with specified third party.
9NdR        Unable to consent to information sharing.

Some contexts are precisely defined (for the SCR, for example, see: http://www.connectingforhealth.nhs.uk/systemsandservices/scr/documents/consentcodes.pdf). Other contexts are more ambiguous and not specific to the data sharing requirements of a specific model building exercise or subsequent data marshalling process (see section 7.2 later). Therefore it is recommended that where you do use Read consent codes as a means of checking consent to share at a patient level, there is a common understanding in place across data providers, data processors and data consumers as to what consent these codes are being used to represent. You should also be aware that some


data extraction mechanisms may be cognisant of these codes and subsequently block extraction of non-consenting patients' records.

6.4. Data Linkage

The data taken from different data sources will relate to specific individuals. To construct the training dataset these data need to be linked together so that all data items that belong to a single individual are identifiable. In the simplest case this is achieved by constructing a training dataset that consists of a single table where each row represents an individual (example) and each column a data item (mining attribute). All data mining products will be able to analyse training datasets in this structure, while some products can also analyse training datasets in a relational format where multiple tables are linked by foreign keys. However it is recommended that you try to construct a single table training dataset to ensure you do not restrict your choice of data mining products.

Within health care, an individual is normally uniquely identified by their NHS number, and most health care records will be indexed by NHS number; it is therefore recommended that this is used for data linkage. However, not all health data may have an associated NHS number, and even when present it may be incorrect. NHS numbers can be verified or unverified, where the former has been checked against the Personal Demographics Service (PDS) to ensure the NHS number is associated with the correct individual based on demographic data. Some health records will include the verification status. The use of the NHS number in social care is currently not widespread; linkage of data within and across social care therefore usually relies on matching based on a minimum of:

• Surname/Last Name
• Date of Birth
• Gender

Data sharing agreements may place restrictions on access to personal identifiers within the data provided by data sources. At one extreme is the provision of fully anonymised data, which contains no data items that can be used to identify an individual. Such data is unusable for data linkage, as there is no method of linking data across multiple data sources. An intermediate position is pseudonymised data, where the personal identifiers have been removed and replaced with a unique identifier. The unique identifier can be linked back to the personal information via a key, which is held securely and separately from the pseudonymised data. There are basically three approaches to pseudonymisation:

1. A data source implements a local algorithm to apply pseudonymisation. Other data sources may implement their own different algorithm.

2. A data source implements a common algorithm to apply pseudonymisation. Other data sources implement the same algorithm.


3. A data source sends identifiable data to a central secure safe haven with clear identifiers using secure communications and pseudonymisation is applied centrally.

Approach 1 is impractical for data linkage, as pseudonymised data from multiple sources which have used different algorithms will have different unique identifiers for the same individual. Approach 2 is practical but requires all data sources to implement the same algorithm; this may not be technically or commercially feasible. Approach 3 is practical but requires investment in the provisioning of such a central service. For more information see: http://www.connectingforhealth.nhs.uk/systemsandservices/pseudo

It is recommended that the method of matching data is explicitly defined and documented to minimise the risk of mismatched data. If some data is mistakenly linked, the training dataset will contain invalid associations of data values which will affect the external validity of any generated predictive model. When linking data, some data records may not be matched, and depending on the context of the data sources involved this can either be ignored or recorded as a data quality issue.

Within the documented data model one entity is usually taken to represent the baseline population (for example the list of registered patients with a GP Practice). Data linkage normally works out from this baseline population across the other data records from the different data sources. A data source (for example inpatient data) may or may not have any data that relates to an individual in the baseline population, so not matching any data for a specific individual is not an issue. The same data source is also likely to have lots of data relating to individuals not in the baseline population, which again is not an issue. However, a data source may be expected to have a one-to-one relationship with the baseline population, and if no data relating to an individual in the baseline population is found this should be regarded as a data quality issue. It is recommended that a quantitative assessment of the amount of missing data is carried out during linkage.
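A common way of realising approach 2 is a keyed one-way hash, so that every source derives the same pseudonym for the same individual without the identifier being recoverable from the pseudonym alone. A sketch in Python follows; the shared key and the NHS number strings are purely illustrative.

```python
import hashlib
import hmac

# Approach 2 sketch: every data source applies the same keyed one-way
# hash, so the same NHS number yields the same pseudonym everywhere.
# The shared secret would be distributed and held securely; this value
# is purely illustrative.
SHARED_KEY = b"not-a-real-key"

def pseudonymise(nhs_number):
    """Derive a stable pseudonym from an identifier using HMAC-SHA256."""
    return hmac.new(SHARED_KEY, nhs_number.encode("utf-8"),
                    hashlib.sha256).hexdigest()

# Two sources hashing the same identifier produce linkable pseudonyms;
# different identifiers produce different pseudonyms.
same = pseudonymise("9434765919") == pseudonymise("9434765919")
different = pseudonymise("9434765919") != pseudonymise("9434765870")
```

The practicality of this approach rests entirely on every source holding the same key and normalising identifiers identically before hashing, which is exactly the coordination cost noted for approach 2 above.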
As discussed previously in the data quality section, there are various approaches to dealing with missing data. In data linkage, however, complete data records may be found to be missing, not just single data item values, so the recommended approach is to ignore examples with missing linkage data.

Data linkage can be implemented in a variety of ways. The most common approach is to build on the staging tables used for data quality checking. The data linkage logic is implemented in bespoke code. The code traverses the staging table containing the base population; for each record it finds, it tries to link with the appropriate records in the other staging tables. If a single table training dataset approach (as recommended) is being followed, this linked set of records is written out as a single flattened record to a database table that will hold the training dataset. The detailed results of the linkage, as well as aggregate counts, are written to some form of data quality report. The code can be implemented either within the database itself, as for example by stored


procedures, or in an external application, for example a Java or .NET application.
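The traversal from the baseline population out to the other sources, flattening each individual into one row, can be sketched as follows. The data is assumed already loaded into in-memory structures keyed by NHS number, and all attribute names and values are illustrative.

```python
# Baseline population and one source's data, keyed by NHS number.
baseline = [
    {"nhs_number": "001", "age": 72, "gender": "F"},
    {"nhs_number": "002", "age": 58, "gender": "M"},
]
inpatient = {"001": {"ip_admissions": 3}}  # no IP data for 002 is fine

def build_training_rows(baseline, inpatient):
    """Flatten each individual's linked records into one row."""
    rows = []
    for person in baseline:
        row = dict(person)
        # Left-join semantics: absence of inpatient data for a member
        # of the baseline population is not an error, so default to 0.
        row["ip_admissions"] = inpatient.get(
            person["nhs_number"], {}).get("ip_admissions", 0)
        rows.append(row)
    return rows

training = build_training_rows(baseline, inpatient)
```

The same left-join shape extends to further sources (OP, AE, GP) by adding one lookup per source; a count of unmatched lookups gives the quantitative assessment of missing linkage data recommended above.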

Summary of Technical Guidance for Data Linkage

When building a predictive model:

1. Create a single table training dataset.
2. Use NHS number and/or Last Name–Date of Birth–Gender to match data.
3. Document the method used for linking data across multiple data sources, including the form of pseudonymisation used if applicable.
4. Document how missing linkage data is handled; ignoring examples with missing linkage data is recommended.
5. Measure the quantity of missing linkage data.

When selecting a predictive model:

1. Ask what method was used to link data across multiple data sources.
2. Ask what method was used to handle missing linkage data.
3. Ask to see the data linkage quality report.

6.5. Data Storage

Data that represents raw data from the different data sources, cleaned data and the resultant training dataset needs to be stored and managed. Storage can be implemented as flat files or within a Relational Database Management System (RDBMS). As raw data will often be supplied in different physical formats, and both data quality checking and data linkage processing require flexible and uniform access to data, it is recommended that a RDBMS is used for data storage and management. There are many commercial RDBMS (for example Microsoft SQL Server and Oracle Database) and open source RDBMS (for example MySQL and FirebirdSQL) that are suitable for predictive model building.

As preparing the training dataset is a process that is not time or resource constrained, data storage need not have high availability, high reliability or high performance; a RDBMS running on a desktop computer may well be sufficient. However, security and backup/recovery should be addressed. If the data contains personal identifiers then normal good practice security measures should be applied: for example, password protect access to the computer, set user logins for the RDBMS, set user roles in the RDBMS, and set access control on the database within the RDBMS. Although availability and reliability are not important for data storage when model building, it is important to protect your data from accidents and failures. If you have spent many days carefully checking, cleaning and linking data into a training dataset, and do not want to lose all that work through a hard disk failure, you need to back up the data storage on an appropriate schedule. Also remember to test your recovery process.


Consideration should be given to the storage volumes required, which will be determined by the scope of the data being used to build the model. Even reasonably large data volumes of around, say, 1 terabyte can be comfortably accommodated on a modern desktop computer without the need to resort to specialised storage solutions such as Network Attached Storage (NAS) or Storage Area Networks (SAN). If storage space is an issue, most operating systems and RDBMS support data compression. This will reduce the physical hard disk space required but will introduce a performance penalty; however, performance is not important in the model building process. Other issues to consider when deciding on an appropriate RDBMS are:

• Import facilities – most raw data will be supplied in flat file format and will need to be loaded into staging tables within a database.

• Export facilities – the resultant training dataset is often exported to a simple text file for processing by the data mining software. You may also want to be able to distribute your training dataset to other model builders so they can test their own models against your data (see the Model External Validity section later on).

• Programming facilities – all RDBMS will support SQL and most provide means of packaging SQL into routines often called stored procedures.

• Application Programming Interfaces (API) – different programming languages and runtime systems support different database connection APIs, for example ODBC, OLE-DB, .NET Data Providers and JDBC.

Summary of Technical Guidance for Data Storage

When building a predictive model:

1. A RDBMS is recommended for data storage and management.
2. Robust security and backup/recovery for the RDBMS should be implemented.
3. The RDBMS should provide appropriate import, export and programming facilities.


6.6. Model Creation

The previous sections have discussed how a training dataset can be created; this dataset is then processed by data mining software to create a predictive model. The process to create a training dataset can be summarised as: source the data, check and clean its quality, link it across data sources, and store the result as a single table training dataset.

There is a wide range of different types of predictive model. These include:

• Association Rules
• Baseline Models
• Cluster Models
• General Regression
• K-Nearest Neighbours
• Naïve Bayes
• Neural Network
• Regression
• Ruleset
• Scorecard
• Sequences
• Text Models
• Time Series
• Trees
• Support Vector Machine

For each type of model the method of analysing the training dataset is different, and the structure of the model produced is also different. Logistic regression (a Regression model type) has been commonly used in health care; the PARR family of risk stratification tools and the more recent Combined Predictive Model (CPM) are examples of logistic regression models.


A logistic regression model is structured as a regression equation of the form:

Log odds = Intercept + Variable1 * Beta1 + Variable2 * Beta2 + … + VariableX * BetaX
Probability = 1 / (1 + exp(-Log odds))
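Applying these equations to score an individual can be sketched directly in code. The intercept and coefficient values below are made up for illustration, not taken from any real model.

```python
import math

def predict_probability(intercept, betas, values):
    """Score one individual with a logistic regression model.

    betas and values are aligned lists: one coefficient per predictor
    attribute value. All numbers used here are illustrative.
    """
    log_odds = intercept + sum(b * v for b, v in zip(betas, values))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical model: intercept plus coefficients for age and number
# of prior emergency admissions.
p = predict_probability(-4.0, [0.03, 0.8], [70, 2])
```

The same function applied to every row of the training (or scoring) dataset produces the risk scores used to rank a population.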

Here each Variable represents the value of a predictor attribute (for example gender), each Beta is a coefficient value to multiply it by (for example 0.3456), and the Intercept is a constant. By using the model's Beta and Intercept values with the Variable values for a specific individual, the probability of the outcome for that individual can be calculated.

There are many statistical debates about the strengths and weaknesses of the different types of predictive model which are outside the scope of this guidance. There is also some debate in the health care domain focused on the explanatory power of different model types. As stated previously, most health care predictive models are Regression models. This can in part be explained by the fact that most model builders working in this area come from a medical statistics background and are more familiar with statistical modelling than with the more generic data mining. This does however lead to some confusion, as regression analysis is also used to determine statistically significant relationships between different variables – which is then often used to infer causal relationships between variables. For predictive modelling, however, all we are concerned about is the ability of a model to predict accurately, not how it achieves it (apart from its internal validity, as discussed later). How some model types work, for example Neural Networks, is very hard to understand compared with, for example, regression equations. This is sometimes used as an argument to criticise some model types for lack of explanatory power.

There is a wide range of data mining products available that can be used to build predictive models, some of which are open source software. Below is a representative list of companies/projects that offer data mining software:

• Angoss – KnowledgeSTUDIO, KnowledgeSEEKER, StrategyBUILDER
• Augustus / Open Data Group – Augustus
• IBM – InfoSphere Warehouse
• KNIME – KNIME
• KXEN – Analytic Framework
• Microsoft – SQL Server
• MicroStrategy – MicroStrategy Data Mining Services
• Pervasive DataRush – Pervasive DataRush
• Rapid-I – RapidMiner
• Salford Systems – CART, TreeNet, Mars
• SAND Technology – SAND CDBMS
• SAS – SAS Enterprise Miner
• SPSS – Clementine, PASW Modeler, PASW Statistics, SPSS
• TERADATA – Teradata Warehouse Miner
• TIBCO Software – TIBCO Spotfire Miner
• Weka (Pentaho) – Weka

Choice of a suitable data mining product will be based on a variety of product features and local factors such as:

• The range of model types a product supports
• The range of training dataset input methods
• Authorised and supported products within your organisation
• Product licensing and support costs
• Existing product experience and skills within your organisation

A data mining product will process the training dataset and, based on the type of model selected as the target, will analyse the data to:

• Select which attributes best predict the outcome(s) – the predictors
• Calculate the algorithm that should be applied to the predictors – for example, in a regression model this will be the coefficients and intercept
• Output measures of the validity of the model
• Output the calculated predictive model

Although the model building process has been presented as a linear one of creating the training dataset followed by model creation, in practice it tends to be more iterative: based on the quality of the model, you may decide to revisit the creation of the training dataset, for example to try alternative data sources.

Summary of Technical Guidance for Model Creation

When building a predictive model:

1. The choice of which type of predictive model to use should be based on power to predict rather than explanatory power.
2. The choice of which data mining product to use should be based on required product features and fit to the local organisation's technical context.
3. Approach model creation as an iterative rather than linear process.

6.7. Model Internal Validity

Data mining products, when calculating a predictive model, will also generate a set of measures to allow you to assess the internal validity of the model. Internal validity refers to the internal consistency of, and conformance to modelling assumptions by, the generated model. Internal validity does not tell you anything about how accurate or useful the model is – this is termed external validity (see next section). A model that is not internally valid should not be used, irrespective of its external validity, as the model is not intrinsically sound. The relevant measures of internal validity depend on the type of model created and are outside the scope of this technical guidance. However, examples for regression models include:

• Multivariate distribution of data (normal distribution)
• Homoscedasticity
• No autocorrelation
• Relevant sample size and power
• Collinearity
• No multilevel modelling effects

As can be seen from the above example, professional statistical and/or data mining advice is often needed to verify the internal validity of a model.
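One of the checks listed above, collinearity, can be quantified with the variance inflation factor (VIF). The sketch below is illustrative only, and is not taken from this guidance: it uses NumPy, and the warning level of 10 is a commonly quoted rule of thumb rather than an official threshold.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (with an intercept term).
    """
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2) if r2 < 1.0 else float("inf"))
    return vifs

# Illustrative data: x3 is nearly collinear with x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x1, x2, x3])
for name, v in zip(["x1", "x2", "x3"], vif(X)):
    flag = "review" if v > 10 else "ok"   # VIF > 10: common warning level
    print(f"{name}: VIF = {v:.1f} ({flag})")
```

Here x1 and x3 would be flagged for review while x2 would not; in practice such output would feed into the formal internal validity report rather than being acted on automatically.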

Summary of Technical Guidance for Model Internal Validity

When building a predictive model:

1. Select the appropriate measures of internal validity.
2. Consider using professional statistical and/or data mining advice to assess the outputs from the measures.
3. Document a formal model internal validity report.

When selecting a predictive model:

1. Ask what type of model has been created.
2. Ask what measures of internal validity for the model have been produced.
3. Ask to see the model internal validity report.

6.8. Model External Validity

Model external validity refers to measuring the predictive accuracy of the created model. Most data mining products can measure this in a variety of ways. The simplest, and the one that seems to be most often used in health care modelling, is to take a portion of the training dataset as a test dataset. For example, it is common practice to collate a training dataset of four years of data for building a health care predictive model. The last year of data is then set aside, or held out, as the test dataset. This test dataset is then run through the model and the predicted outcomes compared with the actual outcomes recorded in the data. This is referred to as the hold-out estimate of predictive accuracy.
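The hold-out split described above can be sketched in a few lines. The record structure, field names and the four-year window below are illustrative, not prescribed by this guidance:

```python
from datetime import date

# Illustrative records: each case carries the date its outcome was observed.
cases = [
    {"id": 1, "outcome_date": date(2008, 5, 1)},
    {"id": 2, "outcome_date": date(2009, 7, 9)},
    {"id": 3, "outcome_date": date(2010, 3, 2)},
    {"id": 4, "outcome_date": date(2011, 6, 15)},
    {"id": 5, "outcome_date": date(2011, 11, 30)},
]

def holdout_split(cases, cutoff):
    """Hold out every case on or after the cutoff date as the test dataset."""
    train = [c for c in cases if c["outcome_date"] < cutoff]
    test = [c for c in cases if c["outcome_date"] >= cutoff]
    return train, test

# Four years of data (2008-2011); the last year is set aside as the test set.
train, test = holdout_split(cases, cutoff=date(2011, 1, 1))
print(len(train), len(test))  # 3 2
```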


There are more elaborate methods of measuring predictive accuracy based on re-sampling and cross-validation that are outside the scope of this technical guidance. Accepted measures of accuracy include:

• R-squared
• Positive predictive value
• Sensitivity
• Specificity and negative predictive value
• Various measures based on the Receiver Operating Characteristic (ROC) curve

For a description of these measures see [Ref: 4]. These measures are often used to compare the accuracy of different models. However, remember that these comparisons are only valid if the models are predicting the same outcome(s). As a model builder it is also recommended that you consider making your training dataset and test dataset available to other model builders. They can then measure the predictive accuracy of their models directly against your data. Consideration must obviously be given to appropriate anonymisation of the data and its structure/format. As recommended previously, the training dataset is best created as a single table as this will aid portability between different data mining products.
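Several of the measures listed above fall straight out of a confusion matrix. A minimal sketch follows; the classification threshold of 0.5 and the test values are illustrative choices, not recommendations from this guidance:

```python
def accuracy_measures(actual, predicted_prob, threshold=0.5):
    """Sensitivity, specificity, PPV and NPV from binary outcomes and
    predicted probabilities, classified at a simple threshold."""
    tp = fp = tn = fn = 0
    for outcome, prob in zip(actual, predicted_prob):
        predicted = prob >= threshold
        if outcome and predicted:
            tp += 1
        elif outcome and not predicted:
            fn += 1
        elif not outcome and predicted:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),  # proportion of actual events predicted
        "specificity": tn / (tn + fp),  # proportion of non-events predicted
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

actual = [1, 1, 0, 0, 1, 0, 0, 1]
probs = [0.9, 0.8, 0.4, 0.2, 0.3, 0.6, 0.1, 0.7]
m = accuracy_measures(actual, probs)
print(m)  # all four measures are 0.75 for this toy data
```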

Summary of Technical Guidance for Model External Validity

When building a predictive model:

1. Select the appropriate measures of external validity.
2. Document a formal model external validity report.
3. Make your training and test datasets (suitably anonymised) available to other model builders.

When selecting a predictive model:

1. Ask what measures of external validity for the model have been produced.
2. Ask to see the model external validity report.
3. Ask if the training dataset and test dataset are available (suitably anonymised) for use by other model builders.

6.9. Model Representation

Most data mining products will output a created model in a proprietary format. This may range from a proprietary binary file format to a simple on-screen


textual description. Such proprietary formats of model representation make it difficult to describe and understand models in a consistent way, and inhibit the sharing of model descriptions and subsequent interoperability. What is needed is a standardised model representation format that allows the unambiguous description of all model types and that can be read by both people and systems. Fortunately such a standard exists in the data mining community: the Predictive Model Markup Language (PMML). PMML is an open standard maintained by the Data Mining Group (see http://www.dmg.org/). PMML uses XML to represent all the components of a model: header metadata, a data dictionary of model variables, and the model definition itself.

It supports the full range of model types previously outlined, including regression, neural networks and decision trees. A software product can support PMML as a producer and/or a consumer. Most of the data mining products listed previously support PMML as a producer, so created predictive models can be saved in PMML format. It is recommended that as a model builder you represent your predictive model in PMML format. Note that some commercial predictive models are developed and then sold as black-box predictive tools, where customers and users are restricted from knowing how the underlying model works. In these situations vendors are unwilling to share a model representation in PMML or any other format.


Although PMML can fully represent a predictive model, it is recommended that as a model builder you also provide a data dictionary document describing the predictors and outcomes, to aid model users in marshalling data for input to any prediction tool that uses the model. As a working example of PMML, the CPM as documented in [Ref: 6] has been taken and turned into a PMML representation, as shown in Appendix A. The PMML file begins with a standard header:

<Header copyright="(c) Crown Copyright 2011" description="Combined Predictive Model">
  <Application name="COMBINED PREDICTIVE MODEL FINAL REPORT AND TECHNICAL DOCUMENTATION" version="1.0"/>
  <Annotation>A PMML description of the Combined Predictive Model, taken from the final report published in 2006</Annotation>
  <Timestamp>02/12/2011</Timestamp>
</Header>

It is then followed by a data dictionary which defines the variable types that can be used in the model. Note that a new CPM variable called TXID has been added. This is used to represent a Transaction ID to uniquely identify an individual when creating a prediction. It can be set to any value by the system passing data to a prediction tool and will be returned along with the outcome probability. When anonymity is required the TXID can be generated by the client system of the prediction tool. The client system then retains a mapping to an individual identifier, such as NHS number, so that when it receives an outcome probability along with the TXID it can relate it back to the specific individual.

<DataDictionary numberOfFields="71">
  <DataField name="TXID" displayName="Transaction ID which can be correlated to a patient identity" optype="categorical" dataType="string" />
  <DataField name="outcome" displayName="Probability of an emergency admission" optype="continuous" dataType="double" />
  <DataField name="agegrp0004" displayName="Age 0-4" optype="categorical" dataType="integer">
    <Value value="1" displayValue="Yes"/>
    <Value value="0" displayValue="No"/>
  </DataField>
  <DataField name="agegrp1539" displayName="Age 15-39" optype="categorical" dataType="integer">
    <Value value="1" displayValue="Yes"/>
    <Value value="0" displayValue="No"/>
  </DataField>

…

The data dictionary is followed by the model definition. For CPM this is a “logisticRegression” model type. The first part of the definition declares the mining schema, which identifies which variables in the data dictionary are used in the model and what role they play.

<RegressionModel functionName="regression" modelName="DH_Combined_Predictive_Model" algorithmName="Unknown" modelType="logisticRegression" normalizationMethod="logit">
  <MiningSchema>
    <MiningField name="TXID" usageType="supplementary"/>
    <MiningField name="agegrp0004"/>
    <MiningField name="agegrp1539"/>
    <MiningField name="agegrp4059"/>
    <MiningField name="agegrp6064"/>
    <MiningField name="agegrp6569"/>
    <MiningField name="agegrp7074"/>
    <MiningField name="agegrp7579"/>

…

This is then followed by the declaration of the regression intercept and variable coefficients.

<RegressionTable intercept="-3.822847424">
  <NumericPredictor name="agegrp0004" coefficient="0.289313618"/>
  <NumericPredictor name="agegrp1539" coefficient="0.385646264"/>
  <NumericPredictor name="agegrp4059" coefficient="0.373238463"/>
  <NumericPredictor name="agegrp6064" coefficient="0.630720996"/>
  <NumericPredictor name="agegrp6569" coefficient="0.481813417"/>
  <NumericPredictor name="agegrp7074" coefficient="0.507764968"/>
  <NumericPredictor name="agegrp7579" coefficient="0.813038432"/>

…

The utility of using PMML is that it is now possible to fully decouple the model producer from the model consumer (a prediction tool). As an example, the company Zementis (http://www.zementis.com/) provides a product called ADAPA, which is a standards-based (PMML), real-time scoring engine (prediction tool) accessible on the Amazon Cloud as a service. To use the service you need to be registered (a free trial registration is available). You then log in to the ADAPA console.

Having logged in, you can upload a predictive model definition in PMML format – in our case the CPM definition shown in Appendix A.


ADAPA validates the PMML file and then effectively builds a prediction tool instance dynamically, based on the predictive model definition. ADAPA then allows you to make predictions either by calling a set of web services or simply by uploading a CSV text file that contains the predictor values for a set of cases, one case per line. Appendix B shows a test file for CPM that contains four test cases. The first three are made-up examples; the fourth case is taken from the CPM report, where a detailed worked example of using CPM is presented. In ADAPA you select to score or predict data. You then select the PMML model to use (in our case the CPM model we have just uploaded) and upload the CSV data file to run through the model.
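Preparing such a CSV input file can be scripted. The sketch below is illustrative: the field names follow the CPM PMML snippet earlier (with only a few of the 71 fields shown), and whether a header row is required, and in what form, depends on the particular scoring engine.

```python
import csv

# Illustrative test cases: TXID plus a handful of the CPM age-group predictors.
fieldnames = ["TXID", "agegrp0004", "agegrp1539", "agegrp4059"]
cases = [
    {"TXID": "case-001", "agegrp0004": 0, "agegrp1539": 1, "agegrp4059": 0},
    {"TXID": "case-002", "agegrp0004": 0, "agegrp1539": 0, "agegrp4059": 1},
]

# Write one case per line, with a header row naming the model fields.
with open("cpm_test_cases.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(cases)
```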


You can then view the outcome predictions and/or download the results as a text file. The probability of 0.725 for the example taken from the CPM report is the same as the value calculated in the report.
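Because the model is plain XML, the same calculation can be reproduced outside any particular product. The following sketch parses a cut-down RegressionTable modelled on the snippet shown earlier (the PMML namespace and the other 69 CPM fields are omitted for brevity, so the result is illustrative rather than a real CPM score) and applies the logit normalisation:

```python
import math
import xml.etree.ElementTree as ET

# A cut-down, illustrative fragment in the shape of the CPM PMML shown earlier.
PMML = """
<RegressionModel functionName="regression" normalizationMethod="logit">
  <RegressionTable intercept="-3.822847424">
    <NumericPredictor name="agegrp6569" coefficient="0.481813417"/>
    <NumericPredictor name="agegrp7074" coefficient="0.507764968"/>
  </RegressionTable>
</RegressionModel>
"""

model = ET.fromstring(PMML)
table = model.find("RegressionTable")
intercept = float(table.get("intercept"))
coefficients = {
    p.get("name"): float(p.get("coefficient"))
    for p in table.findall("NumericPredictor")
}

def predict(predictors):
    """Linear score, then logit normalisation: p = 1 / (1 + e^(-z))."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in predictors.items())
    return 1.0 / (1.0 + math.exp(-z))

# An illustrative individual in the 65-69 age group.
p = predict({"agegrp6569": 1, "agegrp7074": 0})
print(round(p, 4))
```

This is exactly the decoupling PMML enables: the scoring logic above knows nothing about the product that produced the model.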


Summary of Technical Guidance for Model Representation

When building a predictive model:

1. Represent the model in PMML format.
2. Provide an additional data dictionary document describing model variables.

When selecting a predictive model:

1. Ask to see a technical description (representation) of the model.
2. Ask if the model representation is available in PMML format.
3. If no representation is available – question the utility of selecting such a model!

6.10. Model Maintenance

A model is generated from a training dataset. A training dataset represents a baseline population over a specific time span. The characteristics of the baseline population will naturally change over time and, as pointed out earlier, the successful application of predictions made by the model as interventions will also effect change within the baseline population. Both these factors indicate that the external validity of a model may well change over time, and thus a model should undergo regular maintenance.

Model maintenance consists of reassessing the external validity of the predictive model at a regular period. The period will be determined by the anticipated rate of change in the baseline population. This will be in the order of several years. Some model builders in the health care domain take three years as a rule of thumb.


The reassessment of external validity is normally carried out by building a retrospective test dataset for the same base population, and from the same data sources, that were used to create the initial training dataset. Data covering the time span from the last reassessment (or initial model creation) to the end of the period is used. As in the training dataset, this will contain values for both predictors and outcomes. The process of creating this test dataset follows that already outlined for creating a training dataset.

The same measures of external validity should be applied to the test dataset as were originally applied when creating the model. The values of the current and previous measures of external validity should be compared to determine if there is any trend in changes of external validity. The new values of the measures should be used to decide if the model is no longer valid (accurate enough) and needs re-generation.

If the model is no longer valid it needs to be re-generated. This can be achieved by using the test dataset as a new training dataset and building a new model using the data mining product as previously described. It is important to note that this new model may be significantly different from the previous model, in that the new model may have different predictors. For example, re-generating a logistic regression predictive model may not just create a model with different values for the intercept and coefficients; it may drop predictors used in the previous model and introduce new ones.

The implication is that model maintenance can have a profound impact on the operationalisation of a predictive model in the form of predictive tooling. If an in-use model changes, both the predictive tool that uses the model and possibly the data marshalling process used to collate input data for the tool will need to be changed. Where predictive tooling and data marshalling are technically and/or commercially constrained, any changes can be difficult and/or costly.

Model maintenance should be based on model re-generation, to ensure that model internal validity is met. Model maintenance that merely adjusts model parameters through some form of parameter modelling, often called recalibration, should be avoided, as the internal validity of the recalibrated model cannot be assessed.
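The reassessment step can be reduced to a simple comparison of measure values across maintenance periods. In the sketch below the measure names, history values and thresholds are all illustrative; in practice the acceptable floors would be agreed when the model is commissioned.

```python
# External validity measures recorded at model creation and at each
# maintenance period (illustrative values only).
history = [
    {"period": "2008 (creation)", "sensitivity": 0.62, "ppv": 0.55},
    {"period": "2011 review",     "sensitivity": 0.58, "ppv": 0.49},
]

# Minimum acceptable values for each measure (illustrative thresholds,
# not taken from this guidance).
thresholds = {"sensitivity": 0.60, "ppv": 0.50}

def needs_regeneration(current, thresholds):
    """Return the measures that have fallen below their agreed floor;
    a non-empty result means the model should be re-generated."""
    return [m for m, floor in thresholds.items() if current[m] < floor]

failed = needs_regeneration(history[-1], thresholds)
if failed:
    print("Re-generate model; below threshold:", ", ".join(sorted(failed)))
```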

Summary of Technical Guidance for Model Maintenance

When building a predictive model:

1. Decide on the maintenance period.
2. Create a test dataset at each maintenance period.
3. Run your model external validity measures on the test dataset.
4. Re-generate your model using the test dataset as the training dataset if external validity is no longer met.


When selecting a predictive model:

1. Ask what the maintenance period is.
2. Ask what the maintenance process is.


7. Prediction Tooling

This section discusses issues to consider when implementing a prediction tool, and provides appropriate technical guidance.

7.1. Model Choice

A key decision to make, when either selecting or building a prediction tool, is which prediction model to use. There are a variety of well-known models available, such as PARR, CPM [Ref: 6], LACE [Ref: 7] and PEONY [Ref: 8], as well as many variants of these. There are also a growing number of “bespoke” local models being generated by model builders for specific local baseline populations (see for example Sussex Health Informatics Service: http://www.sussexhis.nhs.uk/our-services/sussexcpm/ and NHS Devon: http://www.devonpct.nhs.uk/Strategies/Devon_Predictive_Model.aspx). Irrespective of model type and model builder, the following key criteria should be used in selecting a model:

• Outcome – what outcomes do you want to predict? Different models may predict different outcomes; there is no point in choosing a model that does not predict what you are interested in.

• Training dataset – what was the scope of the data used to create the model? This includes the baseline population and data sources, and will determine the applicability of the model to your requirements. For example, the PARR model used a baseline population of all acute patients in England; as such it did not include all GP registered patients or any GP data, and therefore would not be considered suitable for predictions for primary care populations.

• Internal validity – what evidence is there that a model is internally robust? Is there a report?

• External validity – what evidence is there that a model is externally accurate? Is it more or less accurate than similar models?

• Maintenance – when was the model created? What arrangements are in place to maintain the model? Unfortunately most models do not seem to have maintenance arrangements in place, as many are produced by research groups as a one-off exercise.

• Representation – how is the model documented/represented? Are there any constraints on access to the representation, or on use of it to implement your own predictive tooling? Note that some commercially produced and supported predictive models are treated by vendors as proprietary; although the vendors are happy to share internal and external validity reports, they will not provide a representation of the model.

• Data input – what input variables (predictors) does a model require to make a prediction? You will need to source and process the correct data to feed into a model. Do you have access to the appropriate data for a model? Is it of good enough quality to produce accurate predictions? How difficult will it be to process the source data into the model inputs?


You may find, based on these criteria, that no existing predictive model fits your requirements adequately. You will then have to decide either to compromise on your requirements or to look at building your own “bespoke” model.

Summary of Technical Guidance for Model Choice

Use the following criteria when selecting a predictive model for implementation within a predictive tool:

1. Model outcome.
2. Model training dataset scope – baseline population and data sources.
3. Model internal validity.
4. Model external validity.
5. Model maintenance.
6. Model representation.
7. Model data input processing.

7.2. Marshalling Input Data

Any prediction tooling implementing a predictive model will require input data as predictor values to be able to calculate and output an outcome probability. The model description will detail what each predictor represents, its data type and, where coded, which code set it uses and which code values must be used. All this information should be available in the model representation. In most organisations the required input data will not be readily available in the exact structure and format required by the model. Therefore a process often called data marshalling needs to be implemented. This is very similar to the process used to create a training dataset:

• Identify and document appropriate data sources
• Address any data sharing issues needed to gain access to source data
• Check the quality of data and clean if necessary
• Link data to create single input records for the predictive tool and check linkage quality
• Transform data in single input records where appropriate to match model predictor data type and code set requirements

As for model building, it is recommended that an RDBMS is used for data storage and management of the data marshalling process. Data sources will include an appropriate baseline population source as well as data sources for health and social care records, as discussed in section 6.1.

An important issue to consider with data sources being used to marshal input data is one of latency. In the simplest case a data source is a direct provider, for example a GP Practice, where there is minimal delay or latency between information changing about an individual, that information being recorded on


the provider’s information system, and it being subsequently available for access as a data source. However, where a data source is an indirect provider, for example a centralised information broker such as SUS, there will be a latency between the changed information being recorded on the provider’s information system and it subsequently being available on the indirect provider’s information system for access as a data source. For example, the indirect provider may schedule updates to its information system from the direct provider feeds on a monthly basis. The situation becomes more complicated with multiple data sources with different latencies. It is often useful to create a scheduling diagram which sets out the data sources and their update schedules and latencies. This will help you decide when it is valid to calculate predictions.

Using data sources to create input data for a prediction tool will involve sharing data between organisations within and across health and social care. Such data sharing will require careful consideration of IG obligations, constraints and controls. As discussed in section 6.3, detailed consideration of data sharing is outside the scope of this technical guidance, but a list of informative references is provided.

The quality of input data will affect the accuracy of any predictions calculated by the prediction tool. You should therefore check the quality of your input data as discussed in section 6.2. Where appropriate and feasible you should also clean the data if there are data quality issues. Note, however, that prediction tools normally handle missing values, so you do not usually need to use imputation.

The data taken from different data sources will relate to specific individuals. To construct input records for the prediction tool these data need to be linked together, so that all relevant data items used as predictors that belong to a single individual are collated, as discussed in section 6.4.
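The latency and scheduling considerations above can be reduced to a small calculation: a prediction run is only valid for events up to the date that the least up-to-date source fully covers. The sketch below is illustrative; the source names, extract dates and latencies are invented.

```python
from datetime import date, timedelta

# Illustrative data sources: the last extract received and the typical
# latency between an event occurring and it appearing in that extract.
sources = {
    "GP practice feed": {"last_extract": date(2012, 2, 28), "latency_days": 2},
    "SUS monthly feed": {"last_extract": date(2012, 2, 15), "latency_days": 30},
}

def coverage_date(source):
    """Latest event date this source can be assumed to fully cover."""
    return source["last_extract"] - timedelta(days=source["latency_days"])

# Predictions are only valid up to the least up-to-date source.
valid_up_to = min(coverage_date(s) for s in sources.values())
print("Predictions valid for events up to", valid_up_to.isoformat())
```

A scheduling diagram conveys the same information visually; the calculation above is simply its machine-checkable form.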
Having assembled input records for the prediction tool, these often require a final transformation of individual data items before they are compatible with the expected predictor data type and code set requirements. Data type transformation is relatively straight forward. For example a collated data item maybe an integer but needs to be transformed to a string to be compatible with the expected data type for the predictor. Code transformation, often called a code crosswalk, can be more problematic. A data item derived from data sources maybe coded using one code set, but the predictor requires values from another code set. To transform the code values you need to set up a code crosswalk which maps each individual code value from the source code set to the predictor code set. The crosswalk may map more than one source code value to the same predictor code value or in some circumstances no valid code mapping can be made. Some standard code crosswalks are available from TRUD, see: https://www.uktcregistration.nss.cfh.nhs.uk/trud/


The level of granularity, as well as the semantic meaning, of different code sets can make code crosswalks difficult to create. For example, in the health care domain ICD-9 and ICD-10 codes (International Statistical Classification of Diseases and Related Health Problems) are often used by model builders to code mining attributes that represent disease diagnoses and treatments in a training dataset. Some prediction tool users will use HRG (Healthcare Resource Group) codes from centralised data sources such as SUS. These represent standard groupings of clinically similar treatments which use common levels of healthcare resource. Therefore multiple ICD-9/10 codes will map to a single HRG code and, conversely, a single HRG code will map to multiple ICD-9/10 codes. It is recommended that the data transformation rules and code crosswalks are formally documented.

Data transformation can be implemented in a variety of ways. However, the most common approach is to build on the staging tables within a database. The data transformation logic is implemented in bespoke code. The code traverses the staging database table containing the assembled prediction tool data and, for each record it finds, transforms the appropriate data items. The transformed set of records is written out to a separate database table that will hold the prediction tool input data. The detailed results of the transformation are written to some form of data transformation report. The code can be implemented either within the database itself, for example as stored procedures, or in an external application, for example a Java or .NET application.
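At its simplest, a code crosswalk is a many-to-one lookup table with explicit handling of unmappable codes. The sketch below is illustrative only: the code pairings are invented placeholders, not real ICD-10-to-HRG mappings.

```python
# Illustrative many-to-one crosswalk from a source code set to a target
# code set. These pairings are invented for demonstration purposes.
crosswalk = {
    "SRC-001": "TGT-A",
    "SRC-002": "TGT-A",   # two source codes map to the same target code
    "SRC-003": "TGT-B",
}

def translate(code, crosswalk, report):
    """Map a source code to the target code set, logging unmapped codes."""
    target = crosswalk.get(code)
    if target is None:
        report.append(code)   # no valid mapping; record for the report
    return target

report = []
results = [translate(c, crosswalk, report)
           for c in ["SRC-001", "SRC-002", "SRC-999"]]
print(results)   # ['TGT-A', 'TGT-A', None]
print(report)    # ['SRC-999'] - unmapped codes for the transformation report
```

The collected report list is the kind of detail that should end up in the formally documented data transformation report recommended above.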

Data marshalling is likely to be a process that is both time and resource constrained, especially if the prediction process is to be event driven (see later discussion). Therefore data storage will need to have higher levels of availability, reliability and performance than is required for model building. You should therefore document the non-functional requirements for the data storage and management in detail. These should include:

• Performance - the speed (response time) or effectiveness (throughput) of a computer, network, software program or device.


• Reliability - the ability of a system or component to perform according to its specifications for a designated period of time.

• Availability - the ratio of time a system or component is functional to the total time it is required or expected to function.

• Scalability - the ability of a system to continue to function well when it is changed in size or volume in order to meet a user need.

For data marshalling, server computers are probably more appropriate than desktop computers. As discussed in section 6.5, consideration should be given to security, backup/recovery and RDBMS features.

Summary of Technical Guidance for Marshalling Data Input

When creating a data marshalling process:

1. An RDBMS is recommended for data storage and management.
2. Create and maintain a comprehensive data dictionary for all data sources.
3. Create an ER data model for all data sources.
4. Document data formats for all data sources.
5. Document data transport methods for all data sources.
6. Address any data sharing issues needed to gain access to source data.
7. Measure the quantity of missing values in all data sources.
8. Document validation rules and credibility checks for all data sources.
9. Implement a data quality checker based on the validation rules and credibility checks to produce a detailed data quality report.
10. If required, implement data cleaning.
11. Create a single table predictor tool input dataset.
12. Use NHS number and/or Last Name-Date of Birth-Gender to match data.
13. Document the method used for linking data across multiple data sources, including the form of pseudonymisation used if applicable.
14. Document how missing linkage data is handled; ignoring cases with missing linkage data is recommended.
15. Measure the quantity of missing linkage data.
16. Document the method of transforming data.
17. Report on the data transformation.
18. Document the non-functional requirements for data storage and management.
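Step 9 above can be sketched as a small rule-driven checker. This is an illustrative sketch only: the field names, code sets and validation rules are assumptions, and real rules should come from your documented data dictionary and credibility checks.

```python
from datetime import date

def validate_record(rec):
    """Apply illustrative validation rules to one source record."""
    errors = []
    if not rec.get("nhs_number", "").strip():
        errors.append("missing NHS number")
    try:
        if date.fromisoformat(rec.get("date_of_birth", "")) > date.today():
            errors.append("date of birth in the future")
    except ValueError:
        errors.append("invalid date of birth")
    if rec.get("gender") not in {"1", "2", "9"}:  # assumed code set
        errors.append("invalid gender code")
    return errors

def quality_report(rows):
    """Produce a detailed data quality report: counts per failure type."""
    report = {"records": 0, "failed": 0, "errors": {}}
    for rec in rows:
        report["records"] += 1
        errs = validate_record(rec)
        if errs:
            report["failed"] += 1
            for e in errs:
                report["errors"][e] = report["errors"].get(e, 0) + 1
    return report
```

The resulting report can then inform the data cleaning decision in step 10.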

7.3. Tool Integration

A prediction tool will support some level of integration with its environment and interoperability with other systems and services. At a minimum a prediction tool must support:

• A method of inputting predictor values for one or more individuals. This input should include at least one data item that is not a predictor but is used to identify the individual (supplementary attribute).

• A method of outputting the outcome probabilities for one or more individuals. Associated with each outcome probability should be the identifier value inputted to identify the individual.

The most rudimentary methods for inputting and outputting are providing input screens to type in data and output screens to view results. Although useful for learning about a tool and for ad-hoc testing, a prediction tool must support some form of data input and output to external storage. At a minimum this should be from and to text based files. Where an RDBMS approach has been taken for data marshalling, as recommended, the RDBMS will need to export the records in the Prediction Tool Input Data table to a text file.

A prediction tool may support a direct interface to an RDBMS, which will allow direct loading of input data from a database table, followed by direct storage of probability outcomes to an outcome table.

A prediction tool may support an API that allows external systems to directly input data and retrieve probability outcomes programmatically. Such an API is desirable where the prediction tool is going to be programmatically integrated with other systems, especially where the prediction tool is going to be used in a real time environment where predictions for individuals are made dynamically, for example as part of a discharge summary from a hospital. There are no standardised prediction tool APIs; therefore, from a system integration viewpoint, it is recommended that you develop your own façade API that your systems use to interface with the prediction tool. This façade API will in turn directly use the prediction tool API. By using a façade API you decouple your systems from the specific prediction tool you are using, which then makes it easier to change the prediction tool – you only need to recode the façade API.

A prediction tool may support Web Services that also allow external systems to directly input data and retrieve probability outcomes programmatically.
Web Services are similar to APIs but are more suited to Service Oriented Architectures (SOA), where individual systems are decoupled into independent services that interface using standard internet protocols, allowing physically distributed services running on disparate platforms and technologies to be easily integrated. There is no standardised prediction tool web service; therefore, as for the API, it is recommended that from a system integration viewpoint you develop your own façade web service that your services use to interface with the prediction tool. If you are considering using web services to interface to your predictive tool and plan to integrate systems across organisations and/or communities, you may want to contact the NHS Interoperability Toolkit (ITK) community to see if you could collaborate in creating a national standard for a web service interface to predictive tools. See: http://www.connectingforhealth.nhs.uk/systemsandservices/interop for more details.
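The façade recommendation above can be sketched as follows. `VendorTool` is a hypothetical stand-in for whichever prediction tool API you have bought or built; only `PredictionFacade` would need recoding if the tool changes.

```python
class VendorTool:
    """Stand-in for a hypothetical vendor prediction tool API."""
    def score_batch(self, records):
        # Vendor-specific call; returns vendor-specific result records.
        return [{"id": r["patient_id"], "p": 0.5} for r in records]

class PredictionFacade:
    """Local façade exposing a stable, tool-neutral interface."""
    def __init__(self, tool):
        self._tool = tool

    def predict(self, individuals):
        # Translate our canonical input into the vendor's format...
        vendor_input = [{"patient_id": i["id"], **i["predictors"]}
                        for i in individuals]
        results = self._tool.score_batch(vendor_input)
        # ...and translate vendor output back to our canonical output.
        return {r["id"]: r["p"] for r in results}
```

Calling `PredictionFacade(VendorTool()).predict(...)` keeps vendor-specific field names and call conventions out of the rest of your systems.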

When interfacing to a prediction tool, consideration should be given to the model of interaction. Do you need to push or pull data to the tool? Do you need to interact synchronously or asynchronously? APIs and web services can support push and pull, and synchronous and asynchronous interaction. File input is normally suited to pull, where the prediction tool is instructed manually or via a schedule to load a text file. File output is normally suited to push, where the prediction tool will automatically or manually save a text file.
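As a concrete sketch of the file-based pull/push pattern, the fragment below reads a text input file (as exported from the marshalling RDBMS) and writes an outcome file keyed by identifier. The column names and the trivial scoring rule are illustrative assumptions, not part of any real tool.

```python
import csv
import io

def score(row):
    """Placeholder scoring rule; a real tool applies its predictive model."""
    return min(1.0, float(row["age"]) / 100)

def run_batch(infile, outfile):
    # Pull: read the text input file exported from the marshalling RDBMS.
    reader = csv.DictReader(infile)
    # Push: write one outcome row per individual, keyed by its identifier.
    writer = csv.DictWriter(outfile, fieldnames=["nhs_number", "probability"])
    writer.writeheader()
    for row in reader:
        writer.writerow({"nhs_number": row["nhs_number"],
                         "probability": score(row)})

# In-memory files stand in here for text files on disk.
infile = io.StringIO("nhs_number,age\n9434765919,70\n")
outfile = io.StringIO()
run_batch(infile, outfile)
```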

Summary of Technical Guidance for Tool Integration

Use the following criteria when either selecting or building a predictive tool:

1. The tool should provide input/output screens for learning and ad-hoc testing.
2. The tool should provide a text file input/output interface.
3. The format of the text input file and text output file should be formally documented.
4. The tool may provide an API interface. This may support push, pull, synchronous and asynchronous interactions.
5. An API, if provided, should be formally documented.
6. When using an API for system integration it is recommended you develop a local façade API.
7. Any façade API should be formally documented.
8. The tool may provide a Web Service interface. This may support push, pull, synchronous and asynchronous interactions.
9. A Web Service, if provided, should be formally documented.
10. When using a Web Service for system integration it is recommended you develop a local façade Web Service.
11. Any façade Web Service should be fully documented.
12. Any façade Web Service should conform to web service infrastructure standards published within the ITK.

7.4. Tool Function and Use

When either selecting or building a prediction tool, the functionality and style of use of the tool need to be considered.

A suggested minimum set of functionality for a prediction tool is:

• Creating outcome predictions based on a predictive model and input data – sometimes referred to as the prediction engine; this is the core functionality of a prediction tool.

• Interface(s) to accept input data.
• Interface(s) to produce output data.
• A user interface (UI) for both administrators and end users.
• Access control – to control user and external system access to the prediction tool.
• Error and audit logging.
• Error and audit reporting.
• Configuration management of engine, interfaces, access control and logging.

Consideration should be given to how configurable and adaptable a prediction tool is. Most tools will have the implementation of the predictive model "baked" into the product. Any changes to the model definition will then often require code changes to the product. As highlighted previously, use of the standard predictive model representation PMML offers the opportunity of implementing highly adaptable predictive tools. The configurability of input and output interfaces to the tool is important: interfaces that can easily be configured to input and output data in different structures will make system integration easier and cheaper.

In addition to the minimum set of functionality, a prediction tool may implement additional functionality around:

• Management – storage of outcomes so they are amenable to further analysis and display. Outcomes will be associated with an individual and there will normally be historical prediction outcomes for an individual that need to be retained to allow trend analysis.

• Analysis – grouping (stratification), calculating description statistics, statistical inference and trend analysis of outcomes.

• Display – graphical and text display of analysis results of outcomes and the raw outcomes both at individual and stratum levels. This may include integration with other health or social care data relating to an individual or stratum. This may include display in the context of a dashboard.

• Distribution – provision of outcomes and analysis of outcomes to other systems or services.

This type of additional functionality is normally classified as Business Intelligence (BI) or Analytics, and there are many products available (some open source) that provide it; however some prediction tools include this additional functionality as bespoke code. This can limit their future extensibility to handle new requirements compared to the use of a dedicated BI platform. When selecting or building a prediction tool it is therefore recommended that you consider the use of a BI platform to provide the analytical functions, especially if you are currently using a BI platform within your organisation to support other applications.

This additional functionality is used to implement downstream business functionality within health care such as:

• Risk Stratification – stratification of baseline population patients into groups with similar predicted risk to help understand the case mix at both commissioner and primary care provider levels.

• Case Finding – identifying individual patients within a baseline population who have a predicted high risk to their primary care provider so they can initiate appropriate interventions.

• Resource Management – builds on risk stratification to understand resource implications of predicted case mix at both commissioner and primary care provider levels.

Some prediction tools do not distinguish between the prediction functionality and the downstream business functionality; for example, providing a single tool for risk stratification that takes data from data sources and presents appropriate analyses and reports. Such all-in-one products may initially seem attractive in terms of reducing operational complexity and maintenance; however they often lack the flexibility to easily add new business functionality in the future and to change technology and application components easily. Therefore, from an architectural viewpoint, it is recommended that the prediction tool is considered to be a separately implemented function that then integrates with separate downstream business functions (such as risk stratification). This decoupling of the prediction functionality from the business application of the predictions will provide the maximum flexibility.

In practice, storage and management of the predicted outcomes is a necessary precursor for all of the other business functions. It could therefore logically be regarded as a functional component of a prediction tool; however, again to provide the maximum flexibility, it is recommended this is decoupled and provisioned as a separate service. These considerations lead to a recommended target architecture for an end-to-end predictive modelling solution, as shown below.

The storage and management of prediction outcomes and the downstream business functions are discussed in the following Prediction Application section.

You need to consider how you want to use a prediction tool and match this to the capabilities of actual tools. The main determinant of use is when you need to make predictions. Making predictions can be time or event driven. Time driven prediction means you decide when you want to make predictions. For example, in risk stratification it is common to build, on a regular schedule (say every month), an input dataset with current predictor values for all individuals in the baseline population and then submit this dataset to a prediction tool. Event driven prediction means external events decide when you need to make a prediction. For example, in an acute setting you may want an updated risk prediction made for an individual as part of their discharge process; this will be triggered by the discharge event.

In terms of data processing, tools can be used in batch and/or transactional modes. The majority of health care uses of prediction tools currently operate in batch mode. In batch mode you require the tool to be able to handle inputting and outputting large datasets, but speed of processing is usually not critical. In transaction mode you require the tool to be able to handle many small datasets or individual predictions; speed of processing is now important, as you will be concerned about the throughput capability of the tool. Batch mode will be important when you are using time driven prediction; transaction mode will be important when you are using event driven prediction. It is recommended that you assess the batch and transactional capabilities of any prediction tool against your time and/or event driven usage requirements.
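An event driven, transactional use can be sketched as a handler that reacts to a single discharge event. The event structure and the scoring rule below are illustrative assumptions.

```python
# Event driven prediction: an external discharge event triggers an
# immediate prediction for one individual, in contrast to a scheduled
# batch run over the whole baseline population.

def predict_readmission(predictors):
    """Stand-in for a synchronous call to the prediction tool's API."""
    return min(1.0, 0.1 + 0.05 * predictors["prior_emergency_admissions"])

def on_discharge(event):
    """Handler invoked by the discharge workflow (e.g. from a message queue)."""
    risk = predict_readmission(event["predictors"])
    # A real system would store the outcome and attach it to the
    # discharge summary; here it is simply returned.
    return {"nhs_number": event["nhs_number"], "risk": risk}
```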

Summary of Technical Guidance for Tool Function and Use

Use the following criteria when either selecting or building a predictive tool:

1. Should provide a prediction engine.
2. Should provide data input interface(s).
3. Should provide data output interface(s).
4. Should provide a user interface for administration and users.
5. Should provide access control.
6. Should provide audit and error logging.
7. Should provide audit and error reporting.
8. Should provide configuration management of functions.
9. Assess how adaptable the tool is, particularly with regard to changing the predictive model used and configuring the input/output interfaces to deal with different data structures.
10. Assess the batch and transactional capabilities of the tool against your time or event driven prediction requirements.
11. Decouple the predictive tool from outcome storage and downstream business functions to provide maximum flexibility.
12. Consider the use of standard BI platforms to deliver downstream business functions.

7.5. Implementation

Implementation of a prediction tool will involve:

• Development/selection of a tool
• Integration testing of the tool
• Hosting of the tool
• Operational management of the tool

Standard system development and implementation guidance applies.

7.5.1. Development/selection

When designing and developing a tool, you or the vendor should manage and control the system development life cycle through a well-defined set of policies and procedures. This should include unit testing, functional testing and acceptance testing (including the creation of "test rig" clients and end-points). Acceptance testing should test both the functional and non-functional aspects of the tool.

For functional testing, at an absolute minimum a set of test cases should be prepared that present a set of predictor values and the expected outcome probability. These test cases should be prepared independently of either the in-house team developing the tool or the external vendor supplying the tool. Note that preparing these test cases requires you to have a representation of the predictive model used by the prediction tool, so you can independently calculate the outcome prediction (often, unfortunately, this needs to be done by hand!). This implies any prediction tool provided as a "black box", with no detailed definition of what predictive model is used, cannot be properly acceptance tested.

Many developers and vendors struggle with executing non-functional testing, as this often involves using, for example, specialised performance measurement tools and rigs, and needing to instrument the application to produce the appropriate information to measure. All too often non-functional testing is relegated to a few rudimentary tests or is completely omitted. It is recommended that at least a few basic non-functional tests around capacity, performance, reliability and availability are carried out. Most of these can be simple stress tests where the system is increasingly loaded until it breaks.

Choice of programming languages and frameworks for bespoke software development should be restricted to current mainstream technologies such as Java and .NET.
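For example, if the tool's documented model is a logistic regression with published coefficients, expected outcome probabilities for acceptance test cases can be computed independently. The coefficients and predictor names below are illustrative, not from any real model.

```python
import math

# Illustrative model definition: an intercept plus one coefficient per
# predictor, as it might appear in the tool's model documentation.
COEFFS = {"intercept": -3.0, "age": 0.03, "prior_admissions": 0.8}

def expected_probability(predictors):
    """Independently calculated logistic-regression outcome probability."""
    z = COEFFS["intercept"] + sum(COEFFS[name] * value
                                  for name, value in predictors.items())
    return 1.0 / (1.0 + math.exp(-z))

# One acceptance test case: predictor values plus the expected outcome,
# prepared independently of the build team or vendor.
case = {"age": 70, "prior_admissions": 2}
expected = expected_probability(case)
```

The tool's output for the same predictor values can then be compared against `expected` within an agreed tolerance.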

7.5.2. Integration Testing

Following development and functional testing of a tool, a series of integration tests will be required. These tests will validate the communication of information to the tool from data sources via the data marshalling process, and from the tool to data consumers.

An issue to consider here is what data to use. A small set of test cases will be adequate for basic testing; however you will usually require a larger volume of data to test correct functioning of interfaces. This data may be "live data" representing data for actual individuals. If there are no security or IG risks with using live data for integration testing (for example, the data is anonymised or pseudonymised) and there is no possibility that the results of processing this data are used for any health or social care purposes while integration testing, then this is acceptable. However, even with these risks mitigated, you may find that you are still not allowed to use live data for integration testing. You will then need to create large volumes of test data. This is best done by developing bespoke test data generators rather than attempting to craft test data by hand.

Integration testing should also provide some form of non-functional testing, although the limitations and difficulties of such testing, as outlined in the previous section, should be taken into account.
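A bespoke test data generator of the kind recommended above might look like this; the fields and value ranges are illustrative assumptions, and a fixed seed keeps runs reproducible.

```python
import random

def generate_test_data(n, seed=0):
    """Yield n entirely synthetic records for integration testing."""
    rng = random.Random(seed)  # fixed seed => reproducible datasets
    for _ in range(n):
        yield {
            "nhs_number": str(rng.randrange(10**9, 10**10)),  # synthetic id
            "age": rng.randint(0, 100),
            "gender": rng.choice(["1", "2"]),
            "prior_admissions": rng.randint(0, 5),
        }
```

Because no record corresponds to a real individual, large volumes can be generated and exchanged without any IG risk.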

7.5.3. Hosting

There are many options for hosting a tool. These include:

• In-house – within your own data centre or server room
• External public sector co-located – within an external data centre owned and run by a public sector organisation but on infrastructure dedicated to your tool
• External public sector shared – within an external data centre owned and run by a public sector organisation but as a shared service that other organisations will use
• External private sector co-located – within an external data centre owned and run by a private sector organisation but on infrastructure dedicated to your tool
• External private sector shared – within an external data centre owned and run by a private sector organisation but as a shared service that other organisations will use

The choice of which hosting option to use will be driven by many factors, such as:

• Cost
• Security and IG constraints
• Service levels
• Network connectivity
• Reliability
• Availability
• Performance
• Disaster recovery
• Business continuity

These factors will be specific to each organisation implementing a tool.

7.5.4. Operational Management

Operational management covers the planning, development, delivery and day-to-day operational procedures relating to the tool. This includes appropriate levels of service governance, project management, resource management and analytical reporting, as well as compliance with relevant standards and guidelines. ITIL (http://www.itil-officialsite.com/) provides the recommended best practice approach to IT service management, which includes operational management.

7.5.5. Medical Device Status

Since June 1998 it has been mandatory for medical devices marketed and sold within Europe to comply with the Medical Devices Directive (MDD) 93/42/EEC. If a product has a medical purpose, i.e. it is specifically intended to provide or assist with the diagnosis, monitoring, prevention or treatment of a medical condition, it is likely to be a Medical Device under the MDD. A Medical Device is defined in Article 1 clause 2(a) of the MDD. This will depend not only on the function of the device itself, but also on the claims made for its intended use in the accompanying documentation.

Products with software that is only intended for transmitting, archiving or retrieving patient data, records or images without intended changes are thus not regarded as Medical Devices. By analogy, patient record software will fall into this category. If the software is intended to carry out further calculations, enhancements or interpretations of patient images or data, it is considered to be a Medical Device. If it carries out complex calculations which replace the clinician's own calculation and which will therefore be relied upon, then it will be considered a Medical Device. The opinion of the MHRA3 is that:

• Prediction software appears to have a screening function that may result in recommendations for further follow-up, and as such is likely to fall under the MDD.

• It would be classified as Class I under Rule 12 of Annex IX of the MDD and would therefore not require the intervention of a Notified Body, but be ‘self-declared’ in accordance with Annex VII of the MDD.

Low-risk (Class I non-sterile, non-measuring) devices require the manufacturer to self-declare conformity with the MDD and register the devices with an EU Competent Authority. Annex VII of the MDD states:

ANNEX VII
EC DECLARATION OF CONFORMITY

1. The EC declaration of conformity is the procedure whereby the manufacturer or his authorized representative established in the Community who fulfils the obligations imposed by Section 2 and, in the case of products placed on the market in a sterile condition and devices with a measuring function, the obligations imposed by Section 5 ensures and declares that the products concerned meet the provisions of this Directive which apply to them.

2. The manufacturer must prepare the technical documentation described in Section 3. The manufacturer or his authorized representative established in the Community must make this documentation, including the declaration of conformity, available to the national authorities for inspection purposes for a period ending at least five years after the last product has been manufactured. Where neither the manufacturer nor his authorized representative are established in the Community, this obligation to keep the technical documentation available must fall to the person(s) who place(s) the product on the Community market.

3. The technical documentation must allow assessment of the conformity of the product with the requirements of the Directive. It must include in particular:

- a general description of the product, including any variants planned,
- design drawings, methods of manufacture envisaged and diagrams of components, sub-assemblies, circuits, etc.,
- the descriptions and explanations necessary to understand the abovementioned drawings and diagrams and the operations of the product,
- the results of the risk analysis and a list of the standards referred to in Article 5, applied in full or in part, and descriptions of the solutions adopted to meet the essential requirements of the Directive if the standards referred to in Article 5 have not been applied in full,
- in the case of products placed on the market in a sterile condition, description of the methods used,
- the results of the design calculations and of the inspections carried out, etc.; if the device is to be connected to other device(s) in order to operate as intended, proof must be provided that it conforms to the essential requirements when connected to any such device(s) having the characteristics specified by the manufacturer,
- the test reports and, where appropriate, clinical data in accordance with Annex X,
- the label and instructions for use.

4. The manufacturer shall institute and keep up to date a systematic procedure to review experience gained from devices in the post-production phase and to implement appropriate means to apply any necessary corrective actions, taking account of the nature and risks in relation to the product. He shall notify the competent authorities of the following incidents immediately on learning of them:

(i) any malfunction or deterioration in the characteristics and/or performance of a device, as well as any inadequacy in the labelling or the instructions for use which might lead to or might have led to the death of a patient or user or to a serious deterioration in his state of health;

(ii) any technical or medical reason connected with the characteristics or performance of a device for the reasons referred to in subparagraph (i) leading to systematic recall of devices of the same type by the manufacturer.

3 Communication from Tore Johansen, Regulatory Affairs Manager, MHRA
You can access the full MDD at: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31993L0042:EN:NOT

For more information on medical devices see:

• Directives (click on Directive number) and list of Harmonised standards (click on arrow in fifth column): http://www.newapproach.org/Directives/DirectiveList.asp
• MEDDEV guidance documents: http://ec.europa.eu/enterprise/sectors/medical-devices/documents/guidelines/index_en.htm
• MHRA Website (Registration): http://www.mhra.gov.uk/Howweregulate/Devices/Registrationofmedicaldevices/index.htm
• MHRA Website (Notified Bodies): http://www.mhra.gov.uk/Howweregulate/Devices/NotifiedBodies/index.htm
• MHRA Website (How we Regulate Devices): http://www.mhra.gov.uk/home/idcplg?IdcService=SS_GET_PAGE&nodeId=48
• MHRA Website (Regulatory Guidance Notes): http://www.mhra.gov.uk/home/idcplg?IdcService=SS_GET_PAGE&nodeId=577
• MHRA Website (Directives Bulletins): http://www.mhra.gov.uk/home/idcplg?IdcService=SS_GET_PAGE&nodeId=576
• MHRA Website (Vigilance Guidance): http://www.mhra.gov.uk/home/idcplg?IdcService=SS_GET_PAGE&nodeId=197

Summary of Technical Guidance for Implementation

Use the following criteria when implementing a predictive tool:

1. A system development life cycle should be used.
2. Only mainstream development languages and technologies should be used in development.
3. Acceptance functional testing should use test cases created independently of the build team or vendor.
4. Acceptance testing should include some non-functional testing.
5. Integration testing should be carried out.
6. Develop test data generators for integration testing if live data cannot be used.
7. Determine hosting options based on local circumstances.
8. Operational management should follow best practice as defined in ITIL v3.
9. If marketing and selling a predictive tool, it is recommended that it is 'self-declared' in accordance with Annex VII of the MDD.

8. Prediction Application

This section discusses issues to consider for prediction application, and provides appropriate technical guidance.

8.1. Prediction Representation

An outcome prediction, once created, needs to be stored and used to be of utility. The representation of the prediction is usually in a proprietary format and in an implicit context. For example, a prediction probability may be stored as a real number in a single column within a database table. The context of the prediction (that it represents, say, the probability of an emergency admission in the next 12 months and has used the CPM model) is not explicitly defined; it is implicit within the context of the users of this tool. However, if you want to share a prediction with other systems and services that are outside of your end to end solution, then the prediction context needs to be made explicit in the prediction representation.

Unfortunately there is no standards based prediction representation, as there is (PMML) for predictive model representation. The key benefit of such a standards based prediction representation would be to allow interoperability between different prediction tools, and between prediction tools and downstream business applications. This section presents a set of candidate requirements which could be used to develop an outcome prediction representation standard within the context of existing health and social care technical standards such as HL7 V3.

• Value – The actual value of the prediction.
• Type – The type of the prediction value. For example a probability, percentage or risk score.
• Value Range – The minimum and maximum values a prediction value can hold. For a probability the range is 0.0 to 1.0; for a percentage, 0.0 to 100.0. A risk score may have any arbitrary range, for example 1 to 15.
• Is Missing – An indication that the prediction value is missing.
• Is Unknown – An indication that the prediction value is unknown.
• Creation Date Time – The date and time of when the prediction was created.
• Is Relative – An indication that the prediction is relative rather than absolute.
• Relative Reference – If the prediction is relative, a reference to a description of what it is relative to. This may have to be just a textual description, although a set of basic codes could be devised around both geographical and health/social service boundaries. For example: Geog.Country.England, Geog.Country.Scotland, NHS.CCG.<CCG id>, NHS.GP.<GP practice id>.
• Event – A reference to a description of the event being predicted. This may have to be just a textual description, although a set of basic codes could be devised. If the events relate to, for example, disease diagnoses or provider activities (such as emergency admission) then existing code sets could be used.
• Event Range – If applicable, the start and end date and times for the event being predicted. For example, if the event was "emergency admission in the next 12 months", the start date and time would be the Creation Date Time, and the end date and time would be Creation Date Time + 1 year.
• Model – A reference to a description of the prediction model used to make the prediction. This may have to be just a textual description. However, if a prediction model uses a standards based representation such as PMML, and this is placed in a repository that is freely accessible, then a true reference in the form of a URN or URL could be used.
• Expiration – If the prediction should only be used up to a certain date (a "use by" date), this defines the expiration date and time.
• Creator – A description of who or what has created the prediction. For a prediction tool this will indicate vendor, product name and product version.
• Derivation – The same prediction outcome may be recorded as different types; for example as a probability which is then turned into a national risk score and which is also turned into a CCG risk score. Where the same prediction is represented as different types within the same structure, derivation references (if relevant) the prediction it is derived from. Where relative and/or risk scores are used it becomes problematic to compare predictions across domains: is a relative probability (compared to the mean population probability) of 0.3 for CCG X worse or better than a relative probability (compared to the median population probability) of 0.2 for CCG Y? By being able to represent the same prediction as multiple types, with the base type being an absolute probability, and linking how these values are derived from each other, the prediction can be used for a greater variety of purposes.


Note that neither the identity of the individual the prediction applies to (if known) nor any security or IG constraints are included in the candidate requirements, as the prediction representation will be embedded within a wider structure which will define these.
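The candidate requirements above can be sketched as a simple data structure. This is an illustrative sketch only: the class and field names below are the requirement names rendered in code, not part of any existing standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Illustrative sketch of the candidate prediction-representation
# requirements. All names here are hypothetical, not from a standard.
@dataclass
class Prediction:
    value: Optional[float]       # Value: the actual prediction value
    type: str                    # Type: "probability", "percentage" or "risk score"
    value_min: float             # Value Range: minimum
    value_max: float             # Value Range: maximum
    creation: datetime           # Creation Date Time
    event: str                   # Event: description or code of the predicted event
    model: str                   # Model: textual description or PMML URN/URL
    is_missing: bool = False     # Is Missing
    is_unknown: bool = False     # Is Unknown
    is_relative: bool = False    # Is Relative
    relative_reference: Optional[str] = None   # e.g. "Geog.Country.England"
    event_start: Optional[datetime] = None     # Event Range: start
    event_end: Optional[datetime] = None       # Event Range: end
    expiration: Optional[datetime] = None      # Expiration ("use by" date)
    creator: str = ""            # Creator: vendor, product name and version
    derived_from: Optional["Prediction"] = None  # Derivation link

# Example: an absolute probability of emergency admission in the next
# 12 months, and a relative risk score derived from it.
now = datetime(2012, 3, 1)
prob = Prediction(value=0.37, type="probability", value_min=0.0, value_max=1.0,
                  creation=now, event="emergency admission", model="CPM",
                  event_start=now, event_end=now + timedelta(days=365))
score = Prediction(value=12.0, type="risk score", value_min=1.0, value_max=15.0,
                   creation=now, event="emergency admission", model="CPM",
                   is_relative=True, relative_reference="NHS.CCG.<CCG id>",
                   derived_from=prob)
```

The Derivation link allows a consumer to walk back from any derived type (here a CCG risk score) to the base absolute probability.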

8.2. Prediction Patient Record Status

One application of creating predictions is to inform the decisions made by health and social care professionals to initiate interventions with individuals (case finding). As such, it is recommended that outcome predictions are given equal status to other information, such as medical diagnoses, examination results and medical opinions, included in a patient's medical record. Therefore an outcome prediction (when identifiable with a patient) should be placed in the patient's medical record. Where a patient's medical record is an EMR, this has the additional benefit that the prediction can be used in a more flexible and powerful way when the EMR is used for group based analysis. Many EMRs have the capability to analyse and group patients into categories based on, for example, diagnosis related groups or treatments. Medical professionals can then assess the medical information of individual patients within each group to help inform their subsequent management and decide on service delivery priorities and load. By having the predictions as an intrinsic part of the EMR, they also become available for assessment irrespective of what grouping criteria are used.

8.3. Storage

As indicated in section 7.4, the storage and management of generated predictions, although a downstream application, is one that all prediction tools require. As for model building and data marshalling, it is recommended that an RDBMS is used for prediction outcome storage and management. As the prediction outcome RDBMS acts as a distribution hub to potentially many different downstream business applications, it needs to have adequate levels of availability, reliability and performance to satisfy their demands. Therefore server computers are more appropriate than desktop computers. As discussed in section 6.5, consideration should be given to security, backup/recovery and RDBMS product features. Where you already have either a central corporate RDBMS or a central Data Warehouse, it is recommended that you use it for model building, data marshalling and/or the outcome repository.
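As a minimal sketch of an outcome repository, the following uses SQLite purely for illustration (a server RDBMS is recommended in practice, as noted above). The table and column names are hypothetical.

```python
import sqlite3

# Sketch of an outcome-prediction repository acting as a distribution
# hub. SQLite stands in for a server RDBMS; schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prediction_outcome (
        patient_ref TEXT NOT NULL,   -- pseudonymised patient reference
        model       TEXT NOT NULL,   -- e.g. 'CPM'
        event       TEXT NOT NULL,   -- e.g. 'emergency admission, next 12 months'
        value       REAL,            -- the prediction value
        value_type  TEXT NOT NULL,   -- 'probability', 'percentage' or 'risk score'
        created_at  TEXT NOT NULL    -- ISO 8601 creation date/time
    )
""")
conn.execute(
    "INSERT INTO prediction_outcome VALUES (?, ?, ?, ?, ?, ?)",
    ("P0001", "CPM", "emergency admission, next 12 months",
     0.37, "probability", "2012-03-01T00:00:00"),
)

# A downstream business application (e.g. case finding) queries the
# repository for the highest-risk patients:
rows = conn.execute(
    "SELECT patient_ref, value FROM prediction_outcome "
    "WHERE value_type = 'probability' ORDER BY value DESC LIMIT 10"
).fetchall()
```

Keeping the repository as a plain relational table makes it straightforward for many downstream applications to consume the same predictions through standard SQL access.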


8.4. Downstream Integration

Downstream business applications use the generated prediction values for different purposes. In Long Term Conditions the two most common applications are Case Finding and Risk Stratification. Case finding is normally used by primary care providers to identify and track patients who have a predicted high risk for an outcome such as emergency admission in the next 12 months. Primary care providers can:

• Use a common centralised shared case finding system
• Use a dedicated centralised case finding system: one case finding system instance per provider
• Use a dedicated local case finding system

The centralised shared approach has the technical advantages that integration between the outcome prediction storage and the business application is simplified, as they are co-located, and that economies of scale can be achieved. However, strict access control and security must be implemented so that data is logically partitioned between different providers.

The centralised dedicated approach shares these integration advantages and additionally partitions data physically between different providers. However, economies of scale are reduced, as multiple individual instances of a case finding system must now be supported on either multiple physical or virtualised infrastructures. Strict access control and security must still be implemented.

Both centralised approaches are dependent on reliable network connections between provider clients and the centralised applications. As most case finding is not time critical, this should not be a major constraint.

The localised approach has the technical advantage that access control and security are implemented locally and can therefore use existing provider security management systems. However, outcome prediction data must now be transferred from central storage to local provider systems, which in terms of network utilisation and import/export effort can be onerous.

A case finding system should provide the following functionality:

• Allow definition of multiple groups based on configurable prediction outcome ranges
• Classify individual patients within a provider population into one of the groups based on their predictive value
• For each group, calculate basic statistics: the number of patients classified within the group, the mean predictive value and the standard deviation of the predictive value
• Textually and graphically display the statistics for each group
• List all patients within a group


• Textually and graphically display historical group data – highlighting statistical trends

• Textually and graphically display historical patient data – highlighting individual climbers and fallers

• Drill down from any group function to an individual patient – ideally this should allow linkage into the provider’s EMR

• Textually and graphically display historical data for an individual patient

Where urgent care clinical dashboards are being implemented, it is recommended that you consider linking your case finding system into them.

Technical approaches to risk stratification are similar to case finding. The major difference is that patients are not normally identifiable within risk stratification; it is the statistical properties of the groups or strata that are of interest.

In both case finding and risk stratification it is recommended that measures to record both the usage and utility of the systems are implemented. Usage is normally relatively easy to measure, as it relates to activities such as the number of user logons per week. Utility is more difficult to measure, but is crucial to evaluating the benefits a system is delivering. ITIL V3 provides a useful definition of utility and distinguishes it from the concept of "warranty":

Utility – fitness for purpose. Functionality offered by a product or service to meet a particular need. Utility is often summarized as “what it does”. Warranty – fitness for use. A promise or guarantee that a product or service will meet its agreed requirements. The availability, capacity, continuity and information security necessary to meet the customer’s requirements.

(ITIL V3)

Both soft measures of utility (asking users what they use within a product) and hard measures of utility (analysing product logs to see which product functions and features are actually used) can be employed. Where possible, it is recommended you try to measure the actual benefits delivered by case finding and risk stratification. The NHS Institute for Innovation and Improvement offers a useful methodology for measuring benefits. See: http://www.institute.nhs.uk/quality_and_service_improvement_tools/quality_and_service_improvement_tools/methodology_for_measuring_benefits.html
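The core of the case finding and risk stratification functionality listed above (classifying patients into configurable groups and computing per-group statistics) can be sketched as follows. The band names and boundaries are hypothetical, not from any national scheme.

```python
from statistics import mean, pstdev

# Sketch of the core stratification calculation: classify patients
# into configurable bands by prediction value, then compute per-band
# count, mean and standard deviation. Band boundaries are hypothetical.
BANDS = [("very high", 0.5, 1.0), ("high", 0.2, 0.5), ("moderate", 0.0, 0.2)]

def classify(predictions):
    """predictions: dict of patient reference -> prediction probability."""
    groups = {name: [] for name, _, _ in BANDS}
    for patient, p in predictions.items():
        for name, lo, hi in BANDS:
            # Half-open intervals, with 1.0 included in the top band.
            if lo <= p < hi or (hi == 1.0 and p == 1.0):
                groups[name].append((patient, p))
                break
    return groups

def band_statistics(groups):
    stats = {}
    for name, members in groups.items():
        values = [p for _, p in members]
        stats[name] = {
            "count": len(values),
            "mean": mean(values) if values else None,
            "stdev": pstdev(values) if values else None,
        }
    return stats

groups = classify({"P1": 0.72, "P2": 0.31, "P3": 0.05, "P4": 0.55})
stats = band_statistics(groups)
```

For case finding, the per-band member lists are displayed with patient identities; for risk stratification, only the per-band statistics would normally be exposed.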


9. Prediction User Adoption

A technically robust predictive modelling solution may still fail due to problems with user adoption. The usability of the solution by health care and/or social care professionals is a significant factor impacting end user adoption. You can improve usability by:

• Carrying out usability testing of the solution UI and checking conformance to the NHS Common User Interface (CUI) standards
• Designing the solution UI to conform to accepted industry UI guidance for the particular UI platform being used, for example Microsoft Windows, web browser or Android device
• Providing adequate end user training
• Providing ongoing end user support through a help desk

Introduction of a predictive modelling solution may well change the distribution of work in health care or social care teams. For example, the LTC guidance indicates that you need to look at the risk scores of the top 5% of patients in your population on a regular basis; this may be a significant volume, and you will need to work out how the local team(s) can work through these systematically.

A predictive modelling solution may be constrained in how quickly it can generate prediction values. This may impact the professional usefulness of the prediction values and consequently user adoption. If prediction values cannot be generated and used in a timely manner, then both the solution's warranty and utility will be perceived as poor by users.

Look for opportunities to present the prediction values in a context where they help decision making and prompt interventions. This normally involves integration with other systems that are routinely used by health care and social care professionals, for example including patient risk scores on an Urgent Care Clinical Dashboard.

Health care and social care leadership, engagement and change management will be key to any predictive modelling solution implementation. As with any programme/project, an established and proven methodology such as MSP/PRINCE2 is recommended, which reiterates common sense such as:

• Write a business case that clearly demonstrates it is worth doing
• Get a sponsor who is going to champion it
• Set down clear objectives
• Do some planning


10. References

These resources provide additional information and are referenced in the relevant sections of this document.

1. DH: QIPP Page. http://www.dh.gov.uk/en/Healthcare/Qualityandproductivity/QIPP/index.htm
2. NHS Networks: QIPP Digital Technology and Vision. http://www.networks.nhs.uk/nhs-networks/qipp-digital-technology-and-vision
3. Miller C., Reardon J. and Safi J. (2001). Risk Stratification: A Practical Guide for Clinicians. Cambridge University Press.
4. Lewis G., Curry N. and Bardsley M. (2011). Choosing a predictive risk model: a guide for commissioners in England. Nuffield Trust.
5. The Electoral Commission (2010). The completeness and accuracy of electoral registers in Great Britain.
6. DH (2006). Combined Predictive Model Final Report & Technical Documentation.
7. Walraven C., Dhalla I.A., Bell C., Etchells E., Stiell I.G., Zarnke K., Austin P.C., Forster A.J. (2010). Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community. CMAJ, 182(6): 551-557.
8. Donnan P.T., Dorward D.W.T., Mutch B., Morris A.D. (2008). Development and Validation of a Model for Predicting Emergency Admissions Over the Next Year (PEONY). Arch Intern Med, 168(13): 1416-1422.


Appendix A – CPM PMML

The following is a listing of a PMML version 4.0 definition of the Combined Predictive Model.

<?xml version="1.0"?>

<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >

<Header copyright="(c) Crown Copyright 2011" description="Combined Predictive Model">

<Application name="COMBINED PREDICTIVE MODEL FINAL REPORT AND TECHNICAL DOCUMENTATION" version="1.0"/> <Annotation>A PMML description of the Combined Predictive Model, taken from the final report published in

2006</Annotation> <Timestamp>02/12/2011</Timestamp>

</Header>

<DataDictionary numberOfFields="71"> <DataField name="TXID" displayName="Transaction ID which can be correlated to a patient identity" optype="categorical"

dataType="string" />

<DataField name="outcome" displayName="Probability of an emergency admission" optype="continuous" dataType="double" /> <DataField name="agegrp0004" displayName="Age 0-4" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="agegrp1539" displayName="Age 15-39" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="agegrp4059" displayName="Age 40-59" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="agegrp6064" displayName="Age 60-64" optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="agegrp6569" displayName="Age 65-69" optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="agegrp7074" displayName="Age 70-74" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="agegrp7579" displayName="Age 75-79" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="agegrp8084" displayName="Age 80-84" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField>

<DataField name="agegrp8589" displayName="Age 85-89" optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="agegrp9094" displayName="Age 90-94" optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="agegrp95pl" displayName="Age 95+" optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/>

</DataField> <DataField name="dem_gender" displayName="Gender" optype="categorical" dataType="integer">

<Value value="1" displayValue="Female"/> <Value value="0" displayValue="Otherwise"/>

</DataField> <DataField name="AE_Invst01_m03_flg" displayName="AE visit - Investigation X-ray - last 90 to 180 days"

optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="AE_ArrAmb_m02_flg" displayName="AE visit - Arrived by ambulance - last 30 to 90 days"

optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="AE_DispRef_m01_flg" displayName="AE visit - Disposal to Specialist - last 0 to 30 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="AE_DxMed_m02_flg" displayName="AE visit - Medical DX (non-injury) - last 30 to 90 days"

optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="AE_DxMed_m12_flg" displayName="AE visit - Medical DX (non-injury) - last 365 to 730 days"

optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="AE_NumVisit1_m06_flg" displayName="1 AE visit - last 180 to 365 days" optype="categorical"


dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="AE_NumVisit2_m06_flg" displayName="2 AE visits - last 180 to 365 days" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/>

</DataField> <DataField name="AE_NumVisit3pl_m06_flg" displayName="3+ AE visits - last 180 to 365 days" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="Ltc_copd" displayName="COPD (LTC)" optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="GP_dis47_y12" displayName="Psychoactive substance misuse disorder" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="GP_dis48_y12" displayName="Psychotic disorder" optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="creatin_3_y02" displayName="Glomerular Filtration Rate Group 3" optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/>

</DataField> <DataField name="ChrCnt1_flg" displayName="1 (from 8 - Asthma, Diabetes, COPD, CAD, CHF, Hypertension, Depression,

Cancer) LTC" optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="ChrCnt2pl_flg" displayName="2+ (from 8 - Asthma, Diabetes, COPD, CAD, CHF, Hypertension, Depression, Cancer) LTC" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField>

<DataField name="DisCnt7pl_flg" displayName="7+ distinct disorders (GP data)" optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="GP_POLY_0104_123" displayName="1-4 unique drugs in any month - last 0 to 90 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="GP_POLY_0509_123" displayName="5-9 unique drugs in any month - last 0 to 90 days" optype="categorical"

dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="GP_POLY_10pl_123" displayName="10+ unique drugs in any month - last 0 to 90 days" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="GP_drug36_m01_flg" displayName="Bronchodilator preparations - last 0 to 30 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField>

<DataField name="GP_drug36_m02_flg" displayName="Bronchodilator preparations - last 30 to 90 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="GP_drug36_m03_flg" displayName="Bronchodilator preparations - last 90 to 180 days" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="GP_drug36_m06_flg" displayName="Bronchodilator preparations - last 180 to 365 days" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="GP_drug36_m12_flg" displayName="Bronchodilator preparations - last 365 to 730 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="IP_DxMental_y12_flg" displayName="In-patient admission with diagnosis Mental illness - last 0 to 730

days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="DiagCnt2_flg" displayName="2 distinct in-patient primary diagnosis (any episode) - last 0 to 730 days"

optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="DiagCnt3_flg" displayName="3 distinct in-patient primary diagnosis (any episode) - last 0 to 730 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/>


</DataField>

<DataField name="DiagCnt4pl_flg" displayName="4+ distinct in-patient primary diagnosis (any episode) - last 0 to 730 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="IP_util_EHRG_m01_flg" displayName="Emergency admission for impactable condition (HRG code) - last 0 to

30 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="IP_util_EHRG_m02_flg" displayName="Emergency admission for impactable condition (HRG code) - last 30 to

90 days" optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="IP_util_EHRG_m03_flg" displayName="Emergency admission for impactable condition (HRG code) - last 90 to 180 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="IP_util_EHRG_m06_flg" displayName="Emergency admission for impactable condition (HRG code) - last 180 to 365 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="IP_util_E1pl_m01_flg" displayName="1+ Emergency admission - last 0 to 30 days" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/>

</DataField> <DataField name="IP_util_E1_m02_flg" displayName="1 Emergency admission - last 30 to 90 days" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="IP_util_E2pl_m02_flg" displayName="2+ Emergency admissions - last 30 to 90 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField>

<DataField name="IP_util_E1_m03_flg" displayName="1 Emergency admission - last 90 to 180 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="IP_util_E2pl_m03_flg" displayName="2+ Emergency admissions - last 90 to 180 days" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="IP_util_E1_m06_flg" displayName="1 Emergency admission - last 180 to 365 days" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="IP_util_E2_m06_flg" displayName="2 Emergency admissions - last 180 to 365 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="IP_util_E3pl_m06_flg" displayName="3+ Emergency admissions - last 180 to 365 days" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/>

</DataField> <DataField name="IP_util_E1_m12_flg" displayName="1 Emergency admission - last 365 to 730 days" optype="categorical"

dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="IP_util_E2_m12_flg" displayName="2 Emergency admissions - last 365 to 730 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField>

<DataField name="IP_util_E3pl_m12_flg" displayName="3+ Emergency admissions - last 365 to 730 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="IP_util_EpisperE3pl_flg" displayName="Average number of episodes per Emergency admissions >=3"

optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="IP_HospOE" displayName="Observed/Expected ratio for rate of rehospitalisation for hospital of last

admission" optype="continuous" dataType="double" /> <DataField name="OP_NumVisit1_m01_flg" displayName="1 out-patient specialty visit - last 0 to 30 days"

optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="OP_NumVisit2_m01_flg" displayName="2 out-patient specialty visits - last 0 to 30 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField>

<DataField name="OP_NumVisit3pl_m01_flg" displayName="3+ out-patient specialty visits - last 0 to 30 days"


optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="OP_NumVisit1_m02_flg" displayName="1 out-patient specialty visit - last 30 to 90 days"

optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/>

</DataField> <DataField name="OP_NumVisit2_m02_flg" displayName="2 out-patient specialty visits - last 30 to 90 days"

optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="OP_NumVisit3pl_m02_flg" displayName="3+ out-patient specialty visits - last 30 to 90 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField>

<DataField name="OP_NumVisit0105_m12_flg" displayName="1-5 out-patient specialty visits - last 365 to 730 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="OP_NumVisit0610_m12_flg" displayName="6-10 out-patient specialty visits - last 365 to 730 days"

optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="OP_NumVisit11pl_m12_flg" displayName="11+ out-patient specialty visits - last 365 to 730 days"

optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="OP_SrcRef5_m01_flg" displayName="OP visit - Source of referral not an Acc and Emergency - last 0 to 30 days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="OP_SrcRef5_m02_flg" displayName="OP visit - Source of referral not an Acc and Emergency - last 30 to 90

days" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/> <Value value="0" displayValue="No"/>

</DataField> <DataField name="Smoke_y02_Ltc_asth" displayName="Smoking status 'yes' last 0-365 days multiplied by Asthma (LTC)"

optype="categorical" dataType="integer"> <Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

<DataField name="Ltc_copd_11pl_OP_visits_y12" displayName="11+ OP visits (last 0 to 730 days) multiplied by COPD (LTC)" optype="categorical" dataType="integer">

<Value value="1" displayValue="Yes"/>

<Value value="0" displayValue="No"/> </DataField>

</DataDictionary>

<RegressionModel functionName="regression" modelName="DH_Combined_Predictive_Model" algorithmName="Unknown" modelType="logisticRegression" normalizationMethod="logit">

    <MiningSchema>
      <MiningField name="TXID" usageType="supplementary"/>
      <MiningField name="agegrp0004"/>
      <MiningField name="agegrp1539"/>
      <MiningField name="agegrp4059"/>
      <MiningField name="agegrp6064"/>
      <MiningField name="agegrp6569"/>
      <MiningField name="agegrp7074"/>
      <MiningField name="agegrp7579"/>
      <MiningField name="agegrp8084"/>
      <MiningField name="agegrp8589"/>
      <MiningField name="agegrp9094"/>
      <MiningField name="agegrp95pl"/>
      <MiningField name="dem_gender"/>
      <MiningField name="AE_Invst01_m03_flg"/>
      <MiningField name="AE_ArrAmb_m02_flg"/>
      <MiningField name="AE_DispRef_m01_flg"/>
      <MiningField name="AE_DxMed_m02_flg"/>
      <MiningField name="AE_DxMed_m12_flg"/>
      <MiningField name="AE_NumVisit1_m06_flg"/>
      <MiningField name="AE_NumVisit2_m06_flg"/>
      <MiningField name="AE_NumVisit3pl_m06_flg"/>
      <MiningField name="Ltc_copd"/>
      <MiningField name="GP_dis47_y12"/>
      <MiningField name="GP_dis48_y12"/>
      <MiningField name="creatin_3_y02"/>
      <MiningField name="ChrCnt1_flg"/>
      <MiningField name="ChrCnt2pl_flg"/>
      <MiningField name="DisCnt7pl_flg"/>
      <MiningField name="GP_POLY_0104_123"/>
      <MiningField name="GP_POLY_0509_123"/>
      <MiningField name="GP_POLY_10pl_123"/>
      <MiningField name="GP_drug36_m01_flg"/>
      <MiningField name="GP_drug36_m02_flg"/>
      <MiningField name="GP_drug36_m03_flg"/>
      <MiningField name="GP_drug36_m06_flg"/>
      <MiningField name="GP_drug36_m12_flg"/>
      <MiningField name="IP_DxMental_y12_flg"/>
      <MiningField name="DiagCnt2_flg"/>
      <MiningField name="DiagCnt3_flg"/>
      <MiningField name="DiagCnt4pl_flg"/>
      <MiningField name="IP_util_EHRG_m01_flg"/>
      <MiningField name="IP_util_EHRG_m02_flg"/>
      <MiningField name="IP_util_EHRG_m03_flg"/>
      <MiningField name="IP_util_EHRG_m06_flg"/>
      <MiningField name="IP_util_E1pl_m01_flg"/>
      <MiningField name="IP_util_E1_m02_flg"/>
      <MiningField name="IP_util_E2pl_m02_flg"/>
      <MiningField name="IP_util_E1_m03_flg"/>
      <MiningField name="IP_util_E2pl_m03_flg"/>
      <MiningField name="IP_util_E1_m06_flg"/>
      <MiningField name="IP_util_E2_m06_flg"/>
      <MiningField name="IP_util_E3pl_m06_flg"/>
      <MiningField name="IP_util_E1_m12_flg"/>
      <MiningField name="IP_util_E2_m12_flg"/>
      <MiningField name="IP_util_E3pl_m12_flg"/>
      <MiningField name="IP_util_EpisperE3pl_flg"/>
      <MiningField name="IP_HospOE"/>
      <MiningField name="OP_NumVisit1_m01_flg"/>
      <MiningField name="OP_NumVisit2_m01_flg"/>
      <MiningField name="OP_NumVisit3pl_m01_flg"/>
      <MiningField name="OP_NumVisit1_m02_flg"/>
      <MiningField name="OP_NumVisit2_m02_flg"/>
      <MiningField name="OP_NumVisit3pl_m02_flg"/>
      <MiningField name="OP_NumVisit0105_m12_flg"/>
      <MiningField name="OP_NumVisit0610_m12_flg"/>
      <MiningField name="OP_NumVisit11pl_m12_flg"/>
      <MiningField name="OP_SrcRef5_m01_flg"/>
      <MiningField name="OP_SrcRef5_m02_flg"/>
      <MiningField name="Smoke_y02_Ltc_asth"/>
      <MiningField name="Ltc_copd_11pl_OP_visits_y12"/>
      <MiningField name="outcome" usageType="predicted"/>
    </MiningSchema>

    <RegressionTable intercept="-3.822847424">
      <NumericPredictor name="agegrp0004" coefficient="0.289313618"/>
      <NumericPredictor name="agegrp1539" coefficient="0.385646264"/>
      <NumericPredictor name="agegrp4059" coefficient="0.373238463"/>
      <NumericPredictor name="agegrp6064" coefficient="0.630720996"/>
      <NumericPredictor name="agegrp6569" coefficient="0.481813417"/>
      <NumericPredictor name="agegrp7074" coefficient="0.507764968"/>
      <NumericPredictor name="agegrp7579" coefficient="0.813038432"/>
      <NumericPredictor name="agegrp8084" coefficient="0.959893138"/>
      <NumericPredictor name="agegrp8589" coefficient="0.896645136"/>
      <NumericPredictor name="agegrp9094" coefficient="1.289601194"/>
      <NumericPredictor name="agegrp95pl" coefficient="1.416839346"/>
      <NumericPredictor name="dem_gender" coefficient="0.01177781"/>
      <NumericPredictor name="AE_Invst01_m03_flg" coefficient="0.216051313"/>
      <NumericPredictor name="AE_ArrAmb_m02_flg" coefficient="0.187349103"/>
      <NumericPredictor name="AE_DispRef_m01_flg" coefficient="0.632032184"/>
      <NumericPredictor name="AE_DxMed_m02_flg" coefficient="0.223716412"/>
      <NumericPredictor name="AE_DxMed_m12_flg" coefficient="0.321316757"/>
      <NumericPredictor name="AE_NumVisit1_m06_flg" coefficient="0.042763882"/>
      <NumericPredictor name="AE_NumVisit2_m06_flg" coefficient="0.290049439"/>
      <NumericPredictor name="AE_NumVisit3pl_m06_flg" coefficient="0.507442635"/>
      <NumericPredictor name="Ltc_copd" coefficient="0.171100735"/>
      <NumericPredictor name="GP_dis47_y12" coefficient="0.54193793"/>
      <NumericPredictor name="GP_dis48_y12" coefficient="0.528176075"/>
      <NumericPredictor name="creatin_3_y02" coefficient="0.264393977"/>
      <NumericPredictor name="ChrCnt1_flg" coefficient="0.119184904"/>
      <NumericPredictor name="ChrCnt2pl_flg" coefficient="0.212972337"/>
      <NumericPredictor name="DisCnt7pl_flg" coefficient="0.096414136"/>
      <NumericPredictor name="GP_POLY_0104_123" coefficient="0.137302707"/>
      <NumericPredictor name="GP_POLY_0509_123" coefficient="0.388366204"/>
      <NumericPredictor name="GP_POLY_10pl_123" coefficient="0.490961533"/>
      <NumericPredictor name="GP_drug36_m01_flg" coefficient="0.230925277"/>
      <NumericPredictor name="GP_drug36_m02_flg" coefficient="0.397601369"/>
      <NumericPredictor name="GP_drug36_m03_flg" coefficient="0.339967925"/>
      <NumericPredictor name="GP_drug36_m06_flg" coefficient="-0.403051621"/>
      <NumericPredictor name="GP_drug36_m12_flg" coefficient="-0.176615641"/>
      <NumericPredictor name="IP_DxMental_y12_flg" coefficient="0.282235541"/>
      <NumericPredictor name="DiagCnt2_flg" coefficient="0.132210548"/>
      <NumericPredictor name="DiagCnt3_flg" coefficient="0.129497741"/>
      <NumericPredictor name="DiagCnt4pl_flg" coefficient="0.27788729"/>
      <NumericPredictor name="IP_util_EHRG_m01_flg" coefficient="0.482474391"/>
      <NumericPredictor name="IP_util_EHRG_m02_flg" coefficient="0.265806985"/>
      <NumericPredictor name="IP_util_EHRG_m03_flg" coefficient="0.260367409"/>
      <NumericPredictor name="IP_util_EHRG_m06_flg" coefficient="0.336849464"/>
      <NumericPredictor name="IP_util_E1pl_m01_flg" coefficient="0.948115234"/>
      <NumericPredictor name="IP_util_E1_m02_flg" coefficient="0.476647042"/>
      <NumericPredictor name="IP_util_E2pl_m02_flg" coefficient="1.11137369"/>
      <NumericPredictor name="IP_util_E1_m03_flg" coefficient="0.346261242"/>
      <NumericPredictor name="IP_util_E2pl_m03_flg" coefficient="0.567774763"/>
      <NumericPredictor name="IP_util_E1_m06_flg" coefficient="0.20977492"/>
      <NumericPredictor name="IP_util_E2_m06_flg" coefficient="0.352014497"/>
      <NumericPredictor name="IP_util_E3pl_m06_flg" coefficient="0.350301843"/>
      <NumericPredictor name="IP_util_E1_m12_flg" coefficient="0.312027413"/>
      <NumericPredictor name="IP_util_E2_m12_flg" coefficient="0.32371827"/>
      <NumericPredictor name="IP_util_E3pl_m12_flg" coefficient="0.483011573"/>
      <NumericPredictor name="IP_util_EpisperE3pl_flg" coefficient="0.30864326"/>
      <NumericPredictor name="IP_HospOE" coefficient="0.721855529"/>
      <NumericPredictor name="OP_NumVisit1_m01_flg" coefficient="0.116311541"/>
      <NumericPredictor name="OP_NumVisit2_m01_flg" coefficient="0.178716728"/>
      <NumericPredictor name="OP_NumVisit3pl_m01_flg" coefficient="0.291934635"/>
      <NumericPredictor name="OP_NumVisit1_m02_flg" coefficient="0.150062538"/>
      <NumericPredictor name="OP_NumVisit2_m02_flg" coefficient="0.151688397"/>
      <NumericPredictor name="OP_NumVisit3pl_m02_flg" coefficient="0.611030329"/>
      <NumericPredictor name="OP_NumVisit0105_m12_flg" coefficient="0.182179996"/>
      <NumericPredictor name="OP_NumVisit0610_m12_flg" coefficient="0.186734201"/>
      <NumericPredictor name="OP_NumVisit11pl_m12_flg" coefficient="0.364758425"/>
      <NumericPredictor name="OP_SrcRef5_m01_flg" coefficient="0.101656166"/>
      <NumericPredictor name="OP_SrcRef5_m02_flg" coefficient="0.322319293"/>
      <NumericPredictor name="Smoke_y02_Ltc_asth" coefficient="0.355326713"/>
      <NumericPredictor name="Ltc_copd_11pl_OP_visits_y12" coefficient="-0.736470348"/>
    </RegressionTable>

</RegressionModel>

</PMML>
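The model above is intended for consumption by any PMML-aware scoring engine. As a rough illustration of what such an engine does with a `RegressionModel` that has `functionName="regression"` and `normalizationMethod="logit"`, the sketch below parses a small fragment of the `RegressionTable` (three predictors copied from Appendix A; the full model has around seventy) with Python's standard library and applies the logit transform, p = 1 / (1 + e^(-z)), where z is the intercept plus the sum of coefficient x value for each predictor. This is a minimal sketch for orientation only, not a substitute for a validated PMML engine.

```python
import math
import xml.etree.ElementTree as ET

# A small fragment of the RegressionTable in Appendix A (three predictors
# only, for illustration; the full table has ~70 NumericPredictor entries).
FRAGMENT = """
<RegressionTable intercept="-3.822847424">
  <NumericPredictor name="agegrp7074" coefficient="0.507764968"/>
  <NumericPredictor name="AE_ArrAmb_m02_flg" coefficient="0.187349103"/>
  <NumericPredictor name="Ltc_copd" coefficient="0.171100735"/>
</RegressionTable>
"""

def score(table_xml: str, features: dict) -> float:
    """Score one record: z = intercept + sum(coef * x), then logit-normalise."""
    table = ET.fromstring(table_xml)
    z = float(table.get("intercept"))
    for pred in table.findall("NumericPredictor"):
        # Absent features are treated as 0 (flag not set) in this sketch.
        z += float(pred.get("coefficient")) * features.get(pred.get("name"), 0.0)
    return 1.0 / (1.0 + math.exp(-z))  # normalizationMethod="logit"

# Illustrative record: aged 70-74, arrived by ambulance 30-90 days ago, no COPD.
p = score(FRAGMENT, {"agegrp7074": 1, "AE_ArrAmb_m02_flg": 1})
print(p)
```

Note that with only a subset of predictors the result is not a valid CPM risk score; the point is solely to show how the intercept, coefficients, and logit normalisation combine.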


Appendix B – CPM Test Data

TXID,agegrp0004,agegrp1539,agegrp4059,agegrp6064,agegrp6569,agegrp7074,agegrp7579,agegrp8084,agegrp8589,agegrp9094,agegrp95pl,dem_gender,AE_Invst01_m03_flg,AE_ArrAmb_m02_flg,AE_DispRef_m01_flg,AE_DxMed_m02_flg,AE_DxMed_m12_flg,AE_NumVisit1_m06_flg,AE_NumVisit2_m06_flg,AE_NumVisit3pl_m06_flg,Ltc_copd,GP_dis47_y12,GP_dis48_y12,creatin_3_y02,ChrCnt1_flg,ChrCnt2pl_flg,DisCnt7pl_flg,GP_POLY_0104_123,GP_POLY_0509_123,GP_POLY_10pl_123,GP_drug36_m01_flg,GP_drug36_m02_flg,GP_drug36_m03_flg,GP_drug36_m06_flg,GP_drug36_m12_flg,IP_DxMental_y12_flg,DiagCnt2_flg,DiagCnt3_flg,DiagCnt4pl_flg,IP_util_EHRG_m01_flg,IP_util_EHRG_m02_flg,IP_util_EHRG_m03_flg,IP_util_EHRG_m06_flg,IP_util_E1pl_m01_flg,IP_util_E1_m02_flg,IP_util_E2pl_m02_flg,IP_util_E1_m03_flg,IP_util_E2pl_m03_flg,IP_util_E1_m06_flg,IP_util_E2_m06_flg,IP_util_E3pl_m06_flg,IP_util_E1_m12_flg,IP_util_E2_m12_flg,IP_util_E3pl_m12_flg,IP_util_EpisperE3pl_flg,IP_HospOE,OP_NumVisit1_m01_flg,OP_NumVisit2_m01_flg,OP_NumVisit3pl_m01_flg,OP_NumVisit1_m02_flg,OP_NumVisit2_m02_flg,OP_NumVisit3pl_m02_flg,OP_NumVisit0105_m12_flg,OP_NumVisit0610_m12_flg,OP_NumVisit11pl_m12_flg,OP_SrcRef5_m01_flg,OP_SrcRef5_m02_flg,Smoke_y02_Ltc_asth,Ltc_copd_11pl_OP_visits_y12
"TXID0001",0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1.0,0,0,1,0,0,0,0,0,1,0,0,0,0
"TXID0002",0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,0,0,0,0,0,0,0,0,1,0,0,0,0
"TXID0003",0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1.0,0,0,0,0,0,0,1,0,0,0,0,1,0
"CPM Final Report Example",0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0.899824278,0,0,0,0,1,0,0,0,1,0,0,0,0
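The test rows are easier to inspect when each value is paired with its column name. The sketch below does this with Python's standard `csv` module, using the Appendix B header and the first test record reproduced verbatim; it then lists the columns that are set (binary flags equal to 1, plus the continuous `IP_HospOE` ratio, which happens to be 1.0 for this record). The approach is a minimal illustration, not part of the published guidance.

```python
import csv
import io

# Header and first test record, reproduced verbatim from Appendix B.
HEADER = "TXID,agegrp0004,agegrp1539,agegrp4059,agegrp6064,agegrp6569,agegrp7074,agegrp7579,agegrp8084,agegrp8589,agegrp9094,agegrp95pl,dem_gender,AE_Invst01_m03_flg,AE_ArrAmb_m02_flg,AE_DispRef_m01_flg,AE_DxMed_m02_flg,AE_DxMed_m12_flg,AE_NumVisit1_m06_flg,AE_NumVisit2_m06_flg,AE_NumVisit3pl_m06_flg,Ltc_copd,GP_dis47_y12,GP_dis48_y12,creatin_3_y02,ChrCnt1_flg,ChrCnt2pl_flg,DisCnt7pl_flg,GP_POLY_0104_123,GP_POLY_0509_123,GP_POLY_10pl_123,GP_drug36_m01_flg,GP_drug36_m02_flg,GP_drug36_m03_flg,GP_drug36_m06_flg,GP_drug36_m12_flg,IP_DxMental_y12_flg,DiagCnt2_flg,DiagCnt3_flg,DiagCnt4pl_flg,IP_util_EHRG_m01_flg,IP_util_EHRG_m02_flg,IP_util_EHRG_m03_flg,IP_util_EHRG_m06_flg,IP_util_E1pl_m01_flg,IP_util_E1_m02_flg,IP_util_E2pl_m02_flg,IP_util_E1_m03_flg,IP_util_E2pl_m03_flg,IP_util_E1_m06_flg,IP_util_E2_m06_flg,IP_util_E3pl_m06_flg,IP_util_E1_m12_flg,IP_util_E2_m12_flg,IP_util_E3pl_m12_flg,IP_util_EpisperE3pl_flg,IP_HospOE,OP_NumVisit1_m01_flg,OP_NumVisit2_m01_flg,OP_NumVisit3pl_m01_flg,OP_NumVisit1_m02_flg,OP_NumVisit2_m02_flg,OP_NumVisit3pl_m02_flg,OP_NumVisit0105_m12_flg,OP_NumVisit0610_m12_flg,OP_NumVisit11pl_m12_flg,OP_SrcRef5_m01_flg,OP_SrcRef5_m02_flg,Smoke_y02_Ltc_asth,Ltc_copd_11pl_OP_visits_y12"
ROW = '"TXID0001",0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1.0,0,0,1,0,0,0,0,0,1,0,0,0,0'

# Pair each value with its column name.
rec = next(csv.DictReader(io.StringIO(HEADER + "\n" + ROW)))

# Columns that are "on" for this record (flags set to 1, or the 1.0 ratio).
active = [name for name, v in rec.items() if v in ("1", "1.0")]
print(rec["TXID"], active)
```

For "TXID0001" this surfaces, among others, `agegrp7074` and `AE_ArrAmb_m02_flg`, which is far easier to check against the model's `MiningSchema` than counting commas in the raw row.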