

An Error Detecting and Tagging Framework for Reducing Data Entry Errors in Electronic Medical

Records (EMR) System

Yuan Ling, Yuan An College of Computing and Informatics, Drexel University

Philadelphia, USA {yl638, ya45}@drexel.edu

Mengwen Liu, Xiaohua Hu College of Computing and Informatics, Drexel University

Philadelphia, USA {ml943, xh29}@drexel.edu

Abstract—We develop an error detecting and tagging framework for reducing data entry errors in Electronic Medical Records (EMR) systems. We propose a taxonomy of data errors with three levels: Incorrect Format and Missing errors, Out of Range errors, and Inconsistent errors. We aim to address the challenging problem of detecting erroneous input values that look statistically normal but are abnormal in a medical sense. Detecting such errors requires taking the patient's medical history and population data into consideration. In particular, we propose a probabilistic method based on the assumption that the input value for a field depends on the historical records of that field and is affected by other fields through dependency relationships. We evaluate our methods using data collected from an EMR system. The results show that the method is promising for automatic data entry error detection.

Keywords—Data Entry Errors; Electronic Medical Records (EMR); Error Detecting; Error Tagging

I. INTRODUCTION

Currently, large-scale data analysis is conducted in almost every area. Advanced analytical processes and data mining techniques have been used to discover valuable information and knowledge from large-scale data. However, most data mining techniques assume that the data have already been cleaned [1]. This is not always true in real situations. The data preparation process and data quality issues are critical to reliable data mining results, because data with errors compromise the credibility of those results. The data lifecycle of an application typically consists of the following steps: data collection, transformation, storage, auditing, cleaning, and analysis for decision making [2]. Data errors can be reduced at each step of the data lifecycle. Since data collection is the first step of the lifecycle, reducing data entry errors at the data collection stage is the earliest opportunity to ensure data quality.

Error and outlier detection have been studied in various application domains, such as credit card fraud detection [3], network intrusion detection [4], and data cleaning in data warehouses [5]. In general, errors can be classified into three types: point errors, contextual errors, and collective errors [6].

In the healthcare domain, data errors in electronic medical databases have been studied recently [7, 8]. Generally, Electronic Medical Records (EMR) data errors have been classified as incomplete, inaccurate, or inconsistent [9]. More specifically, based on EMR visit notes, a previous study [10] classifies data entry errors into Inconsistent Information, Miscellaneous Errors, Missing Sections, Incomplete Information, and Incorrect Information. For our work, we conducted a comprehensive literature review and summarized the different levels of data errors in a more detailed manner. The three proposed levels of data entry errors are shown in Table I, with example errors discussed in the following paragraphs.

TABLE I. LEVELS OF DATA ENTRY ERRORS

Level 1: Incorrect Format and Missing
    L1-1: Incorrect Format Data
    L1-2: Missing Data
Level 2: Out of Range
    L2-1: Out of normal range
    L2-2: Out of population trends
    L2-3: Out of personal trend
Level 3: Inconsistent
    L3-1: Personal Inconsistent
    L3-2: Population Inconsistent

L1-1 errors can be easily controlled with prior knowledge about the data format: constraints can be set for each field, e.g., only numbers are allowed in the blood pressure field. L2-1 errors can also be easily controlled as long as there is knowledge about the normal range of each field; for example, height values can be constrained to the range of 0 to 3 meters. Detecting L1-2 errors, however, is a difficult research task, but we can learn to detect missing data according to clinical guidelines and caregiving practices. Once the missing values are detected, we can set constraints on the particular field as we did for L1-1 and L2-1. For L2-2 errors, using statistics to detect erroneous values of a single attribute is well studied, and robust results can be obtained; for example, L2-2 errors can be detected with a univariate outlier detection technique called Hampel X84 [11]. In this paper, we do not focus on solving the L1-1 to L2-2 error problems.
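To illustrate how an L2-2 check of this kind operates, the Hampel X84 rule flags values lying more than a chosen number of scaled median absolute deviations (MADs) from the median. The following is our own minimal Python sketch of that rule, not the paper's implementation:

```python
import statistics

def hampel_x84(values, k=3.0):
    """Flag values lying more than k scaled MADs from the median (Hampel X84 rule).

    Returns a list of booleans, True where the value is a suspected outlier.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # 1.4826 makes the MAD consistent with a Gaussian std. dev.
    return [abs(v - med) > k * scale for v in values]

# A diastolic blood pressure reading of 150 stands out against typical values.
flags = hampel_x84([70, 72, 71, 69, 150])
```

Because the rule is univariate, it can only catch values far from the population distribution, which is exactly why it cannot detect the L2-3 and L3-x errors discussed next.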

The challenge lies in detecting values that look statistically normal but are abnormal in a medical sense, such as L2-3, L3-1, and L3-2 errors. For example, suppose the weight of a patient increases from 180 lbs to 230 lbs in three months. The weight value of 230 lbs is within the normal population distribution range, but for this particular patient it is suspicious if he or she has a healthy lifestyle and no major medical problems. This is categorized as an L2-3 error. Other examples follow. A record showing that a male patient has the status of pregnancy would be considered an L3-1 error. An obese and diabetic patient with a BMI (Body Mass Index) larger than 35 is likely to have a cardiovascular disease manifested as hypertension and dyslipidemia. If the patient's blood pressure values are significantly low, yet still in the normal range, the blood pressure would be inconsistent with the BMI value from the population perspective. This signals an L3-2 type of error.

2013 IEEE International Conference on Bioinformatics and Biomedicine
978-1-4799-1310-7/13/$31.00 ©2013 IEEE

In this paper, we propose a real-time error detecting and tagging mechanism to address these challenges, based on historical data and dependency relationships. The entire framework consists of three main components:

1) Find a similar patient group. Error detection is based on historical datasets. For a single patient, the historical datasets should be composed of records from patients who have equivalent medical situations, clinical conditions, and basic demographic information. Because it is difficult to find a group of patients with exactly the same medical conditions, we detect errors based on a dataset from patients with similar conditions.

2) Learn a probabilistic model. Based on the dataset from a group of similar patients, we learn the parameters of our proposed model. The probabilistic model is used to calculate the probability of the input value for each field.

3) Return a tag. Error tagging is based on the results of the probabilistic model. If the probability of an input value for a field is relatively high, the value has a higher chance of being correct and a "Normal" tag is returned; if the probability is relatively low, the tag is "Suspicious". We use a threshold to decide which tag should be returned.

Our experimental results show that our error detecting and tagging method is promising. The predictive accuracy of the probabilistic model is between 70% and 85% on our experimental data (explained in the Evaluation section). We also compare our method with existing methods; the results indicate that our method performs better.

In summary, the contributions of our paper are:

1) We summarize different levels of data entry errors based on a comprehensive literature review. Our method focuses on detecting and tagging errors that cannot be detected by simple statistical methods.

2) We design a probabilistic model based on two assumptions: first, the current patient's input value for a field depends on his/her historical records for this field; and second, the input value for a field depends on values from other fields. We combine the influences from these two aspects in our model.

3) We develop an error detecting and tagging framework with five stages: selecting a classification model, finding a similar patient group, building a probabilistic model, setting a threshold, and returning tags.

4) We test our method on a dataset from an EMR system. Our experimental results show that the predictive accuracy of our probabilistic model is better than that of a Bayesian network. Our model also generates better results than an alternative method that assumes the input value for a field is based solely on historical records.

The rest of the paper is organized as follows: Section II describes related work. Section III presents our error detecting and tagging framework. Section IV introduces our probabilistic model and error detecting method. Section V presents our evaluation process and experimental results. Section VI presents our conclusions and future research directions.

II. RELATED WORK

Statistical methods have been used to analyze and detect data errors at the data cleaning stage [12-14]. However, methods are also needed to reduce data entry errors at the data collection stage [15].

There are different ways to detect errors and control data quality, including classification based on neural networks, Bayesian networks [16], support vector machines, rule-based methods, unsupervised clustering, and instance-based methods. The Usher system was developed to detect errors in form entry fields using a Bayesian network and a graphical model with explicit error modeling [17]. The system includes a probabilistic error model based on a Bayesian network for estimating the contextualized error likelihood of each field on a form. The method uses a Bayesian network to learn dependency relationships among fields, but it does not take the historical records of a particular field into consideration. In this paper, we detect errors based on dependency relationships among fields as well as historical data for a particular field. In addition, we classify patients based on basic information to cluster them into similar groups before we apply error detection. As our method is probabilistic in nature, selecting the right dataset for estimating the probabilities is critical. For example, it is unreasonable to predict the input value for a 50-year-old person using a dataset of people whose ages are around 10.

Using a tagging mechanism has been shown to decrease certain types of errors [18]. Data entry interfaces can be designed to help improve data quality. We propose a system that classifies the input value of each field as "Suspicious" or "Normal" and provides this information as feedback to the user. A "Suspicious" tag reminds the user to check the input value and consequently helps reduce data entry errors.

III. ERROR DETECTING AND TAGGING FRAMEWORK

A. Error Detecting and Tagging Framework

Our error detecting and tagging framework is presented in Fig. 1. A data entry interface contains multiple fields. After a user inputs a value for a field, a tag ("Normal" or "Suspicious") will be returned.



Fig. 1. Error Detecting and Tagging Framework

Fig. 2. Historical Records for the Error Detecting Task

Here we use a simple example to illustrate the historical data format and the error detecting task. Fig. 2 shows a set of records for 3 different patients. There are five fields in each record: Date (observation date), BP_S (systolic blood pressure), BP_D (diastolic blood pressure), BMI, and Weight. Our error detecting task is to tell whether the patient with ID = 3 has a normal or suspicious input value for BP_S, BP_D, BMI, and Weight on the observation date 1/13/2010. The feedback is generated based on the previous 8 records of the patient with ID = 3, as well as the other 11 records of the patients with ID = 1 and ID = 2, assuming these three patients have similar conditions.

Based on the data, we will learn dependency relationships among BP_S, BP_D, BMI and Weight. These relationships are captured in a probabilistic graphical model.

B. Error Detecting Process

Based on our error detecting and tagging framework, we propose the error detecting process shown in Fig. 3. A patient is modeled as a set of data fields describing the patient's demographic information and clinical conditions. A form is used to collect a set of specific types of information during an encounter, for example, blood pressure, weight, etc. For our method, we divide the entire patient information into two categories: independent fields, which are considered fixed during the error detection process, and dependent fields, which will be collected and flagged as "Normal" or "Suspicious".

Fig. 3. Error Detecting Process

As shown in Fig. 3, our error detecting process consists of the following five steps. First, patient records are categorized into groups based on the basic independent fields of the current patient and other historical records, using a classification model. Second, we find a similar patient group for the current patient. Third, we build a probabilistic model based on the clinical fields and historical records of the group of similar patients. Fourth, we set a threshold for the probability of the input value. Finally, based on the results of the probabilistic model and the threshold setting, we return a "Normal" or "Suspicious" tag to the current user.
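The five steps above can be sketched as a single pipeline. Every helper passed in below (classify, build_model, threshold_rule) is a hypothetical placeholder for a component described in the text, not code from the actual system:

```python
def detect_and_tag(value, patient, records, classify, build_model, threshold_rule):
    """Hypothetical sketch of the five-step error detecting process.

    classify, build_model, and threshold_rule stand in for the classification
    model, the probabilistic model, and the threshold-setting rule.
    """
    group = classify(patient)                          # steps 1-2: find the similar patient group
    history = [r for r in records if classify(r) == group]
    q = build_model(history)                           # step 3: probabilistic model over values
    theta = threshold_rule(q, history)                 # step 4: set a probability threshold
    return "Normal" if q(value) >= theta else "Suspicious"  # step 5: return a tag
```

The sketch makes explicit that the model is rebuilt per similar-patient group rather than once for the whole population.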

IV. DATA ENTRY ERROR DETECTING METHOD

A. Data Entry Error Detecting Method

Given a data entry interface $I$, let $F = \{F_v, \ldots, F_j, F_k, \ldots\}$ be a set of input fields on the interface $I$. For example, in Fig. 4, the field for entering the "Weight" value can be represented as $F_{Weight}$.

Fig. 4. A Data Entry Form

Fig. 5. An Example of Relationships Among Fields

For a single field $F_v$, we can use a sequence of values $v = \{v_1, \ldots, v_{i-1}, v_i\}$ to represent all historical records for a patient, where $v_i$ is the current value entered by the user for the field $F_v$, $\{v_1, \ldots, v_{i-1}\}$ is the historical record sequence for the field $F_v$, and $v_{i-1}$ is the value entered for the field $F_v$ at the $(i-1)$-th time.

We assume that a patient's current input value for a field depends on the historical input records of this field; abnormal changes of input values compared with historical records need attention. In probability theory, such a sequentially dependent process can be described by a Markov model. According to the 1st-order Markov assumption, the current value of the field $F_v$ at the $i$-th time depends only on the value of the same field at the $(i-1)$-th time. Therefore, for a given input value $v_i$, we can use the value $v_{i-1}$ to predict the probability of the value $v_i$ for the field $F_v$ at the current time; this probability is denoted $q_m(v_i \mid v_{i-1})$. Similarly, under the 2nd-order Markov assumption we have the probability $q_m(v_i \mid v_{i-1}, v_{i-2})$. For a new patient, however, there are no historical records for the field $F_v$, so we assume that the current value $v_i$ at the $i$-th time does not depend on previous historical data from the patient; the predicted probability is then represented as $q_m(v_i)$.

Our goal is to predict the probability of an input value $v_i$ for the field $F_v$, represented as $q_M(v_i)$. Using the 1st-order Markov assumption, the 2nd-order Markov assumption, and a prior probability, we calculate $q_M(v_i)$ as in Equation (1):

$q_M(v_i \mid all\_conditions) = \alpha_1 \times q_m(v_i \mid all\_conditions) + \alpha_2 \times q_m(v_i \mid v_{i-1}, all\_conditions) + \alpha_3 \times q_m(v_i \mid v_{i-1}, v_{i-2}, all\_conditions)$  (1)



Here "all_conditions" means that the probability is calculated based on a dataset only from patients who have comparable demographic and clinical conditions. In the following equations, we omit "all_conditions" from each term, assuming it has been taken into consideration. The three parameters $\alpha_1$, $\alpha_2$, and $\alpha_3$ are used to balance the three probabilities [24], with $\alpha_1 + \alpha_2 + \alpha_3 = 1$. Given a set of records, we can estimate each probability according to Equations (2), (3), and (4), respectively.

$q_m(v_i \mid v_{i-1}, v_{i-2}) = \dfrac{N(v_{i-2}, v_{i-1}, v_i)}{N(v_{i-2}, v_{i-1})}$  (2)

$q_m(v_i \mid v_{i-1}) = \dfrac{N(v_{i-1}, v_i)}{N(v_{i-1})}$  (3)

$q_m(v_i) = \dfrac{N(v_i)}{N(all)}$  (4)

For Equations (2) and (3), $N(v_{i-2}, v_{i-1})$ is the number of occurrences of the input value $v_{i-2}$ for the field $F_v$ at the $(i-2)$-th time followed by the input value $v_{i-1}$ at the $(i-1)$-th time for a patient. Similarly, $N(v_{i-2}, v_{i-1}, v_i)$ is the number of occurrences of the value sequence $\{v_{i-2}, v_{i-1}, v_i\}$ at the $(i-2)$-th, $(i-1)$-th, and $i$-th times for a patient. For Equation (4), $N(all)$ is the total number of values of the field $F_v$ in the historical records of all similar patients, i.e., a group of patients who are similar to the current patient in terms of demographic and clinical conditions such as age, gender, disease conditions, medications, treatments, social activities, self-management, etc. $N(v_i)$ is the number of occurrences of $v_i$ for the field $F_v$ in all historical records of all similar patients.

Equation (1) only uses the historical records of a single field. However, there are dependency relationships among the different fields of a data entry interface $I$. The relationships among fields can be obtained from prior expert knowledge or learned from historical data [17]; they can be learned from historical records as a Bayesian network using the Max-Min Hill-Climbing (MMHC) method [19]. We implemented this using the R packages bnlearn [20] and gRain [21]. An example of learned relationships among fields is shown in Fig. 5. A Conditional Probability Table (CPT) is also learned for the Bayesian network.

For the example shown in Fig. 5, the value $v_i$ of $F_v$ depends on the values of $F_w$ and $F_u$, so we can represent the probability as $q_B(v_i \mid w_i, u_i)$. Adding this probability to Equation (1), we get Equation (5):

$q(v_i) = \beta \times q_M(v_i) + \gamma \times q_B(v_i \mid w_i, u_i)$  (5)

where the two parameters $\beta$ and $\gamma$ satisfy $\beta + \gamma = 1$. We already have the value of $q_M(v_i)$ from Equation (1), and we can get the value of $q_B(v_i \mid w_i, u_i)$ from the CPT of the learned Bayesian network in Fig. 5. By plugging in these values, we compute the final probability $q(v_i \mid all\_conditions)$.
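A minimal sketch of Equation (5) follows, with the CPT term estimated as a conditional frequency table over (w, u, v) triples. The table construction is our own assumption for illustration; the paper learns the CPT with bnlearn/gRain in R:

```python
from collections import Counter

def learn_cpt(triples):
    """Build q_B(v | w, u) as a conditional frequency table from (w, u, v) observations."""
    joint, parents = Counter(), Counter()
    for w, u, v in triples:
        joint[(w, u, v)] += 1
        parents[(w, u)] += 1
    def q_b(v, w, u):
        return joint[(w, u, v)] / parents[(w, u)] if parents[(w, u)] else 0.0
    return q_b

def q_final(q_m, q_b, beta=0.75, gamma=0.25):
    """Eq. (5): q(v_i) = beta * q_M(v_i) + gamma * q_B(v_i | w_i, u_i), with beta + gamma = 1."""
    return beta * q_m + gamma * q_b
```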

B. Error Tagging Method

We assume that for each field $F_v$ the set of possible values is finite and discrete; if the values of a field are continuous, we apply appropriate methods to discretize them. Let $V = \{v_i', v_i'', \ldots\}$ be the set of possible values for a field $F_v$. The current input value $v_i$ at the $i$-th time is a member of this set. For each possible input value in the set $V$, we can calculate its probability based on Equation (5). We obtain a tag for the current input value $v_i$ of the field $F_v$ as in Equation (6):

$tag(v_i) = \begin{cases} Normal & q(v_i) \geq \theta \\ Suspicious & q(v_i) < \theta \end{cases}$  (6)

For the threshold value $\theta$ in Equation (6), several methods can be applied, for example the 80/20 rule, a Linear Utility Function [22], or Maximum Likelihood Estimation [23]. In this paper, we use the 80/20 rule to dynamically decide the threshold value: if a value's probability is among the highest 20% of the candidate probabilities, the value is tagged "Normal"; otherwise it is tagged "Suspicious".
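Under our reading of this 80/20 rule, the dynamic threshold $\theta$ is the probability at the top-20% rank of the candidate values. A hedged Python sketch:

```python
def tag_by_8020(value, candidates, q, top_fraction=0.2):
    """Tag a value 'Normal' if its probability ranks within the top 20% of
    candidate probabilities (dynamic theta, Eq. 6), else 'Suspicious'."""
    probs = sorted((q(c) for c in candidates), reverse=True)
    cutoff = max(int(len(probs) * top_fraction), 1)  # how many candidates count as 'Normal'
    theta = probs[cutoff - 1]
    return "Normal" if q(value) >= theta else "Suspicious"
```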

V. EVALUATION

The previous sections described the theoretical derivation of the model and the computational steps. In this section, we present an experiment using a limited dataset to evaluate the performance of the model; we aimed to gain first-hand practical confidence through a pilot study. The limitation of the data is two-fold. First, we only obtained data for a small number of people, about 1,448. Second, we consider a patient model with a small number of fields, including age, sex, blood pressure, weight, and BMI. The experiment is by no means thorough and complete, because a realistic model for describing a patient is far more complicated than just several fields. The purpose of the experiment is to demonstrate the effectiveness of the probabilistic model for error detection when historical data and inter-field dependency relationships are taken into consideration.

We obtained the data from a health clinic affiliated with Drexel University in Philadelphia, PA, USA. All the data are de-identified for HIPAA (Health Insurance Portability and Accountability Act) compliance. We only obtained records with six fields, without any additional clinical information.

A. Data Sets

We selected six fields from the databases: Age, Sex, Diastolic Blood Pressure (BP_D), Systolic Blood Pressure (BP_S), Weight, and Body Mass Index (BMI). We extracted 1,448 patients with 12,199 records covering these six fields from the EMR system. Because our dataset is limited and the patients' information is insufficient, we cannot guarantee that all patients have equal clinical conditions and treatments in this small-scale preliminary study. Only Age and Sex are used as basic fields in the classification process to find a similar patient group for the current patient. Based on these two basic attributes, patients are divided into 14 categories, and the numbers of



patients for each category are shown in Table II. The average number of historical records per patient is 8.4. We examine our data entry error detecting method on each group.

TABLE II. 14 GROUPS OF SIMILAR PATIENTS

Group   Age       Sex   Patients/Records
1       [20, 30)  M     30/153
2       [20, 30)  F     63/567
3       [30, 40)  M     68/328
4       [30, 40)  F     132/973
5       [40, 50)  M     122/643
6       [40, 50)  F     188/1967
7       [50, 60)  M     191/1513
8       [50, 60)  F     270/2552
9       [60, 70)  M     103/850
10      [60, 70)  F     181/1592
11      [70, 80)  M     20/195
12      [70, 80)  F     55/658
13      [80, 90)  M     4/36
14      [80, 90)  F     21/172

a. "M" stands for "Male", and "F" stands for "Female"
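The grouping in Table II amounts to bucketing patients by age decade and sex. A hypothetical helper (our own illustration, not code from the study) might look like:

```python
def group_key(age, sex):
    """Map a patient to a (decade lower bound, sex) bucket matching Table II,
    e.g. age 45, sex 'F' falls into the [40, 50) female group.
    Ages outside [20, 90) are not covered by Table II."""
    decade = (age // 10) * 10  # integer decade lower bound
    return (decade, sex)
```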

BP_D, BP_S, Weight, and BMI are transformed into discrete values and used as the clinical fields to build a Bayesian network that captures the inter-field dependency relationships among them. In our experiment, we built a unified Bayesian network using all 12,199 records. The Max-Min Hill-Climbing (MMHC) method was used to generate the Bayesian network structure shown in Fig. 6.
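Transforming continuous readings such as BP_D into discrete values can be done with simple bin edges; the edges below are illustrative only, not the cut points used in the experiment:

```python
def discretize(value, edges):
    """Return the index of the half-open bin [edge_{i-1}, edge_i) containing value;
    values at or above the last edge fall into the final bin."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

# Illustrative diastolic blood pressure bins (mmHg): <60, [60, 80), [80, 100), >=100
bp_d_bin = discretize(75, [60, 80, 100])
```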

Fig. 6. Learned Bayesian Network

Fig. 7. Predictive Accuracy and Number of Records of the 14 Groups

The variable "BP_D" depends on the variable "BP_S". The joint probability between "BP_D" and "BP_S" can be calculated from the CPT (Conditional Probability Table) of the Bayesian network.

B. Experimental Results

Based on our dataset, we used our method to predict the value of the variable "BP_D". The predictive power of our method is measured in individual 3-fold cross-validation experiments for each group. The average predictive accuracy and number of records for each group are shown in Fig. 7. The accuracy is calculated as the number of correctly predicted instances divided by the total number of instances. The parameters are set to $\alpha_1 = \alpha_2 = \alpha_3 = 1/3$, $\beta = 3/4$, and $\gamma = 1/4$.
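The 3-fold accuracy measurement can be sketched as follows; `predict` is a placeholder for the model of Equation (5) trained on each fold's training records:

```python
def k_fold_accuracy(records, predict, k=3):
    """Average held-out accuracy over k folds: the fraction of test entries (x, y)
    for which predict(train, x) equals the recorded value y."""
    folds = [records[i::k] for i in range(k)]
    correct, total = 0, 0
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        for x, y in folds[i]:
            correct += int(predict(train, x) == y)
            total += 1
    return correct / total
```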

Fig. 8. Comparative Performance of Usher and Our Method

Fig. 9. Accuracy Comparison for Different Parameter Sets

The average predictive accuracy over all groups is 71%. The predictive power is stable across the groups, ranging from 70% to 85%, except for groups 1, 13, and 14; their numbers of records are lower than 200, so their predictive accuracy is relatively lower.

We implemented Usher's error model [17] on our dataset. Usher induces the probability of making an error using an explicit error model. The comparative results of Usher and our method are shown in Fig. 8. Usher's average accuracy is 63.2%, while our method's is 71%. Except for groups 1 and 13, in which the numbers of records are too small, our method outperforms Usher.

Since the number of records and the accuracy of group 8 are relatively high, we tested 9 different sets of parameters on the dataset of group 8. The parameter sets and the corresponding accuracy results are shown in Table III and Fig. 9.

TABLE III. ACCURACY RESULTS FOR DIFFERENT PARAMETER SETS

Set   α1    α2    α3    β     γ     Accuracy
1     0     0     1     1     0     0.736813
2     1/3   1/3   1/3   1     0     0.748621
3     0     0     0     0     1     0.757969
4     1/3   1/3   1/3   0.75  0.25  0.754706
5     1/3   1/3   1/3   0.5   0.5   0.75221
6     0.25  0.25  0.5   0.5   0.5   0.780027
7     0.2   0.2   0.6   0.5   0.5   0.783213
8     0.1   0.1   0.8   0.6   0.4   0.782445
9     0     0     1     0.5   0.5   0.791291

As shown in Table III, Sets 1 and 2 have $\beta = 1$ and $\gamma = 0$, which means the prediction depends only on the past historical values of the current field. Set 3 has $\beta = 0$ and $\gamma = 1$, which means the prediction depends only on the other fields and not on the historical values of the current field; that is, the result of Set 3 is that of using the Bayesian network alone. The accuracy results of Sets 1, 2, and 3 are relatively low. In summary, methods that depend purely on the past historical values of the current field, or purely on the dependency relationships among multiple fields learned by the Bayesian network, are not as accurate as the method combining these two aspects.

In Fig. 9, Sets 6, 7, 8, and 9 have relatively higher accuracy. In these four sets, the parameters $\beta$ and $\alpha_3$ have higher weights, which means that predictions combining the two previous historical values of the current field with the dependency relationships among multiple fields improve the accuracy. Set 9 has the highest accuracy, but in practice it is not recommended, because for some small datasets the probability term weighted by $\alpha_3$ might equal 0. The recommended parameter sets are Sets 6, 7, and 8.

We also conducted an experiment comparing the performance of our model with that of Hampel X84 [11]. With our method, the "Normal" range for a patient's "BP_D" value can dynamically narrow to [70, 90], as shown in Fig. 10 a), whereas Hampel X84 marks the broader range [45.3, 114.7] as "Normal" for the whole population, as shown in Fig. 10 b). The narrower range indicates that our method is more sensitive for error detection.


2013 IEEE International Conference on Bioinformatics and ...

Fig. 10. Our method (a) compared with Hampel X84 (b)

VI. DISCUSSION AND FUTURE WORK

This paper proposes a five-step error detecting and tagging framework for reducing data entry errors that pose significant challenges to existing statistical methods. Our probabilistic model for error detecting is based on the assumption that the input value for a field can be predicted from the field's historical records and from dependency relationships among fields.

We conducted pilot experiments to evaluate our method. The results show that our probabilistic model achieves better accuracy than a Bayesian network alone, and also outperforms a method based on the Markov assumption that the input value for a field depends only on the historical records of that single field. We also compared our method with Usher's error model [17], and the results show that our method performs better. Moreover, our method produces more precise results than the Hampel X84 method [11].

Our future research work will focus on following aspects:

1) Currently, our tagging method returns only two kinds of tags, "Normal" and "Suspicious". We would like to extend our model to detect the multiple error types shown in Table I.

2) Our error detecting method focuses only on discrete input values. We plan to improve the method to handle continuous input values.

3) We plan to conduct a more thorough experimental evaluation by collecting more patient data with more attributes in the future.

ACKNOWLEDGMENT

This work is supported in part by a Drexel Jumpstart grant on Health Informatics and by NSF grant IIP 1160960 for the Center for Visual and Decision Informatics (CVDI).

REFERENCES

[1] Dasu, T. and T. Johnson, Exploratory data mining and data cleaning. Vol. 479. 2003: Wiley-Interscience.

[2] Batini, C. and M. Scannapieca, Data quality. 2006: Springer.

[3] Chan, P.K., et al., Distributed data mining in credit card fraud detection. IEEE Intelligent Systems and their Applications, 1999. 14(6): p. 67-74.

[4] Lazarevic, A., et al., A comparative study of anomaly detection schemes in network intrusion detection. Proc. SIAM, 2003.

[5] Rahm, E. and H.H. Do, Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 2000. 23(4): p. 3-13.

[6] Chandola, V., A. Banerjee, and V. Kumar, Anomaly Detection: A Survey. Acm Computing Surveys, 2009. 41(3).

[7] Arts, D.G., N.F. De Keizer, and G.-J. Scheffer, Defining and improving data quality in medical registries: a literature review, case study, and generic framework. Journal of the American Medical Informatics Association, 2002. 9(6): p. 600-611.

[8] Beretta, L., et al., Improving the quality of data entry in a low-budget head injury database. Acta Neurochirurgica, 2007. 149(9): p. 903-909.

[9] Phillips, W. and Y. Gong, Developing a nomenclature for emr errors, in Human-Computer Interaction. Interacting in Various Application Domains. 2009, Springer. p. 587-596.

[10] Khare, R., et al., Understanding the EMR error control practices among gynecologic physicians. iConference 2013 Proceedings, p. 289-301.

[11] Hellerstein, J.M., Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.

[12] Nahm, M.L., C.F. Pieper, and M.M. Cunningham, Quantifying Data Quality for Clinical Trials Using Electronic Data Capture. Plos One, 2008. 3(8).

[13] Goldberg, S.I., A. Niemierko, and A. Turchin, Analysis of data errors in clinical research databases. AMIA Annu Symp Proc, 2008: p. 242-6.

[14] Jacobs, B., Electronic medical record, error detection, and error reduction: a pediatric critical care perspective. Pediatr Crit Care Med, 2007. 8(2 Suppl): p. S17-20.

[15] Goldberg, S.I., et al., A Weighty Problem: Identification, Characteristics and Risk Factors for Errors in EMR Data. AMIA Annu Symp Proc, 2010. 2010: p. 251-5.

[16] Wong, W., et al. Bayesian network anomaly pattern detection for disease outbreaks. in Proceedings of the International Conference on Machine Learning. 2003.

[17] Chen, K., et al., Usher: Improving data quality with dynamic forms. Knowledge and Data Engineering, IEEE Transactions on, 2011. 23(8): p. 1138-1153.

[18] Chen, K., J.M. Hellerstein, and T.S. Parikh. Designing adaptive feedback for improving data entry accuracy. in Proceedings of the 23rd annual ACM symposium on User interface software and technology. 2010. ACM.

[19] Tsamardinos, I., L.E. Brown, and C.F. Aliferis, The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning, 2006. 65(1): p. 31-78.

[20] Scutari, M., Learning Bayesian networks with the bnlearn R package. arXiv preprint arXiv:0908.3817, 2009.

[21] Højsgaard, S., Graphical Independence Networks with the gRain package for R. Journal of Statistical Software, 2012.

[22] Arampatzis, A., et al., KUN on the TREC-9 Filtering Track: Incrementality, decay, and threshold optimization for adaptive filtering systems. 2000.

[23] Zhang, Y. and J. Callan. Maximum likelihood estimation for filtering thresholds. in Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. 2001. ACM.

[24] Collins, M., Language Modelling (lecture notes), http://www.cs.columbia.edu/~mcollins/, 2013.
