A rule-based transformation system for converting … · A rule-based transformation system for converting semi-structured medical documents ... cost-efficient handling and processing

Health Technol. (2013) 3:51–63DOI 10.1007/s12553-013-0040-0

ORIGINAL PAPER

A rule-based transformation system for convertingsemi-structured medical documents

Johannes Heurix · Antonio Rella · Stefan Fenz ·Thomas Neubauer

Received: 25 October 2012 / Accepted: 29 January 2013 / Published online: 2 March 2013© IUPESM and Springer-Verlag Berlin Heidelberg 2013

Abstract The electronic health record (EHR) enabled thecost-efficient handling and processing of health informa-tion. By storing medical documents in digitized form, theEHR ensures the fast and reliable availability of health infor-mation. However, as not all health care providers use thesame EHR format and health document structures, the effi-ciency of the data sharing process is severely reduced. Inthis paper we present a rule-based transformation systemthat converts semi-structured (annotated) text into standard-ized formats, such as HL7 CDA. Our approach identifiesrelevant information in the input document by analyzing itsstructure as well as its content and inserts the required ele-ments into corresponding reusable CDA templates, wherethe templates are selected according to the CDA documenttype-specific requirements. The research results enable theefficient transformation of various EHR structures into astandardized format and therefore contribute to increasingthe data sharing efficiency across health care providers.

J. Heurix (�)SBA Research, Favoritenstrasse 16, 1040 Wien, Austriae-mail: [email protected]

A. RellaXiTrust Secure Technologies, Grazbachgasse 67,8010 Graz, Austriae-mail: [email protected]

S. Fenz · T. NeubauerInstitute of Software Technology and Interactive Systems,Vienna University of Technology,Favoritenstrasse 9-11, 1040 Wien, Austria

S. Fenze-mail: [email protected]

T. Neubauere-mail: [email protected]

Keywords Information extraction · Text transformation ·Clinical document architecture · Semi-structured text

1 Introduction

Health care is no longer solely tasked with simply curingsick people. The increase in knowledge, advancements inmedical techniques and the overall quality of diagnosticshave led to a massive increase in information process-ing and storage demands. Managing the vast amounts ofinformation produced nowadays is one of the main chal-lenges of today’s health care. As in virtually all domainsthese days, information and communication technologies(ICTs) have found its way into health care to managethe ever growing quantities of data. E-health denotes theapplication of ICTs to support medical workflows and toimprove the communication between health care providers.Interconnected systems, such as electronic health records(EHR), are aimed to enhance the data processing facil-ities while keeping the costs at a controlled level [2].Initiatives like ‘Integrating the Healthcare Enterprises’ [13]have published communication standards to harmonize dataexchanging and sharing mechanisms to ultimately improvepatient care.

However, truth is that although specialized medicaldata processing systems, especially image processing sys-tems, are highly advanced and effective, organizational andadministrative systems lack well behind their potential capa-bilities. Today’s ICT environment in medical facilities ischaracterized by numerous legacy systems and isolatedapplications including hospital information systems. Tex-tual information is stored in different data formats, whichcomplicates data sharing between different systems. Fur-thermore, many digitized documents, such as discharge

mailto:[email protected]




52 Health Technol. (2013) 3:51–63

letters, clinical notes, or medical histories, are in narrativeform making automated data processing much more diffi-cult. The XML-based Health Level 7 Clinical DocumentsArchitecture (HL7 CDA [12]) provides a common standardfor representing medical information in a structured way.CDA documents separate administrative information andthe actual medical content into header and body sections.This separation between administrative and medical sec-tions facilitates privacy-preserving storage and data sharingmechanisms, such as pseudonymization (cf. [17]). The dif-ficulty in creating CDA documents is to extract the relevantinformation from the unstructured documents of differentdata formats of the legacy systems, to convert it, and toinsert it into the corresponding sections of CDA.

This article presents a simple yet effective rule-basedtransformation system to automatically convert semi-structured narrative medical texts (i.e., annotated inputstrings) into CDA-conforming documents by identifyingrelevant information and inserting it into composable CDAtemplates. Both transformation rules and CDA templatesare structured in XML and are separated from each otherin order to support different document type-specific setsof rules to be applied to the same templates, dependingon the input string’s semantic structure. The rules’ syntaxallows non-technical domain experts with minimum XMLknowledge to design complex extraction conditions, whilethe separation of the rules from the CDA templates alsofacilitates reusability. Due to its modularity, the system canbe combined with different annotation engines, such as theGATE framework (cf. [4]).

2 Background

Information extraction (IE) is a sub-domain of the naturallanguage processing (NLP) discipline and is tasked withthe automated extraction of structured information fromunstructured sources [19]. IE is closely related to infor-mation retrieval (IR), but IR’s goal is to return documentsas a result of a search whereas IE returns informationor facts [16]. A typical IE system has two objectives: (i)identifying and annotating potentially relevant informationin the narrative input text and (ii) the actual extractionand transformation of the desired information into the tar-get form. The annotation task usually involves multiplepre-processing steps including tokenization, part-of-speechtagging, or parsers for boundary and named-entity recog-nition which are executed in a pipeline where the outputof the former step is used as input for the next step toimprove the quality of the overall result. Existing worklargely does not distinguish entity recognition or annota-tion and actual information extraction. Text processing stepsare composed into adaptable annotation frameworks, such

as GATE [4], C-PANKOW [3] or UIMA [5] which providetheir different processing functionality such as tokeniza-tion or whitespace identification through special plug-ins.IE systems can be categorized into two fundamentally dif-ferent types [1]: (i) the Knowledge Engineering Approachwith hand-made rules written by domain experts to iden-tify and correctly mark the relevant information entities,and (ii) the Automatic Training Approach where the systemcreates the rules itself by analyzing manually pre-labeled(annotated) training corpora. There has been an ongo-ing debate on which of these approaches performs better[1, 19]: Knowledge engineering-based systems tend to pro-duce good results very fast, especially when training dataare tedious and costly to acquire. They also excel when theannotation and extraction specifications are likely to changein a foreseeable way (e.g., when the layout of documentsare updated). The big downside of hand-crafted rules is theactual work to define them. Creating accurate rules oftenrelies on an iterative testing and adapting process. Training-based systems relieve the domain expert from this manualrule-creation work, which means that no (potentially) com-plex rule formalisms need to be learned. But in essence,automatic training-based systems shift the workload fromcreating the rules to annotating the training corpora. To pro-duce reasonably good results, these systems require a largenumber of manually annotated documents. Thus, to selectthe adequate system for a particular IE problem, it highlydepends on the availability of training data and their struc-ture, the technical expertise of domain experts, as well asthe structural and lexical properties of the document base.

IE systems have been applied in a multitude of applica-tions including enterprise applications, scientific purposes,or web-based systems [19]. Although originally developedoutside the biomedical and clinical area, information extrac-tion has since been adapted to the medical domain as well.One of the first systems developed and successfully usedwas the often cited MedLEE system [7, 8], an NLP sys-tem to identify radiology information in narrative text tobe mapped to a structured representation containing clini-cal terms. Some other work dealing with radiology reportsinclude [11] or more recently [6]. Information extractionmethods were also applied in the clinical domain in othercontexts including discharge summaries, e.g., [20], [14], and[25], either for extracting generic information or for specialpurpose such as extracting ICD codes [26]. Another popu-lar application area of information extraction in the clinicalcontext involves the automated de-identification of medicaldocuments for secondary use. Exemplary work include theidentification of names only [21], systems based on GATE[10] or MedLEE [15], or systems specifically tailored for aparticular language, such as French [9] or Swedish [23].

Unlike these approaches, this work focuses on the auto-mated transformation of medical documents into the HL7

Health Technol. (2013) 3:51–63 53

(Version 3) Clinical Document Architecture [12] format,although the system can be easily adapted to transforminformation to any XML-based document format. The trans-formation process includes the automated extraction ofrelevant text passages and entities and the insertion intoCDA-conforming XML documents. CDA documents areorganized into header sections, containing administrativedata, such as the patient’s name and address encapsulatedinto mandatory XML blocks, and more flexible body sec-tions with the actual medical data. Depending on the CDAlevel (1 to 3), this content is organized in raising gran-ularity: While level 1 contains largely free narrative textblocks only, the narrative blocks are extended with speciallyencoded machine-readable sections in level 3 documents.Codes are taken from established standards in the clinicaldomain, such as the Logical Observation Identifiers Namesand Codes [18] and International Statistical Classificationof Diseases and Related Health Problems (ICD Version 10[24]). Considering a typical discharge letter, we identifiedthe following requirements for a document transformationsystem (examples in parenthesis):

– Identification and extraction of single named-entities(names, locations).

– Distinction between different instances of the sameentity class (patient name vs. health professional name).

– Composition of entities (social security number withbirth date).

– Limited sections of text blocks (ICD codes).– Complete text blocks (diagnosis).– Multiple successive text blocks (medical history

extending over multiple paragraphs).

3 Transformation of semi-structured textualinformation

In this paper, we present a rule-based transformationmethodology for converting medical information extractedfrom semi-structured texts into CDA-conforming docu-ments with the following characteristics:

– Using a rule-based approach to exploit the similarityof medical documents (e.g., discharge letters almostalways contain sections including diagnosis, reasonfor visit, procedures, etc. separated into logical units,such as paragraphs) making writing rules easierand, thus, achieving results faster without a largetraining set.

– Separating annotation from actual transformation tobenefit from different annotation engines, such asGATE, C-PANKOW, or UIMA.

– Separating transformation rules from target documenttemplates to facilitate template reuse when adapting

rules to a different document set (e.g., different layoutsof medical records depending on the creation date).

– Designing the transformation rules’ syntax and seman-tics to allow non-technical domain specialists the devel-opment of sophisticated rules covering all requirementswithout learning complex formalisms (e.g., XSLT com-bined with complex XPath expressions).

– Using both structural and content-related knowledge toformulate the rules.

The transformation methodology relies on annotationengines to provide annotated semi-structured input strings.As a precondition, the string needs to be (i) partitionedinto semantically-contained text blocks, such as para-graphs (section boundaries identification) and (ii) annotatedwith pre-specified entity classes including names or dates(named-entity recognition). Our prototype accepts inputstrings structured as XML documents where the text blocksare surrounded with <p> tags. Each text block in turncan have multiple named-entities, again annotated withXML tags (e.g., <a firstname>1). The document templatesare organized into logical CDA sections (recordTarget,author, diagnosis, etc.) and contain the static text body (pre-defined) and references where the extracted informationshould be inserted into. Templates are selected and com-posed depending on the document types (discharge letter,lab results, etc.). Each of the templates is assigned one ormore rules which are expressed in XML as well and con-tain sub-rules for each entity-of-interest in each template.Multiple rules for a single template account for differentdocument subtypes (e.g., layout changes of discharge letterin the course of time).

The transformation process is sketched in Fig. 1:

1. Identification of the document (sub)type and selectionof the appropriate rules and templates.

2. For each template:

(a) Identification of the region-of-interest (one ormore paragraphs) using region structure (com-bination of annotations) and/or content (regularexpression) information as decision tools.

(b) Extraction of the actual content-of-interest fromthe region-of-interest using either region struc-ture or content information as filters.

(c) Insertion of the content-of-interest into the cor-responding sections in the template.

(d) Continuation with the next template until alltemplates and corresponding rules are processed.

3. Composition of the templates into a full CDAdocument.

1‘a’ for annotation to distinguish these tags from any pre-existing tagsin the source input string

54 Health Technol. (2013) 3:51–63

Fig. 1 Transformation process

In the following, templates and rules, their semantics, andtext processing effects are described in detail. As the syntaxis XML-based, both templates and rules must conform toXML schema definitions. For better overview, the templateand rule structures are represented as figures with boxes asXML elements (nodes), where element attributes are shownbelow and element text content on the right. Parent/childrelationships of the nodes are expressed by connecting lineswith cardinality indicators, extended with sequence/choicebars if applicable.

3.1 Templates

A template represents a building block (e.g., diagnostic sec-tion) that needs to be filled with the extracted informationand combined with other templates to form the completeCDA document. Each <template> (see Fig. 2) has a unique‘ID’ and contains the <itemList> and <content> sections.While <content> contains the static text building blockswith empty sections that need to be filled with informationextracted from the input string, <itemList> contains a set

Fig. 2 Template structure

Health Technol. (2013) 3:51–63 55

of <itemPath> elements corresponding to each empty fieldin the <content> section. An <itemPath> definition hasan ID, optional prefixes and suffixes (i.e., static text that isadded to the extracted information like area codes), and anindicator whether the particular field is optional or not (e.g.,patient names are mandatory while telecom values are not).The XPath expression specifies the field’s location in the<content> section.

3.2 Rules

A rule encodes the information of how to identify (region-of-interest) and extract the relevant entities (content-of-interest) from the source input string. Each <rule> (seeFig. 3) must correspond to a particular template indicated bythe ‘targetTemplate’ attribute. The other attribute defines adocument subtype for which the rule is applicable (e.g., dis-charge letters from different wards within a hospital). Eachrule contains one or more <region> nodes representingthe regions-of-interest (i.e., paragraphs) where the <con-ditions> section denotes the structural and content-relatedconditions for finding the correct region and <contentMap-ping> the actual content that needs to be extracted from theregion(s) and inserted into the corresponding parts of theCDA templates.

The region’s ‘nodeType’ attribute defines different waysof how to find the particular nodes, (i.e., paragraphs in theinput string) of the regions-of-interest. Depending on ‘node-Type’, a different set of child <(begin/end)nodeCondition>

elements is necessary in the <conditions> section (seeFig. 4). Types, required children in the <conditions> sec-tion, and purposes are as follows:

– The ‘single’ type states that the region consists of asingle paragraph only and, thus, requires only a single

<nodeCondition> element. It is used to get a nodewhere its composition is known relatively well.

– The ‘singleSelected’ type states that the region consistsof multiple paragraphs that do not necessarily occur inimmediate succession in the source input string. Only asingle node per node condition is returned. It requirestwo or more <nodeCondition> elements and is used toget multiple nodes where each node condition refers toa single node and the nodes’ compositions are knownrelatively well.

– The ‘multiSelected’ type states that the region con-sists of all nodes matching any node condition andrequires one or more <nodeCondition> elements. It isused to get multiple nodes where the nodes may con-tain any known regular expression keywords and/or tags(see below).

– The ‘multiContinuous’ type states that the region con-tains all nodes between a beginning node, <beginNode-Condition>, and an ending node, <endNodeCondi-tion>. The ‘conditionType’ attribute determines wheth-er the begin/end nodes are included in the region ornot. The ‘multiContinuous’ type is used to get all nodeswithin two known boundaries.

The <nodeCondition> element contains either a <tar-getRegexList> or a <targetTags> element, or both. Theformer describes a condition on the paragraph’s content,while the latter states the set of required XML tags, i.e.,annotations within the paragraph. The <targetRegexList>contains one or more <targetRegex> elements with theactual regular expression as text value. The attributes ‘link-ingType’ and ‘conditionType’ indicate how the individualregular expressions are linked (and/or) and whether the par-ticular regular expression must or must not yield a match

Fig. 3 Rule structure

56 Health Technol. (2013) 3:51–63

Fig. 4 Rule - conditionsstructure

in the paragraph’s text in order to be part of the region-of-interest (include/exclude). Similarly, <targetTags> con-tains one or more annotations defined as <targetElement>and tag name as text values which have to be occur-ring either in a particular sequence or in any sequence(tagType). The ‘tagComplete’ attribute defines if the para-graph must not contain any additional annotation otherthan those of <targetElement>, while ‘occurrence’ statesthe desired occurrence of the annotation combination (e.g.,occurrence = 2 matches the second paragraph with theparticular annotation combination).

Apart from the ‘conditionType’ as attributes (explainedfurther above), <beginNodeCondition> and <endNode-Condition> are composed just like the <nodeCondition>

elements (Fig. 5) .While <conditions> indicate the conditions under which

the paragraphs belong to the region-of-interest of the cur-rent rule, <contentMapping> defines what information toactually extract from the region-of-interest. The sectioncontains one or more <mapping> elements. The attribute‘ID’ indicates where to insert the extracted information intothe templates (matching the <itemPath> ‘ID’ attribute),

while ‘mapType’ defines how to extract the information,defined as follows:

– The ‘singleElement’ type represents a 1:1 mapping ofan annotation tag’s content (determined by tag nameand its ‘occurrence’ within the region-of-interest, e.g.,second occurrence of <a firstName> as the patient’sfirst name) to a particular section within the CDAtemplate. Optionally, the content can be filtered by aregular expression or be composed of the annotationtag’s attributes instead of the text content (indicated byone or more <targetAttribute> elements).

– The ‘composedElement’ type represents a contentwritten into the CDA template composed of multi-ple annotated <targetElement> tags, separated by a‘separator’ (default is an empty string), e.g. the com-position of the Austrian 10-digit social security numberusing the 4-digit version combined with the person’sbirth date. The <targetElement> elements again can befiltered by regular expressions or attribute values.

– The ‘fullContent’ type simply refers to extractingthe whole content of the region-of-interest into the

Health Technol. (2013) 3:51–63 57

Fig. 5 Rule - content mapping structure

corresponding section of the CDA template. It can(optionally) be filtered by a single <targetRegex>.An example for ‘fullContent’ is to copy the com-plete narrative paragraph of a diagnosis to the CDA

template, while it may be filtered to extract only theICD codes.

– The ‘multiContent’ type refers to multiple elementswithin the region-of-interest and inserted into the CDA

58 Health Technol. (2013) 3:51–63

template separated by a default grouping tag,2 unlessotherwise specified. The content can be filtered eitherby a <targetRegexList> or by <targetTags>. The<targetRegexList> contains a set of one or more<targetRegex> elements with attributes ‘encounter-Type’ and ‘groupingTag’. The former attribute distin-guishes between extracting either only the first regu-lar expression-matching string or all matching stringswithin the region-of-interest, while the optional lat-ter attribute overrides the default grouping tag with analternative one for each individual regular expression.<targetTags> contains the set of <targetElement> tagswhose contents are to be extracted. The ‘includeTag’attribute indicates whether the elements’ original tagsshould be copied to the template too (in this case,the default grouping tag is replaced with the originalone). The optional attribute ‘renameTagTo’ allows toindividually rename the tag to an arbitrary one for each<targetElement>. Both <targetTags> and <targe-tRegexList> have their uses, e.g., to extract ICD codesfrom a diagnostic text spanning over multiple para-graphs, depending on how these are annotated by theannotation engine: If they are already annotated, <tar-getTags> can be used to extract them with potentiallyrenaming the annotation tags with arbitrary ones match-ing the CDA standard. If the annotation engine is notable to annotate them, <targetRegexList> may containthe regular expressions to extract them (in this case withencounter Type = all), again with an optional renamedgrouping tag.

All mapping sections can have an additional and optional<replacement> section which contain <valuePair> ele-ments. These refer to text elements (‘source’) that shouldbe replaced with another text (‘value’), e.g., if the word‘female’ is encountered within a <gender> tag, it shouldbe replaced with ‘F’ before being inserted into the patienttemplate of the CDA header section.

3.3 Algorithm

The core of the transformation process is the algorithmto identify the regions-of-interest, followed by the extrac-tion of the content-of-interest. For the sake of clarity, wefocus here on the condition and mapping types. For detailson handling <targetTags> and <targetRegex> elements,please refer to the previous section. The overall procedureis depicted in algorithm 1.

The process consists of identifying and loading the cor-rect rules and templates (lines 1–3), iterating all rules andapplying the conditions and content mappings (lines 4–10),

2As this is only used in CDA body templates, we simply use <para-graph> here.

composing the filled templates to a full CDA document andvalidating the document for missing sections (lines 11–17).

A more in depth view of the ‘Apply Conditions’ oper-ation is depicted in algorithm 2. It distinguishes betweendifferent node types (cf. Section 3.2): ‘single’ looksfor the first paragraph (i.e., section in the source doc-ument enclosed in <p> tags) which matches the tagand/or regex condition (lines 1–8); ‘singleSelected’ returnsall matching paragraphs per each node tag/regex condi-tion (lines 9–16); ‘multiSelected’ in turn yields all para-graphs that match any tag/regex conditions within thecurrent rule (lines 17–24), while ‘multiContinous’ firstsearches for the begin node, then for the end node (inthe remaining paragraphs) by their corresponding tag/regexconditions and marks all paragraphs between them (option-ally including them) as region-of-interest (lines 25–38).

How the content mappings are applied is shown inalgorithm 3. Depending on the ‘mapType’ the content-of-interest is extracted differently from the region-of-interest:‘single’ simply extracts the tag content (possibly filtered byregex, lines 1–5), while ‘composed’ combines multiple tagcontents to a single element in the order defined by the map-ping conditions’ order (lines 6–10); ‘fullContent’ in turnsimply selects the whole content of the region-of-interest(possibly filtered) as content-of-interest (lines 11–14), andfinally ‘multiContent’ allows combining multiple sectionsin the region-of-interest, possibly filtered either by tagsor regex, with predefined grouping tags to a single CDAsection entry.

Health Technol. (2013) 3:51–63 59

4 Example

In this section, a practical example of a transformationprocess is given where a patient’s name and address areextracted from a digitized and annotated medical dis-charge letter (source input string). The input string isformatted as XML document with paragraphs representedas enclosing <p> tags. Furthermore, the source doc-ument is also extended with pre-specified tags includ-ing <a title> for academic title, <a firstName> for first

60 Health Technol. (2013) 3:51–63

Fig. 6 Example source document (excerpt)

Fig. 7 CDA header section: patientRole

Health Technol. (2013) 3:51–63 61

names, <a lastName> for last names, <a date> for dates,and so on, by a dedicated annotation engine. An excerptof the source document is shown in Fig. 6. Note that theannotation engine only adds annotations while leaving thedocument structure as it is (line breaks, spaces, etc.).

The target template, i.e., the ‘recordTarget’ header sec-tion of the CDA document, is shown in Fig. 7 and consistsof the static segment specified by the CDA document stan-dard with empty attribute/element sections (lower part) andthe corresponding <itemPath> elements with the identifiersand XPath expressions (upper section). For example, theitem ‘nameGiven’ has to be inserted into the empty element‘/recordTarget/patientRole/patient/name/given’.

The rule, or more specifically, the particular regionthat is responsible for the patient’s name and address isdepicted in Fig. 8: The <region> is of ‘multiContinuous’type where the <conditions> section contains the specifi-cations for the nodes beginning and ending the region-of-interest. For the beginning node, the particular paragraphhas to include either the text segment ‘born on’ or ‘dateof birth’, as well as contain both tags <a lastName> and<a date>. The attribute ‘tagComplete’ = ‘false’ denotesthat other tags are allowed within this paragraph. Return-ing to the source document (Fig. 6), the second paragraphmatches this condition, as the first misses out the two regexand does not include the <a date> tag either. Due to theircommon occurrence in medical documents, especially theuse of keywords such as ‘born on’ in combination with tagsis useful in solving the name disambiguity problem (e.g.patient’s name vs physician’s name) and distinguishing oth-erwise similar sections. The second condition in the rule forthe ending node consists of the single regex keyword ‘Diag-nosis:’ which matches the fourth paragraph in the sourcedocument, thus the region-of-interest includes the secondand third paragraphs (‘conditionType’ = ‘exclude’ preventsthe inclusion of the fourth paragraph in this list).

The actual content mapping involves only a simpleextraction of tag contents (‘mapType’ = ‘singleElement’)for the first/last names and address components without anyregex filtering and insertion into the template.

5 Case study and extensions

We have developed a prototype written in Java as part ofa system to automatically convert archived and digitizedpaper-based discharge summaries into the CDA documentstandard to be used as research knowledge base. Apartfrom the transformation module, the system consists ofa Tesseract-based OCR (Optical Character Recognition)module3 for converting scanned images of the discharge

3http://code.google.com/p/tesseract-ocr/

Fig. 8 A region section of a rule matching the ‘recordTarget’ template

summaries to machine-readable text strings and an anno-tation module based on GATE4 to extend the simple textstring with annotations of PHI elements, i.e., PersonalHealth Information including names, addresses, and otherpatient-identifying information [22].

For the sake of this evaluation, we focus on examiningthe following CDA document segments:

– Header (normalized sections): recordTarget (patient’spersonal information), author (usually the treatingphysician), informationRecipient (physician receivingthe discharge summary).

– Body: reason for visit (incidents, problems), diagno-sis (results and outcome), procedures (medications andtreatments), plan of care (recommended medications).

We investigated 100 documents from two different wards(internal medicine and surgery) of the same hospital.Although from the same hospital, the layouts of the doc-uments differed in multiple aspects. This resulted in the

4http://gate.ac.uk/

http://code.google.com/p/tesseract-ocr/

http://gate.ac.uk/

62 Health Technol. (2013) 3:51–63

need for different rule sets for each layout. These rule setsonly required localized changes to factor in, e.g., modifiedpatient blocks. Thus, the majority of the rule elements couldbe reused.

Table 1 illustrates the evaluation results. As the mainperformance indicator of the transformation process is howwell the regions-of-interest are identified, we opted to notevaluate the f-measure for all entity classes (first name,last name, house number, etc.) individually. Instead, wefocused on tracking the number of incomplete or erroneousCDA document blocks (segments), i.e., if a particular blockmissed a single entity, it was regarded as erroneous dueto incorrectly identified regions-of-interest. The evaluationshowed a significantly higher percentage of total errors inthe CDA header blocks (24 %, 17 %, 34 %) than in theCDA body blocks (2 %, 3 %, 9 %, 14 %), although at adifferent ‘granularity’. Usually body blocks were filled bysimply copying whole paragraphs (regions-of-interest) intothe corresponding sections, thus, errors resulted in emptytemplates. CDA header sections, however, required theextraction of particular named-entities only, which meansthat the logged errors were the result of a single or twomissing elements, such as a last name or street name in anaddress block.

An in-depth analysis of the errors revealed the causes: Ascan be seen in Table 1, we classified the errors into OCR(errors due to incorrectly identified characters and thus mis-spelled words or incorrectly identified text blocks), anno-tation (incorrectly annotated or missing named-entities),and actual errors due to limitations of the transformationsystem. Furthermore, some elements were simply miss-ing in the source documents (informationRecipient, plan ofcare). The predominant OCR errors in the recordTarget andauthor sections were the result of misspelled first/last names(interfering written signatures and greyed background boxesin the source document), which were in turn not anno-tated and, thus, not identified by the transformation system.While the CDA header rules were largely composed of tag-specific conditions, the CDA body blocks relied on static

Table 1 Evaluation results

Total OCR Annot. Transf. Not in doc.

recordTarget 24 12 11 1 0

author 17 12 4 1 0

informationRecipient 34 1 10 1 22

reason 2 2 0 0 0

diagnosis 3 3 0 0 0

procedures 9 3 0 6 0

plan 14 1 0 3 10

known keywords (e.g., ‘Current Status’), which facilitatedthe correct identification of those regions-of-interest.

The analysis of the remaining errors in the CDAbody sections revealed the limitations of the transformationsystem: The nine missing entries for procedures and planoccurred due to an incomplete keyword set. Other problemsoccurred when handling unexpected changes, such as key-words (‘to be sent to’) left out in the source document orstructural variations (document author cited at the begin-ning of the document instead of at the end). In general, thebetter the quality of the OCR and annotation results, thebetter is the performance of the transformation system. Theoption to use either or both the source document’s struc-ture (annotation tags) and content (regular expression) toidentify the intended regions-of-interest allows to adapt thetransformation system to different annotation engines withvarying annotation quality (e.g., compensating for missingannotation classes).

Realizing these limitations, especially the informa-tion missing in the source documents in the first place,we extended the transformation system to make use ofadditional information sources. This also included the cre-ation of explicit rule sections for these additional sources,including:

– <documentSource>: original annotated input string assource

– <metaSource>: health document metadata provided bythe document archive system

– <externalSource>: external information source such asexternal databases, hospital information systems, etc.

– <fixedSource>: static information valid for all docu-ments of a particular type

The structure of the three new sections is very similarto that of the <documentSource> with the exception thatno regions exist but only mapping elements. Health meta-data documents were persisted along with the paper-basedhealth documents in the archive and contain administra-tive information such as the patient’s name, social securitynumber, and the treating physician, in a simple key/valuepair structure. These administrative data were particularlyhelpful in completing and verifying the results of named-entities in the header section. The <externalSource> refersto an interface that interacts with information providerssuch as external databases or hospital information systems.Thus the rule elements of the <externalSource> sectionexhibit a (database) query-like structure with query parame-ters. Finally, the <fixedSource> simply refers to some basicinformation that remains static for a specified set of docu-ments such as all discharge letters from a particular hospitalward. In that case, administrative data such as the hospi-tal’s address, the ward’s identifier, and telephone numbersare obviously alike for all documents.

Health Technol. (2013) 3:51–63 63

6 Conclusion

The differences in todays’ health document formats con-siderably reduce the efficiency of data sharing. The HL7CDA provides a standardized framework to represent multi-ple document types including radiology results or dischargesummaries. This work presented a rule-based transforma-tion system to automatically convert semi-structured sourcedocuments into the CDA standard. It is designed to (i)focus on the actual transformation process to be combin-able with different annotation engines, (ii) facilitate the rulecreation process for non-technical domain specialists with-out learning complex formalisms, and (iii) make use ofboth structural and content-related knowledge to formulatethe rules. Further work includes the extension of the rulesystem to allow for more powerful rule conditions, the intro-duction of pre-specified referenceable utility-functions tofurther simplify the rule specifications, and the evaluationwith different annotation engines.

Conflict of interest The authors declare that they have no conflictof interest.

References

1. Appelt DE, Israel DJ. Introduction to information extractiontechnology. A Tutorial Prepared for IJCAI-99; 1999.

2. Chaudry B, Wang J, Wu S, Maglione M, Mojica W, Roth E,et al. Systematic review: impact of health information technologyon quality, efficiency, and costs of medical care. Ann Intern Med2006;144(10):742–52.

3. Cimiano P, Ladwig G, Staab S. Gimme’ the context: context-driven automatic semantic annotation with C-PANKOW. In: Pro-ceedings of the 14th international conference on world wide web.2005. p. 332–41.

4. Cunningham H, Maynard D, Bontcheva K, Tablan V, AswaniN, Roberts I, et al. Developing language processing componentswith GATE version 6 (a User Guide): University of Sheffield,Department of Computer Science; 2011.

5. Ferrucci D, Lally A. Uima: An architectural approach to unstruc-tured information processing in the corporate research environ-ment. Nat Lang Eng 2004;10:327–48.

6. Friedlin J, McDonald CJ. A natural language processing systemto extract and code concepts relating to congestive heart fail-ure from chest radiology reports. In: AMIA annual symposiumproceedings. 2006. p. 269–73.

7. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB.A general natural-language text processor for clinical radiology. JAm Med Inform Assoc 1994;1(2):161–74.

8. Friedman C, Johnson SB, Forman B, Starren J. Architecturalrequirements for a multipurpose natural language processor in the

clinical environment. In: Proceedings of the Annual Symposiumon Computer Application in Medical Care. 1995. p. 347–51.

9. Grouin C, Rosier A, Dameron O, Zeigenbaum P. Testing tac-tics to localize de-identification. In: Medical informatics in Europeconference (MIE’2009). 2009. p. 735–39.

10. Guo Y, Gaizauskas R, Roberts I, Demetriou G, Hepple M. Iden-tifying personal health information using support vector machines.In: i2b2 workshop on challenges in natural language processingfor clinical data; 2006.

11. Haug PJ, Ranum DL, Frederick PR. Computerized extractionof coded findings from free-text radiologic reports. Radiology1990;174(2):543–48.

12. Health Level Seven Inc.: Hl7 version 3. 2007. Online http://www.hl7.org/v3ballot/html/welcome/environment/index.htm.

13. Integrating the Healthcare Enterprise (IHE): IHE IT infrastructure(ITI) technical framework 6.0. Tech. rep. 2009.

14. Long W. Extracting diagnoses from discharge summaries. In:AMIA annual symposium proceedings. 2005. p. 470–4.

15. Morrison F, Li L, Lai A, Hripcsak G. Repurposing the clin-ical record: Can an existing natural language processing systemde-identify clinical notes? J Am Med Inform Assoc. 2009;16:37–39.

16. Mystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extract-ing information from textual documents in the electronic healthrecord: A review of recent research. IMIA Yearb Med Inform.2008;1:128–44.

17. Neubauer T, Heurix J. A methodology for the pseudonymizationof medical data. Int J Med Inform. 2011;80(3):190–204.

18. Regenstrief Institute Inc.: Logical observation identifiers namesand codes (LOINC). 2008.

19. Sarawagi S. Information extraction. Found. Trends Databases.2008;1:261–377.

20. Sibanda T, He T, Szolovits P, Uzuner O. Syntactically-informedsemantic category recognition in discharge summaries. In: AMIAannual symposium proceedings. 2006. p. 714–18.

21. Taira RK, Bui AAT, Kangarloo H. Identification of patient namereferences within medical documents using semantic selectionalrestrictions. In: AMIA annual symposium proceedings. 2002. p.714–18.

22. U.S. Department of Health & Human Services. Office for civilrights: Summary of the HIPAA privacy rule. 2003. Online http://www.hhs.gov/ocr/privacysummary.pdf.

23. Velupillai S, Dalianisa H, Hassela M, Nilsson GH. Develop-ing a standard for de-identifying electronic patient records writtenin Swedish: Precision, recall and f-measure in a manual andcomputerized annotation trial. Int J Med Inform Assoc. 2007;14:564–73.

24. World Health Organization (WHO): international statisticalclassification of diseases and related health problems (ICD). 2007.

25. Zeng QT, Goryachev S, Weiss S, Sordo M, Murphy SN, LazarusR. Extracting principal diagnosis, co-morbidity and smokingstatus for asthma research: evaluation of a natural language pro-cessing system. BMC Med Inform Dec Making. 2006;6(30).

26. Zweigenbaum P, Bachimont B, Bouaud J, Charlet J, BoisvieuxJF. A multilingual architecture for building a normalised con-ceptual representation from medical language. In: Proceedings ofthe annual symposium on computer applications in medical care.1995. p. 357–61.

http://www.hl7.org/v3ballot/html/welcome/environment/index.htm

http://www.hl7.org/v3ballot/html/welcome/environment/index.htm

http://www.hhs.gov/ocr/privacysummary.pdf

http://www.hhs.gov/ocr/privacysummary.pdf