
Fusion: Integrating heterogeneous EHR data from diverse sources.

A. Specific Aims

The Institute of Medicine (IOM) reported in 2000 that as many as 98,000 people die annually as a result of medical errors [16]. Several years later, a more comprehensive study found that the number was more than 195,000 patients dying each year from preventable in-hospital medical errors. This is equivalent to more than one 747 jumbo jet full of patients crashing every day. The Centers for Disease Control and Prevention's annual list of leading causes of death does not include medical errors. If the list did, preventable medical errors would rank third, behind heart disease and cancer [36]. Additionally, hospital-acquired infections account for more than 90,000 deaths each year and are largely preventable [7][10]. Few individuals now doubt that preventable medical injuries are a serious problem. However, there is still uncertainty among medical professionals as to the best means to combat this epidemic [17].

Many high-hazard industries begin combating preventable mistakes by focusing on the flaws within the system and not by blaming and shaming the people in those systems [17]. For this reason, interest has increased in technologies that support safer care by reducing the complexity and problem areas within the health care system. Studies report that Electronic Health Record (EHR) systems could obviate many of the preventable deaths while saving 81 billion dollars annually [14]. While EHR systems are a step in the right direction, computerized assistance can help even further. One key assistive technology has been the computer-assisted physician order entry (CPOE) system. In one experiment, implementation of a CPOE application leveraging an existing EHR system decreased serious medication errors by 55% [4].

CPOE systems aid in solving the problem, but they are not a complete solution. These systems are only as good as the data they have access to. The current data landscape makes access to all the timely, relevant information incredibly challenging. At least 35 different laboratory information systems are used in the U.S. [2]. Adding to the complexity, between 200 and 350 vendors offer EHR products in the U.S. that interact with any number of the lab systems [31][22][9]. These are just the publicly sold systems; many more hospitals have developed their own in-house, custom EHR systems. Thus it is no surprise that today a patient's data can easily be lost during a migration from one hospital to the next, or when a hospital switches from one vendor to another.

Without access to all the relevant data, these systems will never be able to prevent the crashing jumbo jets from killing all the patients within. Not only does the right data need to be present at the right time, but what the computerized systems really need to act upon is information, not data. Artifacts from different EHR systems often distort the information and hamper more advanced systems like CPOEs from working to their full potential. We seek to address these shortcomings by integrating all the data a Decision Support or CPOE system needs, when it is needed, by transforming and integrating all of a patient's EHR data into the requisite information.

While some have proposed standardization as a means of solving the integration problem, such approaches have suffered from many hurdles. A standard has to be created (there are many competing standards today). Even if one were settled on (for instance Health Level Seven's HL7), it would only be effective once all vendors supported the standard. This would involve significant work and money, and currently few EHR companies comply with the accreditations available [40]. Finally, even if all these problems were overcome, the data would be syntactically integrated but not semantically integrated. That is, a concept such as "female" could be represented by the designation "2" at one hospital and by an "F" at another. A CPOE will not know that they are the same without being told.

The other possibility is to implement a semi-autonomous algorithm that not only integrates the data from disparate sources, but also integrates the meaning, or semantics, of the data across organizations into a common information and knowledge representation. Some algorithms have been created to do exactly this; the best known is BioSTORM. BioSTORM is an integrated knowledge representation system aimed primarily at the bio-surveillance domain. Its chief problem is the overly simplistic and brittle assumptions made in its architecture [21]. Regardless, an integration and knowledge creation algorithm will have to undergo rigorous testing before medical organizations will trust it with their data and put their faith in the results. Unfortunately, there has been a complete lack of validation and systematic verification of medical data integration algorithms. A gold standard to compare against is sorely needed for the field to progress and gain credibility.

    Specific Aims:

1. To produce an algorithm that semantically integrates heterogeneous data from diverse sources. This algorithm will not only integrate the data but use the extracted knowledge to further augment those sub-systems that provide it with data and/or information. This will be a simultaneously top-down and bottom-up process. It enables preexisting knowledge and recently encountered experiences to transform into something better than the sum of its parts. Further, we contend that existing, common representations, from relational databases to textual repositories, occlude all the information that is present in the data from outside extraction and use.

2. To generate a gold standard to serve as an evaluation criterion against which integration algorithms can be judged, with medical doctors serving as domain experts. This is a very successful strategy that, by providing an objective fitness function, has spurred areas from natural language processing (NLP) to decision support to flourish. For the class of heterogeneous integration algorithms, particularly within the medical community, no such gold standard exists. The inability to objectively compare implementations of disparate medical data integration is a real rate-limiting factor and one pointed out commonly in the literature.

B. Background and Significance

Merriam-Webster defines a mistake as: to blunder in the choice of, to misunderstand the meaning or intention of, and to identify wrongly. When a physician prescribes the wrong medicine for a patient, a blunder has been made that often is catastrophic. When a surgeon misunderstands the orders, the wrong body part is operated on. When a patient is misdiagnosed, the consequences can be fatal. So fatal, that it is estimated that more than 195,000 patients die each year from preventable in-hospital medical errors. This is equivalent to more than one 747 jumbo jet full of patients crashing every day. Indeed, the safety record of the health system in the United States lags far behind the similarly complex aviation industry. A person would have to fly nonstop for 438 years before expecting to be involved in a deadly airplane crash [19]. These numbers don't encompass the whole problem, because in-hospital patients are a small fraction of the total patient population. Retail pharmacies fill countless prescriptions each year, nursing homes watch over the elderly and infirm, and doctors' offices mete out care in an outpatient setting. The 195,000 patients dying each year are in-hospital patients, the tip of the iceberg. No one knows how deep the iceberg extends beneath the United States' troubled waters.

B.1 Medical Errors

The American Hospital Association (AHA) lists the following items as common types of errors: incomplete patient information, unavailable drug information, and miscommunication of drug orders [19]. Patient information is incomplete and not accessible in its entirety at the point of care. A patient's allergies or all the medications being taken might hide in the pharmacy's database. Previous diagnoses for past illnesses or additional current ailments might reside in another doctor's electronic health record (EHR). Pending lab results might be tucked away in a third party's laboratory information management system (LIMS). At least 35 different LIMS are used in the United States, and 20,580 labs have been certified by the U.S. Department of Health and Human Services. It is common for practices to receive results for a patient from multiple laboratories. What makes matters worse is that between 200 and 300 vendors offer ambulatory EHR products in the U.S. [28], on top of the number of custom, in-house EHR systems that exist.

Another problem area is the current disconnect between warnings that are released by the FDA or pharmaceutical companies and the patients taking the drug. Doctors arguably are supposed to be that link, but they can be caught unaware for many reasons. Time is a scarce resource, and the doctors and their staff are often overworked. The method of dissemination can introduce difficulties. Paper notices can get lost, emails can get put in a spam folder, or websites can go down. Even if time and dissemination don't fail, a doctor must remember all the drugs for all of the patients under their care, which is simply not possible without using an electronic system more specialized than the EHR, called the computerized physician order entry (CPOE) system. The CPOE allows medical staff to electronically order drugs and automatically check for interactions, complications, and warnings. However, a CPOE system is only as good as the information it has to work on.

Finally, the orders for drugs themselves can be mistakenly communicated. "Name confusion is the most heinous example," says Peter Honig, M.D., an FDA expert on drug risk assessment [19]. Take the sound-alike names of Lamictal, an antiepileptic drug, and Lamisil, an antifungal drug. The volume of dispensing errors for these two drugs was so vast that campaigns had to be waged to warn physicians and pharmacists of the potential confusion. It still hasn't stopped epileptic patients from dying when they are erroneously given the antifungal drug Lamisil instead of Lamictal. The FDA has clamped down on sound-alike names of drugs, but the problem is far from eradicated. A CPOE system can only go so far in assisting to increase the specificity of the drug ordered. Sloppy handwriting and misdosing might disappear, but not until the CPOE system can understand the semantics of a patient's condition and the drugs that should be used to treat it will there be any true progress in eliminating these types of mistakes. After all, a CPOE system would not prevent a doctor from treating an epileptic seizure with antifungal medicine, because a patient might have both conditions. If a CPOE system had access to all the information relevant to a patient, unlike the current state, it could realize the patient is in the hospital for a seizure episode based on the chief complaint entered during the patient's admittance, and further know that antifungal drugs are not ordered for use by nurses in hospitals but rather as an outpatient treatment.

    B.2 Perspectives of Error

While many errors are being committed every day, who is to blame? The state of malpractice suits might suggest that the doctors and staff are at fault. In reality, however, medicine is practiced within a very complex system. This complexity, coupled with the limitations of humans, makes avoiding mistakes an extremely demanding task.

In its landmark report, To Err is Human, the Institute of Medicine (IOM) suggested moving away from the traditional culture of picking out an individual's failings and then blaming and shaming them [17][19]. The IOM asserts that systems failures are at the heart of the problem rather than bad physicians. This is a crucial scientific foundation for the improvement of safety found in all successful high-hazard industries. Errors are seen as consequences rather than causes, having their origins in upstream systematic factors. The central idea is that of system defense within the complex environment of health care. In aviation maintenance, an activity whose complexity is close to that found in medical practice, 90% of mistakes were judged to be blameless [23]. Over the past decade, research into error management determined two essential components for managing unsafe acts: limiting their incidence and creating systems that are better able to tolerate occurrences of errors and contain their damaging effects. As a result, interest in technological solutions to manage the complexity of the health care system has surged [19].

B3. Complexity

Healthcare is more complex than any other industry in terms of relationships. There are over 50 different types of medical specialties, which in turn have further subspecialties, interacting with each other and with an equally large array of allied health professionals. The more complex a system is, the more chances it has to fail. Figure 1 demonstrates the complexity of one routine activity at a pediatric cardiovascular surgical care unit in a hospital.

Medical care today depends on the interactions between many different individuals and parts of the healthcare system. As complex as health care is today, it is just as fragmented.

Figure 1: Process map in healthcare and between healthcare and secondary care [3]. Pediatric cardiothoracic surgical team members at two large urban medical centers delineated the steps of care from the patient's perspective, starting with the referral for surgery until the child's first post-discharge follow-up visit. Notice that different groups of individuals with different backgrounds and levels of experience are involved, and this process crosses multiple organizational boundaries.

There is an overemphasis on focusing and acting on the parts without looking at relationships to the whole: the patients and their health [25]. However, complex systems are more than the sum of their parts [34][18]. Unfortunately, specialization, with its commensurate focus on only a narrow portion of a patient's medical history, has exploded without a comparable growth in the ability to integrate the information surrounding a patient. As a result, our ability to turn data into information and information into knowledge has diminished [25]. Thus, while the US continues to make advances technologically, its healthcare system remains fragmented, ranking 37th in performance compared to other systems in the world [35]. Current information technology systems focus narrowly on the care of individual diseases, rather than on higher-level integration of care for prevention, mental health, multimorbid conditions, and acute concerns [12][26][27]. Knowledge generation is narrowly partitioned in disease-specific institutes and initiatives without sufficient balancing research that transcends these boundaries.

B4. Data, the Rate Limiter

Healthcare suffers from data explosion [1,2]. The US Government Census [6,7] indicated that there were 6,541 hospitals in 2002 and 7,569 in 2005. There were 33.7 million hospital inpatient discharges in 2002; on any given day, there are 539,000 hospital inpatients nationwide, excluding newborns. Each patient generates a multitude of structured lab reports and unstructured summary notes by ER staff, nurses, and doctors. There are one or more EHR systems for a patient at each of the different hospitals the patient might have visited during the course of his or her life.

Figure 2 [1] illustrates the diverse ecology of today's clinical data center. It is common for departments to generate the same type of data but have it provided by different vendors or custom systems.

B5. Standards aren't the solution

Standards have been around for more than a hundred years. These early standards were for single, simple things like the wattage of a light bulb or the spacing of threads on a screw. The complexity of the systems and industries requiring standards has grown significantly since then. An EHR, for example, is so complex that it cannot be described in a single standard, but will require hundreds of new and existing standards [32][33]. Furthermore, there are many competing standards organizations that need to be involved, such as: the International Organization for Standardization (ISO), Clinical Data Interchange Standards Consortium (CDISC), International Healthcare Terminology Standards Development Organization (IHTSDO), the World Health Organization (WHO), National Council for Prescription Drug Programs (NCPDP), Health Level Seven (HL7), Health Information Technology Standards Panel (HITSP), the European Committee for Standardization (CEN), and ASTM International. Each of these organizations has its own working groups on different aspects of the EHR, such as CEN's: Data Structure, Messaging and Communications, Health Concept Representation, Security, Health Cards, Pharmacy and Medication, Devices, and Business Requirements for EHRs. Indeed, Charles Jaffe, M.D., the CEO of HL7, is quoted as saying, "There is not a lack of standards; there are, in fact, too many standards and too many organizations writing them" [15]. Sam Karp, Vice President of Programs, California Healthcare Foundation, testified to the Institute of Medicine as follows: "Our experience with the current national data standards effort is that it is too slow, too cumbersome, too political and too heavily influenced by large IT vendors and other large institutions. The current approach to standards development is too complex and too general to effectively support widespread implementation."

Figure 2: Different types of data in a Hospital's Clinical Data Center [1]. Analog data from ECG waveforms and sound, binary data from images and videos, structured data from the EHR systems of various departments, and free text from notes and reviews are all present for each patient seen at each hospital a patient has been to. Different departments have different types, formats, and intent behind their data. These often form silos that must be breached to allow for true integration of the patient's medical history.



Assuming a national EHR standard is reached at some point in the next five years, the task of compliance must then be surmounted. This is a rather daunting task because of the sheer size of the healthcare industry in the United States. There are at least 50 different medical specialties and subspecialties at more than 7,569 hospitals being served by over 300 private EHR vendors and/or custom-built EHR systems [28][5]. These hospitals are served by 20,580 laboratories, each using a version of a LIMS from over 35 different vendors. Then there are the varied pharmacists using their own systems. Vendors must be convinced to implement the new standards with little benefit; the additional cost is not compensated for directly. Adoption is a significant hurdle in and of itself [32][33].

When all of this has been accomplished, data interoperability will finally be realized. However, the integration of this data will only be superficial. While the structure of the data might be dictated by the standards, the content of the data, or the information within the data, remains isolated. One vendor might represent the concept of human female by the numeral 2, while another represents it as the letter F, and a third as a reference to another location (a demographics table, for instance). Even higher-level concepts, like the history of all medications taken by a patient, can be represented in an uncountable number of ways while still complying with the letter of the standard. Worse, a standard cannot account for all data a vendor might have, so often there are locations within the standard for optional data. All of these factors make true interoperability extremely difficult, even when all parties comply with all the diverse standards that exist.
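To make the semantic gap concrete, the minimal Python sketch below shows the kind of explicit, vendor-by-vendor mapping that is otherwise left implicit; the vendor codes and the concept label are illustrative assumptions, not values taken from any actual EHR product.

```python
# Minimal sketch (not part of any vendor's implementation): mapping
# vendor-specific encodings of the same concept onto one shared label.
# The vendor names, codes, and label below are hypothetical.

VENDOR_SEX_MAPS = {
    "vendor_a": {"1": "male", "2": "female"},   # numeric codes
    "vendor_b": {"M": "male", "F": "female"},   # letter codes
}

def normalize_sex(vendor, raw_value):
    """Translate a vendor-local value into a shared concept label."""
    return VENDOR_SEX_MAPS.get(vendor, {}).get(raw_value)

# Both records denote the same concept, but no CPOE can know that
# without an explicit mapping such as this one.
assert normalize_sex("vendor_a", "2") == normalize_sex("vendor_b", "F") == "female"
```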

    B6. Grid-based approach

The Novo Grid uses agents as distributed software entities to exchange information between people and applications. Agents are task-specific software programs that can be installed in any location. The idea behind the concept is to create small, modular programs interacting through a workflow activity. This both encourages efficient composition and reduces design time. Once installed, agents could theoretically interact with their local environment to extract and insert information and exchange that information with other agents to automate workflow tasks. However, the agents are still specific instantiations of domain knowledge aimed at a specific data source at a specific location. Even though the agents are small and lightweight pieces of code, this is still no guarantee of their efficiency and usefulness. The most important thing from a data integration standpoint is what data they were created to interact with.

For example, an agent installed on a server in a hospital's data center can listen for HL7 result messages from the hospital's interface engine. Without a mechanism for the formalized representation of knowledge about what the data is, what the formats for the data are, what the domain's constraints are, and what the environment's specifics are, it is difficult to make this process efficient. Its programming will be overly specific to a narrowly defined input and transformed output, and will not allow for the degree of generalizability necessary to overcome the combinatorial explosion that exists within the health care system: namely, that any party can want data from any other party at any time.

Finally, for inter-agent communication, a blackboard or post-office metaphor is implemented. The Novo Grid provides a messaging post office that allows agents to send information from one to another. However, there still must be a common mechanism for communication, which for the Novo Grid is a vendor-specific standard. That is, the agents communicate through message passing that they have been pre-programmed to understand. There is, however, an alternative: knowledge about the system could be explicitly represented so that it is comprehensible to both humans and machines.

    B7. The Semantic Web

The semantic web is an explicit, graph-based representation of knowledge that is both human- and machine-comprehensible and computable. This framework is designed to enable data to be shared, federated, and enhanced regardless of system boundaries. The semantic web was designed around the Open World Assumption: that truth is independent of whether it is known by any single entity. That is, the graph-based knowledge formalized into an ontology can be extended by another entity. The language used to represent knowledge is called the Resource Description Framework (RDF).

Its basic representation of a statement of knowledge is the triple: a subject, a predicate, and an object. Subjects and predicates are resources that refer to concepts which, via the Open World Assumption, can be extended or changed at any time. Objects can be either resources or literal values (text, numbers, dates, etc.). Resources are unique and globally identifiable through the use of a Uniform Resource Identifier (URI), of which a Uniform Resource Locator (URL) from the internet is a subset.
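As an illustration of the triple model, the short Python sketch below builds a few statements with the open-source rdflib package; all URIs are hypothetical placeholders rather than identifiers from any of the systems discussed here.

```python
# Sketch of the subject-predicate-object model using rdflib.
# All URIs are invented placeholders for illustration only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/ehr/")

g = Graph()
patient = URIRef("http://example.org/patients/12345")

# Each statement is a (subject, predicate, object) triple.
g.add((patient, RDF.type, EX.Patient))
g.add((patient, EX.hasSex, EX.Female))            # object is another resource
g.add((patient, EX.hasBirthYear, Literal(1945)))  # object is a literal value

print(g.serialize(format="turtle"))
```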

B8. BioSTORM (Biological Spatio-Temporal Outbreak Reasoning Module)

One of the most well-known attempts at semantic integration of disparate data is BioSTORM, a system for automated surveillance of diverse data sources. The system is agent-based, like the Novo Grid, with a three-layered architecture consisting of a Knowledge Layer, an Agent Platform, and a Data Source Layer. The Knowledge Layer is composed of a library of problem-solving methods and a collection of ontologies defining the system components. The problem-solving methods are implemented either by BioSTORM or by domain experts according to an Application Programming Interface (API). The ontologies in this layer describe system variables, the problem domain, and deployment configurations. The Agent Platform is the middle layer, and it primarily deploys agents based upon the specifications in the Knowledge Layer. The agents act according to the knowledge describing the tasks and problem domain. The Data Source Layer describes the environment through an ontology. This provides a binding between the tasks the agents are to undertake and the actual physical resource [21].

However, BioSTORM has several shortcomings. It is focused on the bio-surveillance area within public health. Its data broker component is concerned with mapping from the raw data to the needs of a particular problem-solving method from the Knowledge Layer. This does not scale, because a particular method might require any given subset of data within the cloud of all patient EHR data. There are simply too many vendors supplying EHR systems to a vast array of different users who have any number of different data sources and types. Then couple this with all the possible uses the patient, the doctors, the researchers, the government, etc. might have for the data. Furthermore, this methodology requires a priori knowledge of the data sources. The implementer of an agent using the data broker must know what they are looking for and where it is located. It does not allow for exploration of and change in the data present.

    C. Preliminary Studies

C.1 Modeling Data

The semantic web will be used to enable extensions and enhancements to the data. An ontology is a formal specification of a conceptualization. It describes kinds of entities and how they are related, versus a data schema, which speaks more to message formats (syntax) than to knowledge representation. Ontologies also provide generic support for a machine to reason over, by making data machine-understandable (historically it has been only human-comprehensible). An ontology defines the individuals, classes, and properties within a domain, formally organizing knowledge and terms. The Web Ontology Language (OWL) is the mechanism for doing just this. Statements are asserted about a particular subject in the form of a triple composed of a subject, a predicate, and an object. From these triples, or statements, logical consequences and facts entailed by the semantics can be derived. A Uniform Resource Identifier (URI), of which the Uniform Resource Locator (URL) is a sub-type, is used to specify unique identifiers for a subject, its properties, and in some cases the object of a statement. It is the use of URIs that makes it easy to extend ontologies by uniquely identifying a concept, taking into consideration what others have asserted about the same URI. It is precisely these features that we leverage in the proposed Fusion System.
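The following small sketch, again using rdflib under our own assumptions, illustrates the open-world behavior we intend to leverage: two independent sources assert statements about the same URI, and merging their graphs simply accumulates both sets of facts. The hospital names, properties, and values are invented for illustration.

```python
# Open-world extension sketch: two sources describe the same resource,
# and the union of their graphs needs no renaming or reconciliation
# of identifiers. URIs and properties are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/ehr/")
patient = URIRef("http://example.org/patients/12345")

hospital_a = Graph()
hospital_a.add((patient, EX.hasAllergy, Literal("penicillin")))

hospital_b = Graph()
hospital_b.add((patient, EX.hasDiagnosis, Literal("epilepsy")))

merged = hospital_a + hospital_b   # graph union of both sets of triples
for s, p, o in merged.triples((patient, None, None)):
    print(p, o)
```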

C.2 Integrated Department Information Service (IDIS)

The mechanism of integration was used to augment the extraction of information from different data sources. The Intra-Departmental Integrated Search (IDIS) project integrated hierarchical (database) and textual (web page) search results, mutually reinforcing each other. The goal was to process a domain (www.cs.uiuc.edu) and enable better search results by leveraging all the data within that domain, from its relational databases (RDBS) to the textual data. We placed the RDBS data within an Oracle database, which comprised things like faculty information and student rosters. Knowledge about the database was extracted from metadata (i.e., the schema) and insight provided by the textual processing. The textual processing began as an automated crawling of the www.cs.uiuc.edu domain. Once all the html pages were collected, an open-source text search engine was employed to extract phrases. The results were integrated via a web-based GUI.

IDIS integrated across three different dimensions. Firstly, it not only combined results from both a relational and a text database, it was able to leverage the results from one to get higher-quality information from the other. After the user's keyword query was transformed by the SQL Query Composer into an appropriate query to the relational database, these results were utilized to provide the context upon which textual results were extracted from the large corpus of departmental web pages. Then the textual results drove how the structured results from the database were presented inside of the Presentation Layer.

Figure 3: The IDIS System received queries from the internet, which were processed into SQL queries. The results of the SQL queries were then used to extract contextualized and relevant results from the text database, which contains all the department's internet pages. Some of the data is injected into the relational database, while the majority remains in the text database. Most of the data from the relational database came from the department's structured data. The presentation layer not only displays the results but assists in future querying by providing context to both the user and the system.

This synchronicity was one of the key components we are going to utilize in the currently proposed research. There is information that is uniquely embodied in the textual data that simply cannot be conveyed in the structured, relational database repository. For example, RDBMSs traditionally do a poor job of efficiently storing the often sparse structure of text, which makes it extremely difficult to interact with. Likewise, the relational database has information explicitly represented in its structure (particularly its schema) that cannot be represented in the typical web page. However, both of these hurdles are addressed in XML for humans and made machine-understandable in the semantic web's RDF/OWL ontologies.

Another very relevant area of integration was across time. IDIS used past context to enhance current search. The history of a user's search and interaction with the site provided a trajectory through which current needs were channeled. For instance, while trying to prioritize a user's query results, past activity within the area of courses offered by the college would cause professors related to those past searches to be ranked higher and displayed more prominently. Finally, search was integrated with navigation. Indeed, how the user interacted with the results provided useful information for current and future searching as well. A user might drill down on a professor's research interests to then find classes related to those interests. These two cases are particularly interesting, because both can be conceived as trajectories, or graphs, of a user's behavior. In both cases, the user's current actions extend the graph representing all past interactions with the information contained across textual and relational databases. This is one of the reasons why using the semantic web, a graph-based representation of data, should be so powerful in our proposed project. Many of these synergies are easier and/or augmented by using a representation that more closely approximates the phenomenon we are trying to elicit.

From the user's perspective, the IDIS system completely abstracted the representation of the data from the user's interaction. The users of the system did not need to know the structure of the databases nor even how to interact with them (i.e., the SQL query language). Rather, the user interacted with the system and requested additional information from it by navigating through the relationships between the information. In some cases that information was present in the user's head (via keyword queries), but in other cases it was present in the representation of the information to the user by the system.

C.3 Computerized Medical Systems

While at Computerized Medical Systems, I was responsible for creating a new system for integrating medical and health data from within the company and across its partners. This project differed from IDIS in that it dealt with a wide assortment of medical data, from digital images to demographic information to medical records.

Each application had its own particular means of representing the data. Much of the company's future hinged on being able to access data uniformly from anywhere, because of the exponential growth of complexity as new applications were introduced that then needed to interact with existing applications. When every application needs to potentially interact with every other, typical methods using standards and message passing simply did not scale. This burden was threatening to halt future application development. Worse, the hurdle for a collaborator to work with our products was even more insurmountable, because at least developers within the company had a familiarity with internal standards. However, giving away proprietary formats represented a problem in both intellectual property and time. Even if there was no concern of inadvertently giving away a secret, collaborators would be burdened with spending the time to understand long specifications about each product's data requirements.

    C.4 XML-to-Ontology (X2O)

My work at the School of Health Information Sciences (SHIS) has given me over 18 months of experience with all the technologies needed to complete the Fusion Algorithm. I will be presenting the work I have done while at SHIS at the upcoming SemTech 2009 conference in a talk titled "What is in XML: Automated Ontology Learning and Information Transformation from Continuously Varying Clinical Data."

C.4.1 Raw Data

At SHIS, I have been researching and improving upon existing algorithms to integrate the Electronic Medical Records (EMRs) from 7 local hospitals. Each hospital's data is different from the others' and represents information from its emergency department and patient physician visits. The data are received in unstructured XML (see Figure 4). The schema for the XML is unknown, and the structure can change without notice. Data are received as a stream from each hospital, with a packet representing a snapshot of what occurred in the previous ten minutes. Over 4 million messages have been received during the past 4 years.

It is important to note that since there exists a trivial transformation of a database into XML, X2O can operate on both XML and relational databases. Relational databases are actually easier than XML, because with a database a structure is necessitated, and a great amount of latent information resides within its required schema. Comma-delimited lists, another common file format, can also be trivially converted into an XML document. Outside of binary file formats, starting with XML has proved to be the most comprehensive starting point for most of the common, structured data formats. Further, X2O does not need an XML Schema Definition (XSD) for the XML it processes. Again, the mining of relevant information from such supplied structure, should it be available, is potentially very exciting; however, it isn't necessary, because we perform well without it. There is no real need to "standardize" the input data, because the ontologies describing the data present in X2O serve as a distributed mechanism for translating the data from machine incoherence into machine comprehension. Increased human readability and understandability is also a by-product of the X2O process.
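As an illustration of the trivial transformation referred to above, the Python sketch below turns a comma-delimited export (hypothetical column names) into an XML document of the sort an X2O-style process could consume; it is an example of the idea, not X2O's own converter.

```python
# Sketch: converting tabular rows (e.g., a CSV export of a relational table)
# into an XML document. Column names and values are hypothetical.
import csv
import io
import xml.etree.ElementTree as ET

CSV_TEXT = """mrn,sex,dob
12345,F,1945-03-02
67890,2,1950-11-17
"""

root = ET.Element("patients")
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    patient = ET.SubElement(root, "patient")
    for column, value in row.items():
        # Each column becomes a child element; the schema is carried
        # implicitly by the element names, much as a database schema would be.
        ET.SubElement(patient, column).text = value

print(ET.tostring(root, encoding="unicode"))
```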

C.4.2 The Mapping

As discussed in the introduction, the mappings are done through ontologies. The inherent open-world assumption around the Semantic Web means that adding further semantics, statements about concepts, is a trivial matter. The technology has such expectations built into it at every point, so when there is an update to or replacement of a medical ontology (UMLS, for instance), there is only a growth in understanding by the machines that process the concepts and no additional work on a human's part to update anything.

There are four levels of ontologies in X2O. The first level defines, at its most atomic, what it means to be data, such as concepts and descriptions of those concepts. It is called the Core Data Model, and it is the foundation upon which the next level resides. For each type of data, from XML to databases, there exists a model that expresses all the nuances of that particular medium: the Data Specific Model. The third layer is the Domain Specific Model. It creates a mapping from within the domain, leveraging what the system explicitly knows from the previous levels to enable semi-autonomous mapping.

Figure 4: Sample of the XML data received every 10 minutes. The contents of the XML are subject to change without notification.

A domain expert describes the elements and attributes from the perspective of the XML ontology. The important distinction from other efforts, however, is that an exhaustive mapping of every possible value is not necessary. Rather, patterns are described within the data based upon location within a snippet relative to other elements and attributes. This flexibility captures all of the mappings traditionally found, but does so more elegantly and extensibly. Because of this feature, if a new value appears within the XML, it is usually still captured.
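The toy sketch below illustrates, under invented element names, what a location-relative pattern looks like in contrast to an exhaustive value list; it is not the Domain Specific Model itself.

```python
# Illustrative sketch: a mapping rule that fires because of where nodes sit
# relative to one another, not because the literal value was enumerated.
# Element names and codes are hypothetical.
import xml.etree.ElementTree as ET

SNIPPET = """
<observation>
  <code system="hypothetical-lab-codes">2345-7</code>
  <value unit="mg/dL">101</value>
</observation>
"""

def map_by_relative_location(xml_text):
    root = ET.fromstring(xml_text)
    mapped = []
    for obs in root.iter("observation"):
        code = obs.findtext("code")
        value = obs.findtext("value")
        if code and value:
            # A new code or value would still be captured, because the rule
            # keys on the surrounding structure rather than the content.
            mapped.append(("LabResultConcept", code, value))
    return mapped

print(map_by_relative_location(SNIPPET))
```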

The fourth and final ontology level could be expressed in the Semantic Web languages, but for efficiency it was expressed in the programming code. This is called the Action Model, and it provides the intentionality upon which the XML model, in this case, is operationalized. This goes beyond just the programming necessary to control a computer's manipulation of the data. Within the code lies a model that expresses the statements and meaning behind each element and attribute encountered within the hospital data streamed through our system, answering what can be said about a fundamental class located within the data that will be described within the TBox versus the ABox.

C.4.3 Preprocessing

As the raw data is received, it isn't automatically processed into its final form. Rather, there exists an intermediary format that is quite novel. For each XML element, including all attendant attributes, a node is created. At the end of this pass, a hierarchical tree structure has emerged that has indexed and regrouped the entire message. Because an XML element might not actually be meaningful, but rather some attribute within the element is, this preprocessing essentially allows for a transformation of the message into the conceptual representation that X2O will explicitly construct.
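A rough sketch of such an intermediary pass, written under our own simplifying assumptions rather than taken from the X2O code, is shown below: each element, with its attributes folded in, becomes a node in a plain tree that later stages can walk.

```python
# Sketch of an intermediary pass: one node per XML element, with attributes
# and text folded in. The sample message and tag names are hypothetical.
import xml.etree.ElementTree as ET

class Node:
    def __init__(self, tag, attributes, text):
        self.tag = tag
        self.attributes = dict(attributes)
        self.text = (text or "").strip()
        self.children = []

def build_tree(element):
    node = Node(element.tag, element.attrib, element.text)
    for child in element:
        node.children.append(build_tree(child))
    return node

xml_text = "<visit id='v1'><patient mrn='12345'/><note>chest pain</note></visit>"
tree = build_tree(ET.fromstring(xml_text))
print(tree.tag, [child.tag for child in tree.children])  # visit ['patient', 'note']
```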

C.4.4 Construction of TBox and ABox

The final stage of processing sees the data bifurcated into a TBox and an ABox. One particularly interesting property of X2O is that regardless of whether the XML possesses any metadata, the system elegantly creates a TBox for it. The TBox is compact, but as demonstrated in prior experiments, domain experts agree that it accurately describes the concepts present. The ABox generated is a repository of the instances already present within the XML, and as such is a more meaningful form of that data than when it resided within the XML. This is because the ABox has further meaning applied to it through the generated TBox and the other models, such as the Core Data Model and any outside knowledge sources imported.
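The following hedged sketch illustrates the TBox/ABox split using rdflib; the class, property, and instance names are invented for illustration, and the real X2O models are considerably richer.

```python
# Sketch of the TBox/ABox bifurcation. Names and URIs are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/x2o/")

tbox = Graph()   # terminology: what kinds of things exist
abox = Graph()   # assertions: the concrete instances seen in the XML

# TBox: an element tag observed in the data ("note") becomes a class.
tbox.add((EX.Note, RDF.type, OWL.Class))
tbox.add((EX.Note, RDFS.label, Literal("note")))

# ABox: a particular note from one message becomes an instance of that class.
note_instance = URIRef("http://example.org/x2o/instances/note-0001")
abox.add((note_instance, RDF.type, EX.Note))
abox.add((note_instance, EX.hasText, Literal("chest pain")))

print(len(tbox), "TBox triples;", len(abox), "ABox triples")
```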

    C.5 BLUEText

BLUEText addresses the problem of extracting information from clinical text found within EHRs, which includes triage notes, physician and nurse notes, and chief complaints. Unfortunately, this text is often written in shorthand: lacking proper grammatical structure, prone to using abbreviations, and possessing misspellings. This hinders the performance of traditional natural language processing technologies that rely heavily on the syntax of language and the grammatical structure of the text. Figure 5 illustrates the architecture of BLUEText.

BLUEText has superior results on clinical text, which is often not only unstructured but unconstrained. That is, it is often grammatically incorrect, misspellings are rampant, and abbreviations of concepts frequently occur. The information is often compressed, and so the deviation from proper structure contains more information than if the sentence had been properly written. One such example is: "Pt a 76 yrs old Af/Am fem. w/ bp 50/120, Hx of HA x 3 years." The sentence could be properly written as: "The patient is 76 years old and an African-American female. She has an initial diastolic blood pressure of 50 mmHg and a systolic blood pressure of 120 mmHg. She has had three heart attacks that we are aware of."
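A toy normalization pass of the kind that naturally precedes deeper processing is sketched below; the abbreviation table is our own assumption for illustration and is not BLUEText's lexicon or pipeline.

```python
# Toy sketch: lookup-based expansion of clinical shorthand. The abbreviation
# table below is an invented illustration, not BLUEText's actual lexicon,
# and real notes require far more than token substitution.
ABBREVIATIONS = {
    "pt": "patient",
    "yrs": "years",
    "af/am": "African-American",
    "fem.": "female",
    "w/": "with",
    "bp": "blood pressure",
    "hx": "history",
    "ha": "heart attack",
}

def expand(text):
    tokens = text.split()
    return " ".join(ABBREVIATIONS.get(token.lower(), token) for token in tokens)

print(expand("Pt a 76 yrs old Af/Am fem. w/ bp 50/120, Hx of HA x 3 years."))
```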

Figure 5: Schematic depiction of the BLUEText processes and ontologies. Existing knowledge sources such as UMLS are leveraged by the Semantic Knowledge of the system. Additionally, the Syntactic and Domain Knowledge are all used when generating the Conceptual Graph. After a preliminary text preparation phase, the Minimal Syntactic Parser first parses the text and then analyzes the syntax. The results of the syntactic analysis form a parse graph that is comprised of tokens of text mapped to the Syntactic Knowledge. The Semantic Interpreter performs automated ontology mapping, using the parse graph to map from the tokens derived from the original text to the concepts from the domain knowledge. Also, a separate algorithm indexes these tokens semantically. The interpreter uses this knowledge and the parse graph to construct a conceptual graph that represents the content of the input text. The Conceptual Graph is an intermediary and generic output that builds on the parse graph and maps it to the appropriate concepts from the domain knowledge. All nodes in the conceptual graph have both a positional and a semantic index that will be used by the Output Constructors to extract different formats of outputs defined by user- and/or task-based queries. The Semantic Interpreter uses this information to improve the conceptual graph through a normalization and disambiguation process.

D. Research Design and Methods

The overall research plan is summarized in the flowchart in Figure 6. Tasks on the flowchart are numbered as x.y, where x corresponds to the Specific Aim associated with the task and y represents the specific task itself. The same numbering scheme is used in the text of this section, for easy reference to the figure.

D.0 Choose 100 Records: The same corpus of data from our prior work (C.4.1) will be leveraged for this experiment as well. Because of our need to validate our results and create a gold standard, it was important to devise a suitable number of patient records to provide generalizability while not overwhelming the human medical experts. The records will come equally from both the Cerner systems and the Allscripts data. We will attempt to keep an even distribution across age, race, gender, and primary diagnosis (ICD-9 code).

D.1 To produce an algorithm that semantically integrates heterogeneous data sources.

D.1.1 XML-to-Ontology (X2O): This system will be used with little modification. The system architecture currently fits into the conceived architecture of the Fusion Algorithm. The Fusion Algorithm takes as input the resultant TBox and ABox generated after X2O completes its processing of the raw XML messages.

i. Results: Ontologies that can serve as input for the Fusion Algorithm. This portion of the system also controls what data the BLUEText (BLU) sub-system processes. In that context, the corpus of text resides solely in the free-text portions present within an XML message. X2O will determine when a section of XML is suitable to be processed by BLU based on heuristics such as the size of the value and prior information contained within the ontologies.

Figure 6: Flowchart of activities in the project. The numbers in the boxes correspond to the task numbers used in the text in Section D.

ii. Problems: One problem is that X2O might fail to provide BLU all the text that BLU could process, thereby missing information that could have been extracted. Work, and especially validation, will need to be done to ensure that X2O is accurately filtering the data BLU processes.

iii. Alternate Approaches: There are alternatives to X2O; BioSTORM is one such example. In the worst case, BioSTORM could be used, but it is very unlikely that contingency would be needed. Further, we have already discussed the shortcomings of BioSTORM, and why X2O is a superior choice, in section B.8.

iv. Transition: Because most of the work for this component is already complete, any modifications that need to be done can occur concurrently with the modifications of BLU and the creation of Fusion. That is, only minor adjustments should be necessary, and they therefore should not obstruct timely progression.

v. Expertise: X2O was designed by the PI and other investigators on the grant.

D.1.2 BLUEText (BLU): Currently, BLUEText processes text into ontologies, but not in the format required by the Fusion Algorithm (TBox and ABox). The refactoring to accomplish this will be significant but well understood and straightforward. There are no expected difficulties in this rework. Fortunately, as with the XML-to-Ontology algorithm, we were the creators of BLU, so we are confident in our assessment of the scope of the modifications required.

i. Results: Some rather significant work remains until BLU will be in a state to contribute to Fusion.

ii. Problems: The major concern is that our estimate of the scope of the changes necessary for BLU to comply with the input expected by the Fusion Algorithm is not accurate. This seems unlikely based upon our extensive experience with the system; plus, in the worst case, the output of BLU could be treated as XML by X2O. As such, X2O could transform the output of BLU into the output expected by Fusion.

iii. Alternate Approaches: There are many NLP solutions, some of them specific to the medical domain, that could be used. However, most medical NLP solutions are proprietary and therefore expensive. The most well-known open solution is the NLM's SPECIALIST NLP Suite of Tools, developed by the Lexical Systems Group of the Lister Hill National Center for Biomedical Communications to investigate the contributions that natural language processing techniques can make to the task of mediating between the language of users and the language of online biomedical information resources.

iv. Transition: As explained in D.1.1.
v. Expertise: We designed the algorithm.

Figure 7: Overview of the system architecture of the Fusion System. Structured data is handled by the XML-to-Ontology sub-system through XML messages. The NLP algorithm, BLUEText, will process unconstrained text within the XML messages sent to the XML-to-Ontology algorithm. The output of both sub-systems is integrated within the Fusion Component. Finally, the results of the integration are used to enable both of the upstream algorithms to process further, while simultaneously enabling downstream applications like EHR, CPOE, and Decision Support systems to function.

D.1.3 Outside Ontologies: BLU makes use of various outside ontologies to provide much of the knowledge base upon which it operates. More importantly, the Fusion Algorithm will rely upon Outside Ontologies to increase its knowledge of the corpus of medical data to further augment the accuracy of the Fusion Algorithm. Furthermore, it is very important that Fusion is able to take advantage of outside augmentations to its knowledge source in much the same way it is able to process the input from X2O and BLU.

i. Results: This is a significant demonstration of the viability of the extensibility promised by the Semantic Web's design and the conceived goals of the Fusion Project. As we've illustrated earlier in sections B7 and C4, semantic federation of knowledge is itself an unsolved but vitally important area of research. It is the only way we can hope to manage the onslaught of exponentially increasing data generated all across society, but particularly within the medical domain as it relates to patient health and safety.
ii. Problems: Access to useful Outside Ontologies is perhaps the biggest problem. However, the NLM publishes taxonomies (like UMLS, SNOMED CT, etc.), which we've previously been able to successfully translate into a true ontology. Additionally, there are many useful ontologies about location, demographics, measurements, drugs, etc. that we would like to leverage. We've had some past success incorporating a measurement ontology developed by NASA into BLU. We will have to analyze how far these transformed taxonomies and outside ontologies are from the desired input of Fusion.

iii. Alternate Approaches: We can abandon the use of Outside Ontologies without jeopardizing the core aim of integrating heterogeneous data from multiple sources.

iv. Transition: As explained in D.1.1.

D.1.4 Fusion Algorithm: At the core of the proposed architecture resides the Fusion Algorithm. Its purpose is to take in the pre-processed data from the several algorithms that have been previously designed (D.1.1 and D.1.2) to translate a specific format of data into an ontological representation. Fusion then takes these ontological representations and integrates them into a unified knowledge base of (all) patients' information. Then an EHR, CPOE, Decision Support, or other system can build on top of the unified view of the patient's information.

i. Results: Providing a single, unified, and well-defined knowledge base will enable existing solutions, and research into the next generation of solutions, to increase their accuracy. Additionally, production times, maintenance burdens, and other issues typically encountered when needing to cope with a diverse ecology of data islands should disappear.

ii. Problems: The core problem that Fusion needs to solve is referred to as semantic federation, or ontology integration. It is our belief that the diverse ontologies provided by trusted outside sources and those created by sub-systems (X2O and BLU in this case) can be integrated by analyzing the graph-based representation of the data. Graph analysis is important in a whole host of areas; one of the most well-known examples is internet search (i.e., Google's PageRank). It is our intent to use PageRank and K-Nearest Neighbors algorithms to analyze the graph structure that comprises the ontologies Fusion receives and to determine a measure of similarity between the different ontologies, based on both the information, or semantics, within the ontologies and the structure of the graph representing the connections between the atomic units of information inside the ontology (a simplified sketch of this idea follows this list). It is unlikely that an entirely automated algorithm will ever exist, nor that it would be advantageous. Because of the catastrophic risks (patient death or serious injury), medical experts will always be required to provide their consent to the results. However, medical experts are both rare and their time extremely valuable. To this end, the aim of Fusion is to vastly augment the productivity of the doctors and medical staff involved. This is also another reason for the necessity of a gold standard. We will be augmenting what a human could do if they had sufficient time, so we want to ensure that we are at least as good as what a doctor or nurse might arrive at. Without a human medical expert performing the same task as Fusion, it will be impossible to verify the success or accuracy of Fusion.

iii. Alternate Approaches: Fortunately, the space of graph algorithms is ripe with research. While the field of ontology integration is still very young, there are methods in the literature of graph algorithms (particularly in the area of webpage analysis and searching) that could be applied to this task. One interesting method has recently been applied to the area of web search that uses level statistics from a field called random matrix theory, originally designed to analyze quantum systems [6].

The new method gauges the importance of words in a document based on where they appear (their pattern of locality), rather than simply on how often they occur. This is precisely what is needed when integrating ontologies as well.

iv. Transition: This aim partially relies upon the completion of the D.2 aim; however, this is why we have medical experts (physicians) assisting with early verification testing of the results of the Fusion System.

v. Expertise: Several medical experts will be collaborating to help verify the results, in addition to the gold standard generated as a result of D.2. We have demonstrated, we hope, our expertise in the areas of data integration, ontologies, and data modeling.
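The sketch below, referenced in D.1.4.ii, illustrates the graph-analysis idea under our own simplifying assumptions: PageRank scores are computed over two small ontology graphs with networkx and compared across the concepts they share. The graphs, labels, and the cosine comparison are illustrative choices, not the final Fusion similarity measure.

```python
# Sketch: PageRank over two toy "ontology" graphs, then a cosine comparison
# of the scores of shared concepts. All graphs and labels are invented.
import math
import networkx as nx

def ontology_graph(edges):
    g = nx.DiGraph()
    g.add_edges_from(edges)
    return g

graph_a = ontology_graph([("Patient", "Encounter"), ("Encounter", "Diagnosis"),
                          ("Diagnosis", "Medication")])
graph_b = ontology_graph([("Patient", "Visit"), ("Visit", "Diagnosis"),
                          ("Diagnosis", "Medication")])

rank_a = nx.pagerank(graph_a)
rank_b = nx.pagerank(graph_b)

shared = sorted(set(rank_a) & set(rank_b))
va = [rank_a[concept] for concept in shared]
vb = [rank_b[concept] for concept in shared]
cosine = sum(x * y for x, y in zip(va, vb)) / (
    math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb)))
print("shared concepts:", shared, "cosine similarity:", round(cosine, 3))
```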

Summary: The successful completion of this aim will result in a novel mechanism for addressing the problems of data explosion, in both quantity and representation. This is a real rate limiter that is preventing technologies like EHR and CPOE systems from realizing their full potential to eliminate the vast majority of preventable medical errors, which would currently be the number three killer of Americans if they appeared on the CDC's annual list of leading causes of death.

D.2 To generate a gold standard: an evaluation criterion for judging success, not only to evaluate our own performance but also that of the many other current and future attempted solutions.

D.2.1 Choose Experts: The initial choice of 3 experts to integrate 100 records was made partially in accordance with choices made in related areas of Health Informatics, for instance studies on the inter-rater agreement of MeSH tagging by PubMed editors.

i. Problems: Recruiting medical experts is a notoriously difficult task; further, getting experts interested in spending the amount of time necessary to integrate 100 patient records will be a challenge. Fortunately, we are located in the Texas Medical Center in Houston, where there is a large pool of medical experts present.

ii. Alternate Approaches: Unfortunately, there are no good alternatives to using medical experts as our gold standard. It is a commonly held belief that medical experts such as doctors, nurses, and health informaticians are what a computerized implementation should be held to; Clinical Decision Support is a particularly salient example of where this occurs. There are, however, alternatives to creating a gold standard to evaluate our algorithm. Some of these statistics we would use anyway during the analysis of Aim 1. For instance, one of the central hypotheses is that Fusion will augment the information provided by the component pieces (X2O and BLU). Therefore, it is posited that more knowledge would be discovered on a second pass over the initial patient data by incorporating the results from Fusion. Analyzing how much new knowledge (if any) is discovered could be one metric.
iii. Transition: Much of the transition from choosing and recruiting the experts to actually having them perform D.2.2 will be educating them on the task and what is required of them.
iv. Expertise: Medical experts will be used to help select an appropriate cohort. Proposed members of the team have experience recruiting experts from past experiments.

D.2.2 Integrate and Extract: Fundamentally, the experts will be performing the same task as Fusion, but in their own way. They are free to integrate and extract concepts as they see fit. Their time spent at the computer integrating the EHR data will be recorded by video camera and also tracked within the program they will be using, TopBraid Composer. We have already written a plugin to the application that allows us to do this.

i. Results: Indeed, much like a psychologist gains insight into human behavior through observation, it is possible that we can glean important and useful algorithmic information from analyzing how the experts integrate. This should be a significant contribution beyond simply providing the gold standard. That is, little research has been done on how a human medical expert might integrate disparate medical data. We hope to gain both a gold standard and an insight into the human algorithms being employed by medical experts to integrate the data.
ii. Problems: Actually performing the task will rely upon our training and the experts' experience. If the experts lack sufficient medical experience to comprehend the underlying information in the data and transform it into comprehensible knowledge, they could skew the results in favor of our algorithm by bringing human performance artificially down. Likewise, if we fail to properly educate the experts on their task, a similar result would be observed. Partially, we hope that D.2.3 will help us discover problems with individuals, but it cannot uncover cross-cutting biases or problems.


iii. Alternate Approaches: There is really no alternative to using human medical experts, save for comparing the performance of Fusion against existing systems. However, most data integration systems are proprietary, and their vendors are uninterested in being declared inferior to another (rival) system. Therefore, getting access to another system would be difficult. BioSTORM is perhaps one of the few examples of a fairly successful, open-sourced system that we could use. This is another reason why having an independent gold standard backed by human experts can help put pressure on existing solutions: purchasers and users of such existing solutions can assess them prior to committing, which currently is not possible in any rigorous way.

iv. Transition: After this phase, a number of transition points exist. If D.2.3 shows the expected amount of inter-rater agreement and our own experts concur, there will be no need to repeat this phase. However, if there is some sort of systematic bias that has skewed the results, it would be necessary to repeat either D.2.2 or both D.2.1 and D.2.2. This would depend upon our analysis and our assessment of where the bias originates.

v. Expertise: As explained in D.2.1.

D.2.3 Assess Inter-rater Reliability: At this point, our statistician will analyze the agreement of the experts' attempts to integrate the patients' data. Additionally, our medical domain experts will assess how reasonable the results are.

i. Results: A measure of confidence, both objective and subjective, in the proposed gold standard (a sketch of one such agreement statistic follows this list).
ii. Problems: As explained in D.2.2.
iii. Alternate Approaches: Two approaches are being used, each with its own strengths and weaknesses.
iv. Transition: As explained in D.2.2.
v. Expertise: Our statistician has over 20 years of experience, while our medical experts each have over 10 years.
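As a concrete illustration of one possible agreement statistic, the sketch below computes Cohen's kappa for two raters over categorical decisions; the ratings are invented, and the study's actual analysis will be selected by our statistician and may use other measures suited to three raters.

```python
# Sketch: Cohen's kappa for two raters over categorical decisions.
# The categories here stand for the concept each expert assigned to the same
# record element; the data is invented for illustration.
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    freq1, freq2 = Counter(rater1), Counter(rater2)
    expected = sum(freq1[c] * freq2[c] for c in set(rater1) | set(rater2)) / (n * n)
    return (observed - expected) / (1 - expected)

rater1 = ["sex", "diagnosis", "medication", "diagnosis", "allergy"]
rater2 = ["sex", "diagnosis", "medication", "lab", "allergy"]
print(round(cohens_kappa(rater1, rater2), 3))  # 0.75 for this toy data
```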

D.2.4 Compare Experts to the Fusion Algorithm: This final portion will entail utilizing two measures of similarity. The first is our data modelers' assessment of both sets of results; they have extensive experience modeling data and analyzing ontologies. Secondly, the similarity methods developed for D.1.4 will be re-employed here. Not only will this provide a means of validation (because our similarity methods can be compared against human data modelers), but it will assist in systematically and objectively analyzing the results.
i. Problems: Discrepancies between the methods will help us to analyze where the failures are.
ii. Alternate Approaches: We will be using two different methods; it should not be necessary to use alternative approaches.

Summary: The successful completion of this aim will not only provide a gold standard for any medical data integration algorithm to use, but will also validate the novel Fusion algorithm for integrating heterogeneous data from disparate sources. Even in the unlikely event that we do not succeed in all of our aims, a great deal will be learned from this experiment. We will be collecting and analyzing how human experts integrate data, which could yield an interesting direction for future integration research. Further, simply having a gold standard against which to compare existing and future algorithms is a worthy aim in itself. One such benefit is holding companies who sell medical integration software accountable to the claims they make. Additionally, enabling true competition by enabling consumers to objectively compare competing and new solutions should exert its own evolutionary pressures. Even in failure, important lessons can be gleaned about how to approach future attempts at heterogeneous data integration. Most existing research does not engage the very diverse and often unharmonious ecology of realistic data. It would be surprising if all existing successful research in this area could withstand the amount of rigorous testing we will have employed. If Fusion fails, how it fails might be as interesting as how well it succeeds.

    D.3 Research Work Plan and Timeline

Timeline (Year 1, Q1-Q4; Year 2, Q1-Q4). X marks denote the quarters in which each activity is active:

IRB Submission & Review: X
D.0 Choose Representative Records: X
D.1.1 XML-to-Ontology: XX XX
D.1.2 BLUEText: XX XX
D.1.3 Outside Ontologies: X
D.1.4 Fusion Algorithm: XX XX XX
D.1.5 Recurrent Data to D.1.1 & D.1.2: XX X
D.2.1 Choose Experts: XX XX
D.2.2 Integrate and Extract: XX XX XX XX
D.2.3 Assess Inter-rater Reliability: XX XX
D.2.4 Compare Experts to Algorithm: XX

Human Subjects Protection: IRB approval will be required for the secondary data use and will be obtained from the University of Texas at Houston's Medical Center IRB.

    Y. Collaborators

Principal Investigator: Matthew Vagnoni, MS, brings expertise in data integration, ontologies, and underlying technologies such as programming, system architecture, etc.

    Investigators:

Parsa Mirhaji, MD, brings expertise in NLP (BLUEText), ontologies, data modeling, and the medical domain.

James Turley, RN, PhD, brings expertise in statistics, system evaluation, and the medical domain.

Jorge Herskovic, MD, PhD, brings expertise in information retrieval, clinical decision support, and the medical domain.

Min Zu, MD, MS, brings experience in both the medical domain and data modeling.