40
Structuring Medical Records with Apache Stanbol Rafa Haro, Senior Software Engineer, Athento Antonio Pérez Morales, Senior Software Engineer, Ixxus

Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

  • Upload
    vokiet

  • View
    215

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Structuring Medical Records with Apache Stanbol

Rafa Haro, Senior Software Engineer, Athento Antonio Pérez Morales, Senior Software Engineer, Ixxus

Page 2: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

• Committer, PMC Member @ Apache Stanbol, Apache ManifoldCF

• Topics: Document Analysis, NLP, Machine Learning, Semantic Technologies, ECM

• Committer @ Apache Stanbol, Apache ManifoldCF

• Topics: ECM, Semantic Search, ETL, Machine Learning

Page 3: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Apache Stanbol provides a set of reusable components for semantic content management. It extends existing CMSs with a number of semantic services.

CMS

Traditional Semantic

Page 4: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Software Architecture for Semantically Enabled CM and ECM systems

Page 5: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Apache Stanbol Story

• Started within FP7 European Project IKS (Interactive Knowledge Stack. 2009 - 2012)

• IKS project brought together an Open Source Community for Defining and Building Platforms in the Semantic CMS Space

• Incubated in November 2010

• Successfully promoted within CMS and ECM industry through IKS Early Adopters Program

• Graduated to Top-Level Apache Project in October 2012

Page 6: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

What is a Semantic CMS?

Traditional CMS

Atomic Unit: Document

Properties as meta-data (key-value schemas)

Keyword Search

Document Management Document Types

Document Workflow

Semantic CMS

Atomic Unit: Entity

Semantic meta-data (RDF)

Semantic Search

Knowledge Management Entity Management

Ontologies

Source: What Apache Stanbol Can Do for You?. Fabian Christ. ApacheCon Europe 2012

Page 7: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Key Points• Designed to bring Semantic Technologies to existing CMS

• Non-intrusive set of RESTful ‘Semantic’ Services

• Extremely Modular: Use only the modules you need

• Main Features: • Multilingual Content Enhancement: Structure Content through Semantic

Metadata

• Knowledge Bases Management

• Knowledge Models and Reasoning

• Semantic Indexing and Search

Page 8: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Stanbol Components• Stanbol components provide:

• RESTful API • Java APIs and OSGi services

• Stanbol components do NOT depend on each other • however they can be easily combined to

www.iks-project.eu

Page:

Apache Stanbol Service Layer

Apache StanbolComponent Layer

ApacheStanbol

Reasoners

ApacheStanbol

Enhancer

ApacheStanbol Rules

ApacheStanbol

Ontology Manager

ApacheStanbol

ContentHub

ApacheStanbol

EntityHub

ApacheStanbol

FactStoreStanbolEnhancement

Engines

VIE - User Interface LayerVIE VIE

Widgets

ApacheStanbol

CMS Adapter

Copyright IKS Consortium

6

Service-Oriented View

Page 9: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Stanbol Components (II)• Enhancer: Extracts Knowledge from unstructured parsed content

• EntityHub: Manage Domain Entities and Topics (Knowledge Bases)

• ContentHub: Semantic Indexing / Search over your - semantic enhanced - Content

• CMS Adapter: Sync. your CMS with Apache Stanbol (JCR/CMIS)

• Ontology Manager: Manage you formal Domain Knowledge

• Reasoners & Rules: Apply Domain Knowledge to improve / validate extracted Information. Refactor / refine knowledge to align it to public schemas such as schema.org

Page 10: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Built on Top of Apache….

• Apache Felix as OSGi environment

• Apache Sling launchers and OSGi Tools

• Apache Maven for building

• Apache Clerezza as RDF Framework

• Apache Jena as TripleStore

• Apache Solr for Knowledge Bases Management

• Apache Tika for converting input

• Apache OpenNLP for NLP Processing

Page 11: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Integration Scenarios

Source: What Apache Stanbol Can Do for You?. Fabian Christ. ApacheCon Europe 2012

• Stand-Alone Server (Stanbol Launchers)

• Web Application (Servlet-Container)

• Embedded within an OSGi environment

Page 12: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Project Current Status

Contributions (commits) to Trunk Since Incubation

Incubation (Nov 2010)

Apache Stanbol 0.9.0-incubating

(Aug 2012)

Graduation (October 2012)

IKS Project Ending (Dec 2012)

Apache Stanbol 0.12.0

(March 2014)

Apache Stanbol 1.0.0

(October 2016)

Page 13: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Project Current Status (II)

• 22 PMC Members (Last Addition Jul 2016) • 26 Committers (Last Addition May 2015)

• 3-5 active committers last 2 years • [email protected]: 228 subscribers

• Activity has been gradually decreasing • 3 major releases

Source: Apache Stanbol Committee Report Helper (https://reporter.apache.org/?stanbol)

Page 14: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Stanbol Enhancer

RDF

Page 15: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Stanbol Enhancer (II)

Page 16: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Stanbol Enhancer (III)

Page 17: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Stanbol Enhancement Chains• Define how Content is processed by the Enhancer through an ExecutionPlan • Different Implementations:

• ListChain: in order sequential enhancement engines execution. Parallel Execution of engines not supported

• WeightedChain: ExecutionPlan is calculated using the engines order metadata. Parallel Execution of engines allowed

• API: • /enhancer: executes the default chain • /enhancer/chain/{chain-name}: executes a concrete named chain • /enhancer/engine/{engine-name}: executes a concrete named engine

Page 18: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Current Enhancement Engines• Preprocessing

• Tika Engine • content type detection • text extraction from several document formats • metadata extraction from several document formats

• Natural Language Processing • Language Detection (different implementations) • Sentence Detection (OpenNLP, SmartCN, REST) • Tokenizer (OpenNLP, SmartCN, REST) • POS Tagging (OpenNLP, REST) • Chunking (OpenNLP, REST) • NER (OpenNLP, OpenCalais, REST)

• Entity Linking • Named Entity Linking • EntityHub Linking Engine • FST (Lucene Finit State Transducer) Linking Engine • Entity Co-mention • Commercial Engines (OpenCalais, Zemanta, CELI…)

• Sentiment Analysis • Disambiguation

• DBPedia Spotlight • Solr MLT based

• PostProcessing: • Dereferencing

Page 19: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Stanbol EntityHub

Page 20: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Stanbol EntityHub (II)• Manage Multiple Entity Sources (Knowledge Bases)

• Allows Fast Entity-Lookup using Apache Solr

• Referenced Site (Remote LD + Local Caches) Vs Managed Site (Entity CRUD Api over manually configured Sites)

• API: • Query for Entities (used by Entity Linking Engines)

• CRUD for Managed Sites • LDPath support for:

• Graph Path Retrieval (Used for dereferencing) • Schema Translation • Simple Reasoning

schema:name = rdfs:label[@en];

friend-names = foaf:knows/foaf:name

curl -X POST -d "name=lyon&limit=10" \ http://localhost:8080/entityhub/site/dbpedia/find

Page 21: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Use Case: Hexin Project - Structuring Medical Records

• R&D Project for Sergas (Galician Public Health Office) • Clinical Data Analysis Platform for supporting:

• Clinical Assistance • Epidemiology studies • Medical Research

• Big Data approach for analyzing both structured historical clinical data and unstructured medical records

• Medical Records are written in Spanish and Galician

Page 22: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin: Architecture

Validation AnalysisPatient

Data Source

URX

ETL

BIG DATA (HDFS +

HIVE)

Event Detection Process

Cassandra

Reference Cases Detection Process

New Case

BIPatientId Date Structured Events Semantic Events Symptoms: • Cough • Unrest

Unrest Cough Fever>38

Rules

Page 23: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin: Semantic Tagging

Page 24: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin: Objective

“Paciente diabético desde los 5 años y con EPOC moderada grado 2 de la GOLD”

Page 25: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin:Solution Design

• Structure Medical Records using Apache Stanbol Enhancer • Custom Ontology:

• Symptoms • Diseases • Diagnosis Tests • Family and Personal History

• Custom Enhancement Chain: • Language Detection > NLP > Entity Linking > Negation

Detection > Fact Extraction

Page 26: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin: Ontology

Page 27: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin: Ontology Indexing

• For supporting the Entity Linking process against Hexin Ontology, an EntityHub site must be created

• 2 options: • ManagedSite: full CRUD storage <-> DYNAMIC • ReferencedSite: READ-ONLY remote site + local index

• Stanbol EntityHub Indexing Tool: • RDF —> JenaTDB —> Solr Index

• Configure Custom Namespaces, Mappings and Properties • Generates an OSGi Bundle with the Yard and YardSite default

configurations • Copy the index to Stanbol /datafiles folder and install the bundle

using Apache Felix OSGi Web Console

hexin:*hexin:label > rdfs:label

Page 28: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

NegexFact Extract.Hexin Linking

Hexin: Enhancement Chain

OpenNLP-ChunkerOpenNLP-POSOpenNLP-TokenOpenNLP-Sent.Lang. Detect.

Custom Hexin Engine. Implemented for the project

Entity Linking Engine. Available in Stanbol with a Custom Configuration for this use case

NLP Engines. Available in Stanbol. Default Configuration

Pre-Processing Engine. Available in Stanbol

Page 29: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin: Linking

Page 30: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin: Linking (II)

Page 31: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin: Custom Engines

@Component@Servicepublic class MyEngine implements EnhancementEngine {

@Activate public void activate(ComponentContext c) { // initialize, configure, ... }

public int canEnhance(ContentItem item) { if(...item matches our expectations...) { return ENHANCE_SYNCHRONOUS; } else { return CANNOT_ENHANCE; } }

public void computeEnhancements(ContentItem item) { // run the engine and add results to item’s // RDF graph based on the item’s InputStream }}

maven-bundle-plugin

adds OSGI metadata

Maven build

maven-scr-pluginadds services metadata

registered by OSGi

MyEngineService

MANIFEST.MF

OSGi metadata

OSGi bundle

Install in Stanbol no restart

needed

Page 32: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

NLP at Apache Stanbol

Page 33: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

NLP at Apache Stanbol (II)• Browsable Map with Spans

• Spans sorted by Natural Order • Iterator based API that allows

concurrent Modifications • Annotations supported at Spans Level

• POS Annotation • PosTag

tag (e.g. NE) lexical category (e.g. Noun)

• Phrase Annotation (chunks) • PhraseTag

tag (e.g. NP) lexical-category (e.g. NounPhrase)

• Sentiment Annotation • SentimentTag:: Double

Stanbol is an Amazing Tool

Sentence

Chunk

Token

Span Types: • Token • Chunk • Sentence • Text Section • Analyzed Text

Page 34: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin Custom Engine: Negex

• Context/Negex: Algorithm for Negation Detection • Based on Triggers-Terms + Regex

Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform. Oct 2001;34(5):301-310.

public abstract class AbstractNegexDetector implements NegexDetector {

@Overridepublic Set<IRI> detectNegations(String language, Graph metadata, AnalysedText at) throws NegexException{}

protected abstract boolean isNegated(String language, String concept, String sentence);

}

Page 35: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin Custom Engine: Negex (II)• Triggers Types:

• Pre-condition Negation terms (e.g. absence of) • Pseudo Negation terms (e.g. no increase) • Pre-condition possibility phrase (e.g. rule him out) • Post-condition negation terms (e.g. unlikely) • Termination terms (e.g. but, however)

• Implementation available under Apache License 2.0 • Engine Implementation Challenges:

• Entity Annotations as Targets • AnalyzedText and EntityAnnotations relationships are currently obfuscated • GLUE CODE for locating Entity Annotations Spans by using START - END Text

Annotations properties • Once Entity Annotation sentence is located, is used as context along with the Entity

surface-form (mention) for applying the algorithm • Negation Returned as a Custom Property for the TextAnnotation (negated = True or False)

Page 36: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin Custom Engine: Fact Extraction

“Paciente diabético desde los 5 años y con EPOC moderada grado 2 de la GOLD”

Page 37: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin Custom Engine: Fact Extraction (II)

• In-Context Entity Fact Extraction • Facts returned as Entity RDF Metadata like the rest of Entity

Properties • Different Implementations of Context (all extracted from

AnalyzedText structure) • Sentence Context (default and usually enough) • Window of Text Context • Paragraph Context

• Rule Based Approach: • Regex over RAW Text or POS tags Sequence

• ENTITY reserved word -> OR expression for all ENTITY labels

Page 38: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin Custom Engine: Fact Extraction (III)

• Supported Expressions: • diabetes|diabético|DM desde los N años • diabetes|diabético|DM a los N años • Debut diabetes|diabético|DM a los N años

Page 39: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Hexin Custom Engine: Fact Extraction (IV)

• POS based Rules: Diabetes diagnosed when he was 5 years old

NNS VB WRB PRP VBD CD NNS JJ ENTITY \s VB * VB[be] (CD) years old or simply

ENTITY \s VB * VB[be] (CD)

Page 40: Structuring Medical Records with Apache Stanbol · PDF fileStructuring Medical Records with Apache Stanbol ... Stanbol, Apache ManifoldCF ... Cassandra Reference Cases

Thanks for your attention!