Termite & Termite Expressions

DESCRIPTION

Termite is a next-generation text-mining and semantic markup engine for life sciences. Designed for researchers and content providers alike, it enhances unstructured text and finds the topics and relationships that matter!

© 2013 SciBite Limited. © 2013 SciBite Limited

Termite and Termite-Express

SciBite

Powerful Text-Indexing For Life Sciences

What’s The Issue?

So much public, private, professional textual content available…

Standard text-search tools don’t help much because they aren’t semantic…

Semantic means searching by “thing” not by synonym (keyword)…

Semantic means more accurate and complete results!

Users & Applications

Researchers

You Are: A life science professional whose job involves hunting for key facts in literature, patents, grants and internal documents

We Offer: The ability to data-mine millions of documents to identify critical mentions and relationships

Enterprise Search

You Are: A company wishing to make its internal search portals more accurate

We Offer: The ability to enhance your existing search tool to find key biological entities more accurately, making your users happier and more productive!

Content Providers

You Are: Anyone who produces or supplies textual content in the life sciences

We Offer: The opportunity to enrich your content for search and navigation, and significantly increase its value to your consumers

Selecting A Semantic Recognition Engine

Choices

You want something that:

- Is commercially supported
- Is highly configurable
- Is accurate
- Is scalable (millions of documents)
- Is fast (MB/sec processing)
- Is flexible (abstracts to full documents)
- Supports batch & on-demand (web service) processing
- Is tuned to life sciences data
- Comes supplied with highly curated thesauri
- Is comfortable with the ambiguity of life science texts
- Goes beyond recognition to identify critical phrases in a document

Termite meets all these criteria

Semantic Entity Recognition: Basics

Two main approaches:

- Thesaurus: match text to a list of known synonyms
- Algorithmic: try to identify an entity synonym “on the fly”

Termite uses both mechanisms to identify entities with high accuracy

Thesauri are often an afterthought from tool providers, pointing to free public sources

While these are good starting points, they will deliver variable results

Our view: Commercial grade text-mining requires commercial grade thesauri
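
The two approaches can be sketched in a few lines of Python. This is only an illustration, not Termite’s implementation: the two-entry thesaurus, the GENE_LIKE rule and the find_entities function are hypothetical, and the identifiers are borrowed from the annotated example later in this deck.

```python
import re

# Hypothetical mini-thesaurus: lower-cased synonym -> (entity type, identifier).
THESAURUS = {
    "simvastatin": ("DRUG", "CHEMBL1064"),
    "heme oxygenase-1": ("GENE", "HMOX1"),
}

# Assumed "on the fly" rule: 2-5 capital letters followed by 1-2 digits
# looks like a gene symbol (e.g. BRCA1). Real rules would be far richer.
GENE_LIKE = re.compile(r"\b[A-Z]{2,5}\d{1,2}\b")


def find_entities(text):
    """Return (offset, surface form, type, identifier, method) tuples."""
    hits = []
    lowered = text.lower()  # assumes lower-casing keeps offsets (true for ASCII)
    # Thesaurus approach: look up every known synonym in the text.
    for synonym, (etype, ident) in THESAURUS.items():
        for m in re.finditer(re.escape(synonym), lowered):
            hits.append((m.start(), text[m.start():m.end()], etype, ident, "thesaurus"))
    # Algorithmic approach: recognise unseen names from their shape alone.
    for m in GENE_LIKE.finditer(text):
        hits.append((m.start(), m.group(), "GENE", None, "algorithmic"))
    hits.sort(key=lambda hit: hit[0])
    return hits


for hit in find_entities("Simvastatin induces heme oxygenase-1 and BRCA1"):
    print(hit)
```

The thesaurus hits carry a curated identifier, while the algorithmic hit has none. Curated synonyms are what make a match usable downstream.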

Our Thesauri Products

Thesauri are at our heart, not an afterthought:

- Combine crowd-sourced and professional curation with experienced biomedical/pharma researchers
- Thesauri are built to tackle real world text
- Integration-ready: We use public identifiers by default
- Include mappings to other resources and many are organised via ontologies

Some Examples

Human Gene: We have over 4.5 million synonyms, and when combined with our on-the-fly algorithms, we match over 30 million gene name mentions

Indication (Disease): We have extensive coverage of over 5000 of the most important human diseases, along with over 63,000 manually verified synonyms

Protein Type: Recognises concepts such as “interleukin”, “cytokine”, “ion channel”, rather than specific genes. Arguably these terms are used more often in biomedical text than gene names, yet such entities are very poorly identified by other tools

Drugs: Recognise over 1 million synonyms covering >60,000 launched and research therapeutics. Updated on a daily basis from our internet-wide scanning at SciBite.com

We also cover: Adverse Events, Cells, Tissues, Species (and species-specific gene thesauri), Companies, Micro RNAs, Mutations, Hormones & Messengers, Investigative procedures (e.g. Biopsy), Laboratory Chemicals, Laboratory Procedures, Restriction Enzymes, Plasmids, General Laboratory Products & more!

Synonyms Aren’t Easy...

Biomedical terms are very ambiguous:

- GSK (GlaxoSmithKline or Glycogen Synthase Kinase?)
- Hedgehog (Animal or developmental regulator protein?)
- Android (The FDA approved drug or the Phone OS?)
- Transgene (The company or the technique?)
- MCD (macular dystrophy (corneal) or malformations of cortical development?)
- Pacific (Pacific Biotechnology or the ocean?)
- EGFR (The kinase receptor or estimated glomerular filtration rate?)

Ambiguity: Termite’s Strength

Termite’s engine and thesauri understand which synonyms are Fairly Dependable (e.g. Pfizer), Often Ambiguous (e.g. MCD) or correct but very dangerous (e.g. Pacific)

As a document is analysed, Termite uses all of:

- Synonym Range: Which synonyms are used, and how ambiguous are they as a whole, not just one-by-one?
- Synonym Metrics: Frequency and position of synonyms, relationship of abbreviations and full terms
- Document context: Does the document mention key terms (but not synonyms) that increase or decrease the chances that the ambiguous synonym is correct?
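
One way to picture the document-context signal is a simple score adjustment. The sketch below is an assumption-laden toy, not Termite’s algorithm: the tier priors, the step size, the context word lists and the confidence function are all invented for illustration, and it ignores the synonym range and synonym metrics signals entirely.

```python
import re

# Assumed prior confidence for each reliability tier named on the slide.
PRIOR = {"dependable": 0.9, "ambiguous": 0.5, "dangerous": 0.1}

# Hypothetical thesaurus entries: a tier plus context words that raise or
# lower confidence. None of these values come from Termite itself.
SYNONYMS = {
    "pfizer": {"tier": "dependable", "boost": set(), "penalise": set()},
    "pacific": {
        "tier": "dangerous",
        "boost": {"biotechnology", "company"},
        "penalise": {"ocean", "coast"},
    },
}


def confidence(synonym, document, step=0.3):
    """Score a synonym hit between 0 and 1 using document context."""
    entry = SYNONYMS[synonym.lower()]
    words = set(re.findall(r"[a-z]+", document.lower()))
    score = PRIOR[entry["tier"]]
    score += step * len(words & entry["boost"])     # supporting key terms
    score -= step * len(words & entry["penalise"])  # contradicting key terms
    return max(0.0, min(1.0, score))


# Two made-up documents that both contain the dangerous synonym "Pacific".
print(confidence("Pacific", "The company Pacific Biotechnology was mentioned"))
print(confidence("Pacific", "Storms crossed the Pacific ocean towards the coast"))
```

The first document lifts the dangerous synonym well above its prior, and the second pushes it down to zero. That is the behaviour that lets an ambiguous synonym stay in the thesaurus.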

Bottom Line

Termite allows you to use ambiguous synonyms in your thesauri to increase recall without returning a lot of rubbish!

β-actin

actin-β

b-actin

β actin

Beta actin

ß-actin

Actin, beta

b- actin

The German sharp S (ß) isn’t beta, but that doesn’t stop people using it

Termite handles Greek characters with ease, including HTML entity codes!
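
To show why these spellings can be treated as one term, here is a minimal normalisation sketch. It is not how Termite works internally: the canonical_key function, the character maps and the token-sorting trick are assumptions made for this example.

```python
import html
import re

# Characters people use for beta, mapped to a spelled-out name. "ß" is the
# German sharp S, not beta, but it is included because writers misuse it.
GREEK_LIKE = {"β": "beta", "ß": "beta", "α": "alpha", "γ": "gamma"}

# Assumption: a lone Latin "b" stands in for beta. This is only safe for
# terms already known to take a Greek letter, such as actin.
SINGLE_LETTER = {"b": "beta"}


def canonical_key(term):
    """Map spelling variants of a term to one order-insensitive key."""
    text = html.unescape(term).lower()  # "&beta;" -> "β"
    for char, name in GREEK_LIKE.items():
        text = text.replace(char, f" {name} ")
    tokens = re.findall(r"[a-z0-9]+", text)
    tokens = [SINGLE_LETTER.get(tok, tok) for tok in tokens]
    # Sorting makes "Actin, beta" and "beta actin" produce the same key.
    return " ".join(sorted(tokens))


variants = ["β-actin", "actin-β", "b-actin", "β actin", "Beta actin",
            "ß-actin", "Actin, beta", "b- actin", "&beta;-actin"]
print({canonical_key(v) for v in variants})
```

Every variant on the slide, plus the HTML entity form, collapses to the same key, so a single thesaurus entry can cover them all.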

Muscarinic M1 Receptor(s)

Muscarinic (M1) Receptor

M1 Muscarinic Receptor

Muscarinic Receptor M1

Muscarinic Receptors M1

Muscarinic Receptor type M1

The usual variations….

Termite handles variations with ease

M1/M2 muscarinic receptors

H1 and H2 Histamine Receptors

Kinases ERK1 and 2

ERK1/2

Termite handles “broken” phrases with ease
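
A “broken” phrase is one where the second name is only implied. The sketch below shows one possible way to expand such pairs; the SHORTHAND regular expression and the expand function are hypothetical and cover only the two-item cases listed on this slide.

```python
import re

# A symbol plus number, a joiner ("/" or "and"), then a second number whose
# symbol may be omitted (as in "ERK1/2" or "ERK1 and 2").
SHORTHAND = re.compile(r"\b([A-Za-z]+)(\d+)\s*(?:/|and)\s*([A-Za-z]*)(\d+)\b")


def expand(phrase):
    """Return the two full names implied by a shortened pair, if present."""
    m = SHORTHAND.search(phrase)
    if m is None:
        return []
    prefix, first, second_prefix, second = m.groups()
    # Reuse the first symbol when the second one was left out.
    return [prefix + first, (second_prefix or prefix) + second]


for phrase in ["M1/M2 muscarinic receptors", "H1 and H2 Histamine Receptors",
               "Kinases ERK1 and 2", "ERK1/2"]:
    print(phrase, "->", expand(phrase))
```

This only recovers the symbols. Attaching the shared head noun (“muscarinic receptors”) to each one would be a further step.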

MANY MORE EXAMPLES

HTTP://WWW.SLIDESHARE.NET/SCIBITELY/TERMITE-DEALING-WITH-REAL-WORLD

[Pie chart] Accuracy of Termite on a random selection of 400 entries from the BioCreative Gene-Mention task

Legend: Correct, Disagreement, Incorrect (values shown: 88%, 7%, 5%)

http://biocreative.sourceforge.net/

WebConnect – “Termite Live”

http://scibite.com/site/p3/webconnect.html

TEXPRESS

Going Beyond Recognition

NER, Patterns, NLP

- Termite is a Named Entity Recognition (NER) engine – it finds mentions of “things” in text
- Natural Language Processing (NLP) is an area of linguistics that seeks to develop a computer-understandable representation of human text
- NLP is both powerful and complex. Human language can vary greatly, which leaves many facets to consider in NLP results
- Critically, many use-cases do not require full NLP; users simply wish to “identify any relationships between entities in the text”
- Texpress uses “patterns” to achieve this

An Example

Use case: Scan an input set of documents, identify disease-gene relationships within the text and output these to a file for downstream processing

We supply a simple pattern, Indication{0,3}(Gene|Protein_Class), which means:

- Find an indication
- Followed by 0-3 other words
- And then a gene or protein class

It’s critical to use “(Gene|Protein_Class)” when looking for gene/protein info, as classes are often used (see purple text below).

For example, on the text: “Simvastatin induces heme oxygenase-1 expression but fails to reduce inflammation in the capsule surrounding a silicone shell implant in rats”

[DRUG:CHEMBL1064]simvastatin [VERB:!INDUCES]induces [GENE:HMOX1]heme _oxygenase _1 [VERB:!EXPRESSION]expression but {NEG}fails to [VERB:!REDUCE]reduce [INDICATION:D007249]inflammation in the capsule surrounding a silicone shell implant in [ORG:RAT]rats
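
The logic of a pattern such as Indication{0,3}(Gene|Protein_Class) can be sketched over a list of already-tagged tokens. This is a simplified reading of the pattern, not the TExpress matcher: the match_pattern function, the token representation and the hand-tagged sentence are all hypothetical.

```python
def match_pattern(tokens, first_type, max_gap, last_types):
    """Find first_type, then at most max_gap other tokens, then one of last_types.

    tokens is a list of (text, entity_type) pairs; entity_type is None for
    ordinary words.
    """
    matches = []
    for i, (text, etype) in enumerate(tokens):
        if etype != first_type:
            continue
        # Candidates sit 1 .. max_gap + 1 positions after the first entity.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            if tokens[j][1] in last_types:
                matches.append((text, tokens[j][0]))
                break
    return matches


# Hypothetical hand-tagged sentence:
# "inflammation is driven by cytokine signalling"
tagged = [
    ("inflammation", "INDICATION"), ("is", None), ("driven", None),
    ("by", None), ("cytokine", "PROTEIN_CLASS"), ("signalling", None),
]

print(match_pattern(tagged, "INDICATION", 3, {"GENE", "PROTEIN_CLASS"}))
print(match_pattern(tagged, "INDICATION", 3, {"GENE"}))
```

The first call finds the inflammation/cytokine pair; the second, restricted to genes only, finds nothing. That is the reason for including the protein class alternative in the pattern.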

Identifying Causal Relationships

E.g. we want to look for drugs that treat Lymphocytic Choriomeningitis Virus (LCMV)

We use the pattern: DRUG.{0,1}:treat.{0,1}:INDICATION(D001117)

Which is translated as:

- Find any drug in close proximity to the verb “treat”
- Followed closely by the specific indication (D001117 is the ID for LCMV)

From the following text, we obtain the computer-readable result below:

“To investigate its therapeutic potential, we used rapamycin to treat Lymphocytic Choriomeningitis Virus (LCMV)-infected perforin-deficient (Prf1(-/-)) mice according to a well-established model of HLH”

[DRUG:CHEMBL413]rapamycin to [VERB:!TREAT]treat [D001117]lymphocytic _choriomeningitis _virus
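
For downstream processing, a result line like this can be turned into records. The parser below is a sketch that assumes the bracket notation is exactly as printed on these slides ([TYPE:ID] or [ID] before the text, with “_” joining the words of a multi-word entity); the ANNOTATION expression and parse_annotations function are not part of TExpress.

```python
import json
import re

# [TYPE:ID] or [ID], then the surface form; extra words of a multi-word
# entity are assumed to be joined with "_" as printed on the slides.
ANNOTATION = re.compile(r"\[(?:([A-Z]+):)?(!?\w+)\](\S+(?:\s*_\S+)*)")


def parse_annotations(line):
    """Turn one annotated line into a list of dictionaries."""
    records = []
    for etype, ident, surface in ANNOTATION.findall(line):
        records.append({
            "type": etype or None,                     # None when no type is given
            "id": ident,
            "text": re.sub(r"\s*_", " ", surface),     # "_" joiners become spaces
        })
    return records


result = ("[DRUG:CHEMBL413]rapamycin to [VERB:!TREAT]treat "
          "[D001117]lymphocytic _choriomeningitis _virus")
print(json.dumps(parse_annotations(result), indent=2))
```

Unannotated words such as “to” are skipped, so what remains is the drug, the verb and the indication.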

Other features

Extension (pattern will match multiple entities in a list): <Indication><gene> will find all genes in “Cancer due to mutations in p53, SCA1 and BRCA1”

Negativity: TExpress will note where the extracted phrase contains negative keywords or sentiment

Verb Extraction: Identify causal/action relationships and return the verb used, e.g. <gene> <any_verb> <gene> on “p53 binds mdm2” => binds

Auto-continuation: We’ll match multiple entities of the same type in a list in the pattern (e.g. matching both drugs in the phrase “cancer can be treated with paclitaxel and bortezomib”) using an “<INDICATION><DRUG>” pattern
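
List handling of this kind can be pictured as “keep going while the next item is the same type”. The sketch below is one possible reading, not the TExpress implementation: the collect_list function, the connector set and the hand-tagged tokens are assumptions.

```python
# Words assumed to link the items of a list.
CONNECTORS = {",", "and", "or"}


def collect_list(tokens, start):
    """Collect the entity at start plus same-type entities listed after it."""
    wanted = tokens[start][1]
    found = [tokens[start][0]]
    for text, etype in tokens[start + 1:]:
        if etype == wanted:
            found.append(text)
        elif text.lower() not in CONNECTORS:
            break  # anything else ends the list
    return found


# Hypothetical hand-tagged fragments of the two phrases quoted above.
genes = [("p53", "GENE"), (",", None), ("SCA1", "GENE"),
         ("and", None), ("BRCA1", "GENE")]
drugs = [("paclitaxel", "DRUG"), ("and", None), ("bortezomib", "DRUG"),
         ("in", None), ("trials", None)]

print(collect_list(genes, 0))
print(collect_list(drugs, 0))
```

Once the pattern has matched the first gene or drug, the remaining list members come along without the pattern having to spell them out.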

Why TExpress?

- Built on Termite with all its advantages (quality thesauri, ambiguity processing, coverage)
- Simple patterns, easy to create and understand
- High performance/scalability (around 10% slower than Termite alone)
- Supports narrow focus (e.g. “<Gene1> inhibits <Gene2>”) and wide focus (e.g. “<Gene1> <any_verb> <Gene2>”) relationships
- Simple JSON, TSV or XML output

Want to know more?

Ask us for a demo today!

Email: info@scibite.com

Twitter: @scibitely

Call Us: +44 (0)20 8819 2776
