Termite & Termite Expressions

DESCRIPTION

Termite is a next-generation text-mining and semantic markup engine for life sciences. Designed for researchers and content providers alike, it enhances unstructured text and finds the topics and relationships that matter!

Page 1: Termite & Termite Expressions

© 2013 SciBite Limited

Termite and Termite-Express

SciBite

Powerful Text-Indexing For Life Sciences

Page 2: Termite & Termite Expressions

What’s The Issue?

So much public, private, professional textual content available…

Standard text-search tools don’t help much because they aren’t semantic…

Semantic means searching by “thing” not by synonym (keyword)…

Semantic means more accurate and complete results!

Page 3: Termite & Termite Expressions

Users & Applications

Researchers

You Are: A life science professional whose job involves hunting for key facts in literature, patents, grants and internal documents

We Offer: The ability to data-mine millions of documents to identify critical mentions and relationships

Enterprise Search

You Are: A company wishing to make its internal search portals more accurate

We Offer: The ability to enhance your existing search tool to find key biological entities more accurately, making your users happier and more productive!

Content Providers

You Are: Anyone who produces or supplies textual content in the life-sciences

We Offer: The opportunity to enrich your content for search and navigation and significantly increase its value to your consumers

Page 4: Termite & Termite Expressions

Selecting A Semantic Recognition Engine

Choices

You want something that:

Is commercially supported
Is highly configurable
Is accurate
Is scalable (millions of documents)
Is fast (MB/sec processing)
Is flexible (abstracts to full documents)
Supports batch & on-demand (web service) processing
Is tuned to life sciences data
Comes supplied with highly curated thesauri
Is comfortable with the ambiguity of life science texts
Goes beyond recognition to identify critical phrases in a document

Termite meets all these criteria

Page 5: Termite & Termite Expressions

Semantic Entity Recognition: Basics

Two main approaches:
  Thesaurus: match text to a list of known synonyms
  Algorithmic: try to identify an entity synonym “on the fly”

Termite uses both mechanisms to identify entities with high accuracy

Thesauri are often an afterthought from tool providers, pointing to free public sources

While these are good starting points, they will deliver variable results

Our view: Commercial grade text-mining requires commercial grade thesauri
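
To make the two approaches concrete, here is a minimal Python sketch. It is not Termite's implementation: the thesaurus entries, the `GENE_LIKE` rule and all function names are assumptions chosen purely for illustration.

```python
import re

# Hypothetical mini-thesaurus: lower-cased synonym -> entity identifier.
THESAURUS = {
    "heme oxygenase-1": "HMOX1",
    "hmox1": "HMOX1",
    "p53": "TP53",
}

# Hypothetical "on the fly" rule: 2-5 capital letters followed by digits (e.g. BRCA1).
GENE_LIKE = re.compile(r"\b[A-Z]{2,5}\d+\b")


def thesaurus_matches(text):
    """Thesaurus approach: return (synonym, identifier) pairs found in the text."""
    lowered = text.lower()
    found = []
    for synonym, identifier in THESAURUS.items():
        # Require non-word characters on both sides so "p53" does not match inside "xp53x".
        if re.search(r"(?<!\w)" + re.escape(synonym) + r"(?!\w)", lowered):
            found.append((synonym, identifier))
    return found


def algorithmic_matches(text):
    """Algorithmic approach: return gene-like tokens the thesaurus does not know."""
    return [m.group(0) for m in GENE_LIKE.finditer(text)
            if m.group(0).lower() not in THESAURUS]


def recognise(text):
    """Combine both mechanisms, keeping their results separate."""
    return {"thesaurus": thesaurus_matches(text),
            "algorithmic": algorithmic_matches(text)}
```

For example, `recognise("Mutations in p53 and BRCA1 alter heme oxygenase-1 levels")` finds "p53" and "heme oxygenase-1" through the thesaurus and "BRCA1" through the pattern rule. A real engine would need far richer rules and curated synonym lists than this toy.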

Page 6: Termite & Termite Expressions

Our Thesauri Products

Thesauri are at our heart, not an afterthought:

Combine crowd-sourced and professional curation with experienced biomedical/pharma researchers
Thesauri are built to tackle real world text
Integration-ready: We use public identifiers by default
Include mappings to other resources and many are organised via ontologies

Page 7: Termite & Termite Expressions

Some Examples

Human Gene
We have over 4.5 million synonyms, and when combined with our on-the-fly algorithms, we match over 30 million gene name mentions

Indication (Disease)
We have extensive coverage of over 5000 of the most important human diseases, along with over 63,000 manually verified synonyms

Protein Type
Recognises concepts such as “interleukin”, “cytokine”, “ion channel”, rather than specific genes. Arguably these terms are used more often in biomedical text than gene names, yet such entities are very poorly identified by other tools

Drugs
Recognise over 1 million synonyms covering >60,000 launched and research therapeutics. Updated on a daily basis from our internet-wide scanning at SciBite.com

We also cover: Adverse Events, Cells, Tissues, Species (and species-specific gene thesauri), Companies, Micro RNAs, Mutations, Hormones & Messengers, Investigative procedures (e.g. Biopsy), Laboratory Chemicals, Laboratory Procedures, Restriction Enzymes, Plasmids, General Laboratory Products & more!

Page 8: Termite & Termite Expressions

Synonyms Aren’t Easy...

Biomedical terms are very ambiguous

GSK (GlaxoSmithKline or Glycogen Synthase Kinase?)
Hedgehog (Animal or developmental regulator protein?)
Android (The FDA approved drug or the Phone OS?)
Transgene (The company or the technique?)
MCD (macular dystrophy (corneal) or malformations of cortical development?)
Pacific (Pacific Biotechnology or the ocean?)
EGFR (The kinase receptor or estimated glomerular filtration rate?)

Page 9: Termite & Termite Expressions

Ambiguity: Termite’s Strength

Termite’s engine and thesauri understand which synonyms are Fairly Dependable (e.g. Pfizer), Often Ambiguous (e.g. MCD) or correct but very dangerous (e.g. Pacific)

As a document is analysed, Termite uses:
  Synonym Range: Which synonyms are used, and how ambiguous are they as a whole, not just one-by-one?
  Synonym Metrics: Frequency and position of synonyms, relationship of abbreviations and full terms
  Document context: Does the document mention key terms (but not synonyms) that increase or decrease the chances that the ambiguous synonym is correct?
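
One way to picture how such signals could be combined is the toy scoring sketch below. The slides do not describe Termite's actual scoring, so every number, word list and name here (`PRIOR`, `SUPPORTING`, `OPPOSING`, `score_entity`) is a hypothetical assumption.

```python
# Hypothetical prior confidence for each synonym class named above.
PRIOR = {"dependable": 0.9, "ambiguous": 0.5, "dangerous": 0.1}

# Hypothetical thesaurus entry for one entity (the company) with synonyms of mixed quality.
SYNONYM_CLASS = {"glaxosmithkline": "dependable", "gsk": "ambiguous"}

# Hypothetical context terms (not synonyms) that support or undermine the company reading.
SUPPORTING = {"pharmaceutical", "company", "acquisition"}
OPPOSING = {"kinase", "phosphorylation", "glycogen"}


def score_entity(text, threshold=0.6):
    """Return (score, accepted) for the entity across the whole document."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    hits = [w for w in words if w in SYNONYM_CLASS]
    if not hits:
        return 0.0, False
    # Synonym range: the most dependable synonym seen anywhere sets the base score.
    score = max(PRIOR[SYNONYM_CLASS[w]] for w in hits)
    # Document context: move the score by 0.2 for each supporting or opposing term.
    score += 0.2 * sum(w in SUPPORTING for w in words)
    score -= 0.2 * sum(w in OPPOSING for w in words)
    score = max(0.0, min(1.0, score))
    return score, score >= threshold
```

With these made-up weights, "GSK announced the acquisition of a pharmaceutical company." is accepted as the company, while "GSK phosphorylation by the kinase was reduced." is rejected. A document that also spells out "GlaxoSmithKline" is accepted on the strength of the dependable synonym alone.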

Page 10: Termite & Termite Expressions

Bottom Line

Termite allows you to use ambiguous synonyms in your thesauri to increase recall without returning a lot of rubbish!

Page 11: Termite & Termite Expressions

β-actin

actin-β

b-actin

β actin

Beta actin

ß-actin

Actin, beta

b- actin

The German sharp S (ß) isn’t beta, but that doesn’t stop people using it

Including HTML Entity codes!

Termite handles Greek characters with ease
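
As a rough illustration of the kind of normalisation involved, the sketch below maps the variants listed above to a single key. The mapping table, the treatment of a lone "b" as beta and the function name `normalise` are assumptions for this example, not a description of how Termite works internally.

```python
import html
import re

# Characters used for Greek letters; ß (German sharp S) is included because
# authors use it as a look-alike for beta, as noted above.
GREEK_TO_NAME = {"β": "beta", "ß": "beta", "α": "alpha", "γ": "gamma"}


def normalise(term):
    """Reduce a term to a lower-case, word-order-independent key (hypothetical scheme)."""
    term = html.unescape(term)  # turns HTML entity codes such as "&beta;" into "β"
    for char, name in GREEK_TO_NAME.items():
        term = term.replace(char, " " + name + " ")
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    # Assumption: a lone "b" stands for beta. This would be unsafe as a general rule.
    tokens = ["beta" if t == "b" else t for t in tokens]
    return " ".join(sorted(tokens))
```

Every variant on this slide, plus the entity-coded form "&beta;-actin", yields the key "actin beta", so a single thesaurus entry can cover them all.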

Page 12: Termite & Termite Expressions

Muscarinic M1 Receptor(s)

Muscarinic (M1) Receptor

M1 Muscarinic Receptor

Muscarinic Receptor M1

Muscarinic Receptors M1

Muscarinic Receptor type M1

The usual variations…

Termite handles variations with ease

Page 13: Termite & Termite Expressions

M1/M2 muscarinic receptors

H1 and H2 Histamine Receptors

Kinases ERK1 and 2

ERK1/2

Termite handles “broken” phrases with ease
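
A simple way to think about “broken” phrases is to expand them into one full phrase per entity. The regular expression and the `expand` function below are a hypothetical sketch, assuming every item in the list shares the stem of the first one (as in all four examples above). Termite's real handling is not described in these slides.

```python
import re

# A stem of letters plus a number, followed by one or more continuations joined by
# "/", "," or "and"; a continuation may omit the stem (e.g. "ERK1/2", "ERK1 and 2").
COORDINATION = re.compile(
    r"\b([A-Za-z]+)(\d+)((?:\s*(?:/|,|and)\s*(?:[A-Za-z]+)?\d+)+)"
)


def expand(phrase):
    """Return one complete phrase per coordinated entity, or [phrase] if none is found."""
    match = COORDINATION.search(phrase)
    if not match:
        return [phrase]
    stem, first_number, rest = match.groups()
    numbers = [first_number] + re.findall(r"\d+", rest)
    before, after = phrase[:match.start()], phrase[match.end():]
    # Assumption: later items reuse the first stem, so "M1/M2" gives "M1" and "M2".
    return [before + stem + n + after for n in numbers]
```

Under these assumptions, `expand("M1/M2 muscarinic receptors")` gives "M1 muscarinic receptors" and "M2 muscarinic receptors", and `expand("Kinases ERK1 and 2")` gives "Kinases ERK1" and "Kinases ERK2". Each result can then be matched against the thesaurus as an ordinary phrase.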

Page 14: Termite & Termite Expressions

MANY MORE EXAMPLES

HTTP://WWW.SLIDESHARE.NET/SCIBITELY/TERMITE-DEALING-WITH-REAL-WORLD

Page 15: Termite & Termite Expressions

Accuracy Of Termite On Random Selection Of 400 Entries From BioCreative Gene-Mention Task

[Pie chart. Segments: 88%, 7%, 5%. Legend: Correct, Disagreement, Incorrect]

http://biocreative.sourceforge.net/

Page 16: Termite & Termite Expressions

WebConnect – “Termite Live”

http://scibite.com/site/p3/webconnect.html

Page 17: Termite & Termite Expressions

TEXPRESS

Going Beyond Recognition

Page 18: Termite & Termite Expressions

NER, Patterns, NLP

Termite is a Named Entity Recognition (NER) engine: it finds mentions of “things” in text
Natural Language Processing (NLP) is an area of linguistics that seeks to develop a computer-understandable representation of human text
NLP is both powerful and complex. Human language varies greatly, which leaves many facets to consider in NLP results
Critically, many use cases do not require full NLP; users simply wish to “identify any relationships between entities in the text”
TExpress uses “patterns” to achieve this

Page 19: Termite & Termite Expressions

An Example

Use case: Scan an input set of documents, identify disease-gene relationships within the text and output these to a file for downstream processing

We supply a simple pattern, Indication{0,3}(Gene|Protein_Class), which means:
  Find an indication
  Followed by 0-3 other words
  And then a gene or protein class. It’s critical to use “(Gene|Protein_Class)” when looking for gene/protein info, as classes are often used (see purple text below).

For example, on the text:

“Simvastatin induces heme oxygenase-1 expression but fails to reduce inflammation in the capsule surrounding a silicone shell implant in rats”

[DRUG:CHEMBL1064]simvastatin [VERB:!INDUCES]induces [GENE:HMOX1]heme _oxygenase _1 [VERB:!EXPRESSION]expression but {NEG}fails to [VERB:!REDUCE]reduce [INDICATION:D007249]inflammation in the capsule surrounding a silicone shell implant in [ORG:RAT]rats
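
The three steps of the pattern above can be sketched over a list of already-tagged tokens. This is an illustration only, not TExpress itself: the `(type, text)` token representation, the function `find_matches` and the sample sentence are all assumptions, and any token (tagged or not) is counted as one of the "other words".

```python
def find_matches(tokens, first="INDICATION",
                 last=("GENE", "PROTEIN_CLASS"), max_gap=3):
    """Return (first_text, last_text) pairs where an entity of type `first` is
    followed by at most `max_gap` other tokens and then an entity whose type is in `last`.

    `tokens` is a list of (type, text) pairs; type is None for plain words.
    """
    matches = []
    for i, (kind, text) in enumerate(tokens):
        if kind != first:
            continue
        # Inspect the next max_gap + 1 tokens: up to max_gap gap words, then the target.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            if tokens[j][0] in last:
                matches.append((text, tokens[j][1]))
                break
    return matches


# Hypothetical tagged sentence: "asthma is linked to cytokine"
example = [("INDICATION", "asthma"), (None, "is"), (None, "linked"),
           (None, "to"), ("PROTEIN_CLASS", "cytokine")]
print(find_matches(example))
```

Accepting both `GENE` and `PROTEIN_CLASS` as the final element mirrors the advice above: a pattern restricted to genes would miss sentences that name only a protein class.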

Page 20: Termite & Termite Expressions

Identifying Causal Relationships

E.g. we want to look for drugs that treat Lymphocytic Choriomeningitis Virus (LCMV)

We use the pattern: DRUG.{0,1}:treat.{0,1}:INDICATION(D001117)

Which is translated as:
  Find any drug in close proximity to the verb “treat”
  Followed closely by the specific indication (D001117 is the ID for LCMV)

From the following text, we obtain the computer-readable result below:

“To investigate its therapeutic potential, we used rapamycin to treat Lymphocytic Choriomeningitis Virus (LCMV)-infected perforin-deficient (Prf1(-/-) ) mice according to a well-established model of HLH”

[DRUG:CHEMBL413]rapamycin to [VERB:!TREAT]treat [D001117]lymphocytic _choriomeningitis _virus
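
For downstream processing, a result line like the one above can be split into records with a few lines of Python. The parser below is a sketch inferred only from the two sample outputs in these slides. It assumes that each tag is `[TYPE:ID]` or `[ID]`, and that the ` _` sequences join the words of a multi-word term. The names `TAG` and `parse_tagged` are hypothetical.

```python
import re

# "[TYPE:ID]" or "[ID]" followed by the matched text; continuation words
# of a multi-word term are written as " _word" in the sample output.
TAG = re.compile(r"\[(?:([A-Z_]+):)?([^\]]+)\](\S+(?: _\S+)*)")


def parse_tagged(line):
    """Return a list of {"type", "id", "text"} records for each tagged entity."""
    records = []
    for kind, identifier, surface in TAG.findall(line):
        records.append({
            "type": kind or None,          # None when the tag carries only an ID
            "id": identifier,
            "text": surface.replace(" _", " "),
        })
    return records


line = ("[DRUG:CHEMBL413]rapamycin to [VERB:!TREAT]treat "
        "[D001117]lymphocytic _choriomeningitis _virus")
for record in parse_tagged(line):
    print(record)
```

On the sample line this yields three records: the drug (CHEMBL413), the verb (!TREAT) and the indication (D001117) with the text "lymphocytic choriomeningitis virus". Untagged words such as "to" are skipped.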

Page 21: Termite & Termite Expressions

Other features

Extension (pattern will match multiple entities in a list)
  <Indication><gene> will find all genes in “Cancer due to mutations in p53, SCA1 and BRCA1”

Negativity
  TExpress will note where the extracted phrase contains negative keywords or sentiment

Verb Extraction
  Identify causal/action relationships and return the verb used, e.g. <gene> <any_verb> <gene> on “p53 binds mdm2” => binds

Auto-continuation
  We’ll match multiple entities of the same type in a list in the pattern (e.g. matching both drugs in the phrase “cancer can be treated with paclitaxel and bortezomib”) using an “<INDICATION><DRUG>” pattern
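
The list-matching behaviour can be pictured with a short sketch. It assumes tokens are `(type, text)` pairs with `None` for plain words, and that list items are joined only by commas, "and" or "or". The names `LIST_CONNECTORS` and `continue_list` are hypothetical and do not come from TExpress.

```python
# Assumed connectors that may sit between items of a list.
LIST_CONNECTORS = {",", "and", "or"}


def continue_list(tokens, start):
    """Starting at an entity token, collect the following entities of the same
    type as long as only list connectors separate them."""
    kind = tokens[start][0]
    collected = [tokens[start][1]]
    i = start + 1
    while i < len(tokens):
        token_kind, text = tokens[i]
        if token_kind == kind:
            collected.append(text)
        elif token_kind is not None or text.lower() not in LIST_CONNECTORS:
            break  # a different entity type or an ordinary word ends the list
        i += 1
    return collected


# "cancer can be treated with paclitaxel and bortezomib", tagged by hand
tokens = [("INDICATION", "cancer"), (None, "can"), (None, "be"),
          (None, "treated"), (None, "with"), ("DRUG", "paclitaxel"),
          (None, "and"), ("DRUG", "bortezomib")]
print(continue_list(tokens, 5))
```

Once a pattern has matched the first drug, continuing in this way picks up "bortezomib" as well, so both drugs are reported against the same indication.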

Page 22: Termite & Termite Expressions

Why TExpress?

Built on Termite with all its advantages (quality thesauri, ambiguity processing, coverage)
Simple patterns, easy to create and understand
High performance/scalability (around 10% slower than Termite alone)
Supports narrow focus (e.g. “<Gene1> inhibits <Gene2>”) and wide focus (e.g. “<Gene1> <any_verb> <Gene2>”) relationships
Simple JSON, TSV or XML output

Page 23: Termite & Termite Expressions

Want to know more?

Ask us for a demo today!

Email: [email protected]

Twitter: @scibitely

Call Us: +44 (0)20 8819 2776