Stefan Geißler | Hybrid semantic document enrichment using machine learning and linguistics - The Cogito Studio

Hybrid semantic document enrichment using machine learning and linguistics

Stefan Geißler, SEMANTICS, Leipzig, Sept 14 2016

Expert System

What is this?

A graph showing the distribution of large cities in the world
(axes: size of the city (population) against the city's rank)


A graph showing the richest people of the world
(axes: wealth of the person against the person's rank)


A graph showing the most frequent words from a large text corpus
(axes: frequency of the word against the word's rank)

Empirical evidence: many types of data from physics, the social sciences, etc. follow such a distribution.

"Zipf's law":

The number of data points (cities, rich people, words) with a value higher than S (on the y axis) is proportional to 1/S.
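The relationship stated above can be checked on idealised data. The following is a minimal sketch, not part of the original talk: the function names and the synthetic sizes (value 1000 / rank) are assumptions chosen for illustration. It shows that doubling the threshold S roughly halves the number of data points above it, and how a rank/frequency table can be derived from a token list.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def count_above(values, threshold):
    """Number of data points whose value is strictly higher than threshold."""
    return sum(1 for v in values if v > threshold)


# Assumption: an idealised Zipfian data set where the item at rank r
# has the value 1000 / r (real data only approximates this).
sizes = [1000 / r for r in range(1, 1001)]

if __name__ == "__main__":
    for s in (10, 20, 40):
        # Each doubling of s roughly halves the count.
        print(s, count_above(sizes, s))
    print(rank_frequency("the cat the dog the cat".split()))
```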

Distribution of categories in many categorized/tagged corpora
(graph axes: frequency of the category against the category's rank)

Problem #1:

How does that fit the requirement, stated at the start of many categorization projects, that a category needs a decent amount of data (>100 documents) to be trained?

Larger categories can be trained (learned automatically); smaller ones often can't.
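To make Problem #1 concrete, here is a minimal sketch under a strong simplifying assumption: category sizes follow an idealised 1/rank curve. The corpus size, the number of categories, and both function names are illustrative assumptions, not figures from a real project. The sketch counts how many categories clear the 100-document threshold.

```python
def zipf_category_sizes(total_docs, num_categories):
    """Idealised sizes: the category at rank r gets a share proportional to 1/r."""
    harmonic = sum(1 / r for r in range(1, num_categories + 1))
    return [total_docs / (r * harmonic) for r in range(1, num_categories + 1)]


def trainable(sizes, min_docs=100):
    """Number of categories with more than min_docs documents."""
    return sum(1 for s in sizes if s > min_docs)


if __name__ == "__main__":
    # Assumed example values: 50,000 documents spread over 1,000 categories.
    sizes = zipf_category_sizes(total_docs=50_000, num_categories=1_000)
    n = trainable(sizes)
    print(f"{n} of {len(sizes)} categories have more than 100 documents")
```

Under these assumptions only a small fraction of the categories is large enough to train; the long tail of small categories is left without a usable model.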

Problem #2:

Even for the frequent enough categories: Is a training corpus really representative?

Is "Greece" always about "debt crisis"?

Is "Ansbach" always about "terror"?

The learning method may learn unwanted associations.

Solution?

More data? No, because:

• The graph here is scale-free

• More data is often not available or very costly

(graph axes: frequency of the category against the category's rank)

Solution: Let the human expert refine the automatically created model

Human document categorization:

If ("Etna" or "Vesuvius" or "Pinatubo") AND ("lava" or "eruption")

Then "Volcanism"
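A hand-written rule of this kind translates almost directly into code. The sketch below is only an illustration in Python, not the rule language used in the talk; the function names are assumptions, and the volcano names use the English spellings Vesuvius and Pinatubo.

```python
import re

# Word lists taken from the rule on the slide (English spellings assumed).
VOLCANOES = {"etna", "vesuvius", "pinatubo"}
ERUPTION_TERMS = {"lava", "eruption"}


def tokens(text):
    """Lower-cased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def is_volcanism(text):
    """True if the text names a volcano AND mentions lava or an eruption."""
    words = tokens(text)
    return bool(words & VOLCANOES) and bool(words & ERUPTION_TERMS)


if __name__ == "__main__":
    print(is_volcanism("Lava flowed down Etna last night"))
    print(is_volcanism("Etna is a popular hiking destination"))
```

The first call satisfies both parts of the rule; the second mentions a volcano but no eruption term, so the category is not assigned.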

Machine document categorization:

Human refinement of the model is seldom a subject in scientific work on document categorization.

Different classification methods are most often compared only on the basis of their (automatic) performance on an evaluation corpus

… but human refinement is often a requirement in real-world document categorization projects.

• Training corpora alone are often not enough to attain the expected levels of quality.

• Additional data is hard to find (manual preparation or curation is very costly)

• Existing corpora may not always be representative.

Our suggestion

• Use available training data to train a model

• Make the model available in a human readable formal language

• Allow the user to inspect and refine the model where needed in a dedicated development and testing environment

• A rich formal language (strings, lemmas, regexps, semantic concepts, operators …) makes it possible to express learnt associations for bag-of-words models

• … as well as detailed syntactic/semantic constraints

• … and to visualize and evaluate the result in the same application
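The first two steps of this suggestion (train on the available data, then expose the result in a readable form) can be sketched as follows. This is a deliberately naive stand-in, not the learner or the rule language of the product described in the talk: the scoring scheme, the function names, the rendered rule syntax, and the tiny training set are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-cased set of word tokens (each word counted once per document)."""
    return set(re.findall(r"\w+", text.lower()))


def learn_rule(docs, category, top_k=3):
    """Pick the top_k words most specific to `category`.

    docs is a list of (text, label) pairs. The score of a word is its
    document frequency inside the category minus its frequency outside.
    """
    inside, outside = Counter(), Counter()
    for text, label in docs:
        (inside if label == category else outside).update(tokenize(text))
    ranked = sorted(inside, key=lambda w: (-(inside[w] - outside[w]), w))
    return ranked[:top_k]


def render_rule(words, category):
    """Render the learnt word list as a human-readable rule string."""
    quoted = " or ".join(f'"{w}"' for w in words)
    return f'If ({quoted}) Then "{category}"'


# Hypothetical training set, invented for this sketch.
TRAINING = [
    ("Greece debt talks stall in Athens", "Banking Crisis"),
    ("Athens asks troika for debt relief", "Banking Crisis"),
    ("Greece faces new debt deadline", "Banking Crisis"),
    ("Leipzig hosts a semantics conference", "Other"),
    ("New tram line opens in Leipzig", "Other"),
]

if __name__ == "__main__":
    words = learn_rule(TRAINING, "Banking Crisis")
    print(render_rule(words, "Banking Crisis"))
```

Because the learnt model is emitted as a plain rule string, an expert can read it and spot that place names alone would trigger the category.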

• For the reasons explained above, the statistical learning approach may erroneously learn a rule that the words "Athens" or "Greece" alone justify assigning the document to "Banking Crisis"

• The user can refine the learnt rule, adding the further constraint that features like "Debt", "Schäuble" or "Troika" are required before the category is assigned.
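The refinement step can be pictured as adding one more required group of terms to the learnt rule. The sketch below assumes a simple rule representation (a conjunction of word groups, each needing at least one hit); the `Rule` class and the sample sentences are hypothetical and do not reflect the actual rule language used in the talk.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A category plus word groups; every group needs at least one hit."""
    category: str
    all_groups: tuple  # tuple of frozensets of lower-cased words

    def matches(self, text):
        words = set(re.findall(r"\w+", text.lower()))
        return all(bool(words & group) for group in self.all_groups)


# Rule as it might come out of the learner: place names alone suffice.
learnt = Rule("Banking Crisis", (frozenset({"athens", "greece"}),))

# Refined by the user: a crisis-related term is required as well.
refined = Rule(
    "Banking Crisis",
    learnt.all_groups + (frozenset({"debt", "schäuble", "troika"}),),
)

if __name__ == "__main__":
    travel = "Top ten beaches in Greece"
    crisis = "Schäuble rejects new terms for Greece"
    print(learnt.matches(travel), refined.matches(travel))
    print(learnt.matches(crisis), refined.matches(crisis))
```

The learnt rule fires on the travel sentence, which is the unwanted association; the refined rule fires only when a crisis-related term appears as well.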

Sample projects

• <US Media company>
  • Large category schema for news articles
  • Task: set up a solution that allows combining automatically created rule sets with manual refinement

• <Insurance company>
  • Categorize medical reports using the ICD category scheme
  • Go beyond the quality that can be attained by using only the manually coded training set

Conclusion

• Requirements in industrial categorization projects are sometimes not identical to the scenarios in academic categorization benchmarks

• Available training data is sometimes limited, even in the age of big data

• Allow the seamless application (one language, one development environment) of both learnt and manually crafted rules

Expert System

Who we are

Expert System: Largest European provider of pure semantic technologies

• 7 Geographies

• 250+ team members

• Listed on the AIM exchange

• Recommended by Gartner, Forrester, IDC ...

• Experiences from hundreds of projects

• Award-winning technology: Taxonomy / Ontology Management, NLP, Information Extraction, Question Answering, Cognitive Computing

Global Positioning – Selected Clients


ENERGY, OIL & GAS

GOVERNMENT

FEDERAL AGENCIES

MEDIA & PUBLISHING

Life Sciences

FINANCE

Technology
