Hybrid semantic document enrichment using machine learning and linguistics
Stefan Geißler, SEMANTICS, Leipzig, Sept 14 2016
Expert System

Stefan Geißler | Hybrid semantic document enrichment using machine learning and linguistics - The Cogito Studio




What is this?

A graph showing the distribution of large cities in the world.

[Figure: size of the city (population) on the y-axis against the city's rank on the x-axis]


What is this?

A graph showing the richest people of the world.

[Figure: wealth of the person on the y-axis against the person's rank on the x-axis]


What is this?

A graph showing the most frequent words from a large text corpus.

[Figure: frequency of the word on the y-axis against the word's rank on the x-axis]


Empirical evidence: many types of data from physics, the social sciences, etc. follow such a distribution.

"Zipf's law":

The number of data points (cities, rich people, words) with a value higher than S (on the y axis) is proportional to 1/S.
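
As a rough illustration of the relationship stated above, the following sketch builds idealized values where the item at rank r has the value top_value / r, and counts how many lie above a threshold S. The function names and the exponent of exactly 1 are assumptions made for illustration; real data only approximate this shape.

```python
def zipf_values(top_value: float, n: int) -> list[float]:
    """Return n idealized values where the item at rank r has value top_value / r."""
    return [top_value / rank for rank in range(1, n + 1)]


def count_above(values: list[float], s: float) -> int:
    """Count the data points whose value is strictly higher than s."""
    return sum(1 for v in values if v > s)


# Hypothetical data: 1,000 items, the largest with value 1000.
values = zipf_values(top_value=1000.0, n=1000)

# Halving the threshold S roughly doubles the number of points above it,
# which is what "proportional to 1/S" means.
for s in (100.0, 50.0, 25.0):
    print(s, count_above(values, s))
```

With these idealized inputs, the counts grow as S shrinks, so most data points sit in the long tail of small values.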


Distribution of categories in many categorized/tagged corpora

[Figure: frequency of the category on the y-axis against the category's rank on the x-axis]


Problem #1:

How does that fit with the requirement, common at the start of many categorization projects, that a category needs a decent amount of data (>100 documents) to be trained?

Larger categories can be trained (learned automatically); smaller ones often can't.
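
To make the mismatch concrete, here is a minimal sketch that assumes category sizes follow an ideal Zipf distribution (size proportional to 1/rank). The corpus size, the number of categories and the function names are hypothetical; only the threshold of 100 documents comes from the slide.

```python
def category_sizes(total_docs: int, n_categories: int) -> list[float]:
    """Split total_docs over n_categories so that size is proportional to 1/rank."""
    harmonic = sum(1.0 / r for r in range(1, n_categories + 1))
    return [total_docs / (harmonic * r) for r in range(1, n_categories + 1)]


def trainable_share(total_docs: int, n_categories: int, min_docs: int = 100) -> float:
    """Fraction of categories with more than min_docs training documents."""
    sizes = category_sizes(total_docs, n_categories)
    return sum(1 for s in sizes if s > min_docs) / n_categories


# Hypothetical corpus: 50,000 documents spread over 1,000 categories.
share = trainable_share(50_000, 1_000)
print(f"{share:.1%} of categories have more than 100 documents")
```

Under these assumptions only a small head of the category list clears the threshold, while the large majority of categories stays below it.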


Problem #2:

Even for the frequent enough categories: Is a training corpus really representative?

Is "Greece" always about "debt crisis"?
Is "Ansbach" always about "terror"?

The learning method may learn unwanted associations.


Solution?

More data? No, because:
- The graph here is scale-free
- More data is often not available or very costly

[Figure: frequency of the category on the y-axis against the category's rank on the x-axis]
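
One way to read the "more data" objection is sketched below. It assumes an ideal Zipf distribution with exponent 1 and a fixed number of categories, both of which are simplifications, and the function name is hypothetical. Under these assumptions, the corpus size needed to lift the category at a given rank above the training threshold grows linearly with that rank.

```python
def docs_needed(rank: int, n_categories: int, min_docs: int = 100) -> float:
    """Corpus size at which the category at `rank` reaches min_docs documents,
    assuming category size is proportional to 1/rank."""
    harmonic = sum(1.0 / r for r in range(1, n_categories + 1))
    # size(rank) = total / (harmonic * rank), solved for total at size == min_docs
    return min_docs * rank * harmonic


# Hypothetical schema with 1,000 categories.
for rank in (10, 100, 1_000):
    print(rank, round(docs_needed(rank, n_categories=1_000)))
```

In practice a larger corpus also tends to bring in new rare categories, so the tail does not simply disappear as the corpus grows.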


Solution: Let the human expert refine the automatically created model

Human document categorization:

If ("Etna" or "Vesuvius" or "Pinatubo") AND ("lava" or "eruption")

Then "Volcanism"
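
The hand-written rule above can be sketched as a plain Boolean check. This is only an illustration in Python, not the rule language used in the tool; the function names are hypothetical, the volcano names use their English spellings, and matching is done on lowercased word tokens.

```python
import re

VOLCANOES = {"etna", "vesuvius", "pinatubo"}
ERUPTION_TERMS = {"lava", "eruption"}


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def is_volcanism(text: str) -> bool:
    """True if the text names a volcano AND mentions lava or an eruption."""
    words = tokens(text)
    return bool(words & VOLCANOES) and bool(words & ERUPTION_TERMS)


print(is_volcanism("Etna eruption closes Catania airport"))
print(is_volcanism("Hiking holidays near Etna"))
```

The second call fails the rule because a volcano name alone is not enough; both parts of the AND have to match.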

Machine document categorization:


This is seldom a subject in scientific work on document categorization.

Different classification methods are most often compared only on the basis of their (automatic) performance on an evaluation corpus.


… but this is often a requirement in real-world document categorization projects.

• Training corpora alone are often not enough to attain the expected levels of quality.

• Additional data is hard to find (manual preparation or curation is very costly)

• Existing corpora may not always be representative.


Our suggestion

• Use available training data to train a model

• Make the model available in a human readable formal language

• Allow the user to inspect and refine the model where needed in a dedicated development and testing environment
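
The first two steps of this suggestion can be sketched in a few lines. This is a deliberately naive stand-in, not the learning method or the rule language of the actual product: the scoring (document frequency inside the category minus outside), the rule syntax, the function name and the tiny training set are all invented for illustration.

```python
from collections import Counter


def learn_rule(docs: list[tuple[str, str]], category: str, top_n: int = 3) -> str:
    """Pick the terms most specific to `category` and render them as a readable rule."""
    inside, outside = Counter(), Counter()
    for text, label in docs:
        target = inside if label == category else outside
        target.update(set(text.lower().split()))
    # Rank terms by how much more often they occur inside the category than
    # outside it; ties are broken alphabetically.
    scored = sorted(inside, key=lambda t: (outside[t] - inside[t], t))
    terms = [t for t in scored if inside[t] > outside[t]][:top_n]
    quoted = " OR ".join(f'"{t}"' for t in terms)
    return f'IF ({quoted}) THEN "{category}"'


# Tiny made-up training set (hypothetical data, for illustration only).
training = [
    ("greece debt talks stall", "Banking Crisis"),
    ("greece troika debt deal", "Banking Crisis"),
    ("athens marathon results", "Sports"),
    ("debt free football club", "Sports"),
]
print(learn_rule(training, "Banking Crisis"))
```

Because the learnt model is emitted as a readable rule string, a person can see at a glance which terms it relies on and decide whether any of them needs tightening.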


• A rich formal language (strings, lemmas, regexps, semantic concepts, operators …) makes it possible to express learnt associations for bag-of-words models

• … as well as detailed syntactic/semantic constraints

• … and to visualize and evaluate the result in the same application


• For the reasons explained above, the statistical learning approach may erroneously learn a rule that the words "Athens" or "Greece" alone justify assigning the document to "Banking Crisis"

• The user can refine the learnt rule, adding the further constraint that features like "Debt", "Schäuble" or "Troika" are required before the category is assigned.
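
A minimal sketch of this refinement, written as two Python predicates rather than in the tool's own rule language; the function names are hypothetical and matching is done on lowercased word tokens.

```python
import re

TRIGGERS = {"athens", "greece"}                     # terms picked up by the learner
REQUIRED_CONTEXT = {"debt", "schäuble", "troika"}   # constraint added by the expert


def words(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def learnt_rule(text: str) -> bool:
    """Rule as learnt from the corpus: a trigger word alone is enough."""
    return bool(words(text) & TRIGGERS)


def refined_rule(text: str) -> bool:
    """Expert-refined rule: a trigger word AND at least one context feature."""
    w = words(text)
    return bool(w & TRIGGERS) and bool(w & REQUIRED_CONTEXT)


doc = "Athens hosts a film festival"
print(learnt_rule(doc), refined_rule(doc))
```

For the sample sentence the learnt rule fires while the refined rule does not, which is the kind of false positive the added constraint is meant to remove.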


… Sample projects

• <US Media company>
  - Large category schema for news articles
  - Task: set up a solution that allows combining automatically created rule sets with manual refinement

• <Insurance company>
  - Categorize medical reports using the ICD category scheme
  - Go beyond the quality that can be attained by using only the manually coded training set


Conclusion

• Requirements in categorization projects in the industry are sometimes not identical to the scenarios in academic categorization benchmarks

• Available training data is sometimes limited, even in the age of big data

• Allow the seamless (one language, one development environment) application of both learnt and manually crafted rules


Expert System: Who we are


Expert System: Largest European provider of pure semantic technologies

• 7 Geographies

• 250+ team members

• Listed on the AIM exchange

• Recommended by Gartner, Forrester, IDC ...

• Experiences from hundreds of projects

• Award winning technology: Taxonomy / Ontology Management, NLP, Information extraction, Question Answering, Cognitive Computing