SAP HANA Text Analysis: How Evil Is the Wolf?

Dr. Ingo Peter, SAP Österreich GmbH

26 April, 2016

© 2016 SAP SE or an SAP affiliate company. All rights reserved. Public.

How evil is the wolf?


Supported types of text analysis

Search

In addition to string matching, HANA features full-text search, which works on content stored in tables or exposed via views. Just like searching on the Internet, full-text search finds terms irrespective of the sequence of characters and words.

Text mining

Text mining makes semantic determinations about the overall content of documents relative to other documents. Capabilities include key term identification and document categorization. Text mining is complementary to text analysis.

Text analysis

Capabilities range from basic tokenization and stemming to more complex semantic analysis in the form of entity and fact extraction. Text analysis applies within individual documents and is the foundation for both full-text search and text mining.


SAP HANA architecture for text analysis

[Architecture diagram, summarized]

Consumers:

HANA UI for Information Access: for search and text capabilities inside SAP HANA

Apps on HANA: applications on any platform using SQL via ODBC/JDBC (SQL, MDX)

HANA Apps: applications running natively on / against the SAP HANA database (Extended Application Services; Search (InA), native REST, OData)

SAP HANA:

Store: tables

Engines: search engine, analytic engine

Preprocessor: linguistic processing, entity & fact extraction

Metadata: model

Studio: modeler, development workbench


Loading documents into SAP HANA

A table is provided that holds information about each document: ID, title, language, MIME type, …

The document itself is stored in the field content, of type BLOB (Binary Large Object).


Data loading: documents

Four Grimm's fairy tales in MS Word format …

Other possible formats include plain text, PDF, …

An overview is available in the view m_text_analysis_mime_types.


Data loading: tools

In this example, the documents are loaded with a Python script: load_docFTaleStore.py

Alternatives exist: Ruby, Data Services, and others.

The text documents may not need to be loaded into HANA at all; it can be sufficient to load only the analysis result (the text analysis index) into the database.

The documents themselves can then be made accessible through a link.
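The script load_docFTaleStore.py is not reproduced on the slides. The following is only a minimal sketch of the idea, assuming a Python DB-API cursor that accepts `?` placeholders. The column names (id, title, language, mimetype), the sample title and the language code are hypothetical, and an in-memory SQLite database stands in for HANA so that the snippet runs on its own.

```python
import sqlite3  # stand-in only; a real load would use a HANA DB-API connection

# Hypothetical column names: the slides only state that the table holds an ID,
# title, language, MIME type and a BLOB field named content.
INSERT_SQL = (
    "INSERT INTO tabFairyTaleStore (id, title, language, mimetype, content) "
    "VALUES (?, ?, ?, ?, ?)"
)

def load_document(cursor, doc_id, title, language, mime_type, content):
    """Insert one document; content is the raw file as bytes (the BLOB)."""
    cursor.execute(INSERT_SQL, (doc_id, title, language, mime_type, content))

# Demo against SQLite, used here purely so the sketch is runnable.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE tabFairyTaleStore ("
    "id INTEGER PRIMARY KEY, title TEXT, language TEXT, mimetype TEXT, content BLOB)"
)
load_document(cur, 1, "Sample fairy tale", "de", "application/msword", b"Es war einmal")
conn.commit()

stored = cur.execute(
    "SELECT title, length(content) FROM tabFairyTaleStore"
).fetchall()
```

For a real Word file, the bytes would come from reading the file in binary mode, for example with `pathlib.Path(path).read_bytes()`.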




Data loading: result

Four Grimm's fairy tales were loaded into a table in SAP HANA.


… once again … the result …


… is the visualization of a rating …

The result is the visualization of this data.

Every character in every fairy tale was assigned an average rating indicating how good or how evil it is.


Text analysis

An option to the full-text index

Text analysis is defined on a table column.

It is bound to full-text indexing: pre-processing steps are shared.

Results are stored in a table.

During pre-processing, a full-text index is created which is ‘attached’ as a shadow column to the table column indexed. This index can be accessed only indirectly: when a full-text search is performed.

In contrast, the results of text analysis are stored in the table $TA_<index_name>.

[Diagram: inside SAP HANA, full-text indexing with text analysis reads the source table and produces the full-text index and the text analysis results table.]


Text analysis


The following steps may be executed on unstructured text to augment full-text indexing:

Part-of-Speech

Tags word categories. Examples: quick: Adj; houses: Nn-Pl

Noun groups

Identifies concepts. Examples: text data; global piracy

Entity extraction

Classifies pre-defined entity types. Examples: Winston Churchill: PERSON; U.K.: COUNTRY

Fact extraction

Relates entities, e.g., classifies sentiments with topics. Example: I love SAP HANA:

[Sentiment] I [StrongPositiveSentiment] love [/StrongPositiveSentiment] [Topic] SAP HANA [/Topic].[/Sentiment]


Text analysis


[Diagram: inside SAP HANA, full-text indexing with text analysis reads the source table and produces the full-text index and the text analysis results table.]

Processing steps shown in the diagram: file format filtering, language detection, tokenization, stemming, part-of-speech, noun groups, entity extraction, fact extraction.


Extraction analysis output


Columns of the text analysis results table:

Key column(s): primary key(s) from the source table, used to link annotations back to the original source row

TA_RULE: type of text analysis output (“Entity Extraction”)

TA_COUNTER: uniquely identifies each text analysis annotation within a given source (and text analysis output type)

TA_TOKEN: word, punctuation, or sequence of words and punctuation from the input text (“surface form” of the annotation)

TA_LANGUAGE: language of the input text (mixed-language texts are not supported)

TA_TYPE: semantic type of the entity/relationship

TA_NORMALIZED: standard form for dictionary extractions; otherwise NULL

TA_STEM: not used for extraction output; always NULL

TA_PARAGRAPH: relative paragraph (number) containing the annotation

TA_SENTENCE: relative sentence (number) containing the annotation

TA_OFFSET: offset in characters from the beginning of the input text

TA_PARENT: indicates a semantic relationship between this annotation and another annotation from the same source text. Contains the TA_COUNTER value of the parent word or phrase.


The text analysis index for our example

A text analysis index is created for the table tabFairyTaleStore, which holds the four Grimm's fairy tales.

The text analysis index is a table: $TA_FT_FairyTaleStore

Example

„Liebe Kinder“ (“dear children”) = sentiment

Sentiment is the bracket around the topic („Kinder“) and the rating (here StrongPositiveSentiment, „liebe“).

Technically, the bracket is expressed through ta_counter and ta_parent: ta_parent points to the ta_counter of the sentiment.
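The ta_counter / ta_parent relationship can be illustrated with a small sketch. The rows below are hypothetical and only mimic a few columns of the $TA table for the „Liebe Kinder“ example; the counter values are invented for illustration.

```python
# Hypothetical rows shaped like a subset of the $TA_FT_FairyTaleStore columns.
rows = [
    {"ta_counter": 1, "ta_parent": None, "ta_type": "Sentiment",
     "ta_token": "Liebe Kinder"},
    {"ta_counter": 2, "ta_parent": 1, "ta_type": "StrongPositiveSentiment",
     "ta_token": "Liebe"},
    {"ta_counter": 3, "ta_parent": 1, "ta_type": "Topic",
     "ta_token": "Kinder"},
]

def group_by_sentiment(rows):
    """Attach each child annotation to the Sentiment row its ta_parent points to."""
    groups = {
        r["ta_counter"]: {"sentiment": r["ta_token"], "children": {}}
        for r in rows
        if r["ta_type"] == "Sentiment"
    }
    for r in rows:
        parent = groups.get(r["ta_parent"])
        if parent is not None:
            parent["children"][r["ta_type"]] = r["ta_token"]
    return groups

grouped = group_by_sentiment(rows)
```

Here `grouped[1]` holds the sentiment „Liebe Kinder“ together with its topic and its rating, which is the bracket described above.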


Deriving the rating from the standard sentiments

The standard sentiment values WPS, WNS, SPS, SNS, NES, … are mapped to a rating, and when the same entity occurs several times, an average is computed.

The TA index is accessed with SQL.
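The slides do not show the numeric scale or the SQL that was used. As a rough sketch of the mapping-and-averaging step, the following assumes a symmetric scale from -2 to +2 (an assumption, not the author's values) and hypothetical (actor, sentiment) pairs as they might be read from the TA index.

```python
from collections import defaultdict
from statistics import mean

# Assumed scale; the presentation does not state the actual numbers.
RATING = {
    "StrongPositiveSentiment": 2,
    "WeakPositiveSentiment": 1,
    "NeutralSentiment": 0,
    "WeakNegativeSentiment": -1,
    "StrongNegativeSentiment": -2,
}

def average_ratings(pairs):
    """Map each sentiment to its rating and average per actor."""
    by_actor = defaultdict(list)
    for actor, sentiment in pairs:
        by_actor[actor].append(RATING[sentiment])
    return {actor: mean(values) for actor, values in by_actor.items()}

# Hypothetical extraction results, not taken from the real index.
pairs = [
    ("Wolf", "StrongNegativeSentiment"),
    ("Wolf", "WeakNegativeSentiment"),
    ("Mutter", "StrongPositiveSentiment"),
]
ratings = average_ratings(pairs)
```

In the database itself the same step would typically be a GROUP BY with AVG over the joined topic and sentiment rows.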


Text analysis model for our example

Whether a character is good … evil is described by its attributes or by what it does:

<sentiment>arme Kinder</sentiment>

<topic>Kinder</topic>

<WeakPositiveSentiment>arme</WeakPositiveSentiment> *)

<sentiment>liebe Mutter</sentiment>

<topic>Mutter</topic>

<StrongPositiveSentiment>liebe</StrongPositiveSentiment>

<sentiment>der Fuchs stiehlt</sentiment>

<topic>Fuchs</topic>

<StrongNegativeSentiment>stiehlt</StrongNegativeSentiment>

*) Dictionary adjustment: arm (“poor”) = good in fairy tales!


Text analysis: entity and fact extraction

Text analysis gives structure to unstructured text in two ways:

Entities:

John Lennon was one of the Beatles.

<PERSON>John Lennon</PERSON> was one of the <ORGANIZATION@ENTERTAINMENT>Beatles</ORGANIZATION@ENTERTAINMENT>.

Facts:

I love your product.

I <STRONGPOSITIVESENTIMENT>love</STRONGPOSITIVESENTIMENT> <TOPIC>your product</TOPIC>.


Text analysis: what does entity extraction support?

Who: people, job titles, and national identification numbers

What: companies, organizations, financial indexes, and products

When: dates, days, holidays, months, years, times, and time periods

Where: addresses, cities, states, countries, facilities, internet addresses, and phone numbers

How much: currencies and units of measure

Generic concepts: text data, global piracy, and so on

Languages: Arabic, English, Dutch, Farsi, French, German, Italian, Japanese, Korean, Portuguese, Russian, Simplified Chinese, Spanish, Traditional Chinese


Text analysis: fact extraction

Voice of customer

Sentiments: strong positive, weak positive, neutral, weak negative, strong negative, and problems

Requests: general and contact info

Emoticons: strong positive, weak positive, weak negative, strong negative

Profanity: ambiguous and unambiguous

Languages: English, Dutch*, French, German, Italian, Portuguese, Russian, Simplified Chinese, Spanish, Traditional Chinese

*Emoticons and profanity only


Text analysis: fact extraction

Enterprise (language: English)

Membership information

Management changes

Product releases

Mergers & acquisitions

Organizational information

Public Sector (language: English)

Action & travel events

Military units

Person-alias, -appearance, -attributes, -relationships

Spatial references

Domain-specific entities


Creating a text analysis index

See also the standard configurations: LINGANALYSIS_FULL, EXTRACTION_CORE, EXTRACTION_CORE_VOICEOFCUSTOMER
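The slide showing the actual statement is not part of the text. As a hedged sketch, a text analysis index is created with HANA's CREATE FULLTEXT INDEX statement using the CONFIGURATION and TEXT ANALYSIS ON options. The helper below only assembles such a statement as a string; the helper itself is hypothetical, the index name FT_FairyTaleStore is inferred from the results table $TA_FT_FairyTaleStore, and the column name content comes from the loading slide.

```python
def create_ta_index_sql(index_name, table, column, configuration):
    """Assemble a CREATE FULLTEXT INDEX statement with text analysis enabled.

    Hypothetical helper: identifiers are double-quoted as-is, so it assumes
    the names contain no double quotes and the configuration no single quotes.
    """
    return (
        f'CREATE FULLTEXT INDEX "{index_name}" ON "{table}" ("{column}") '
        f"CONFIGURATION '{configuration}' TEXT ANALYSIS ON"
    )

statement = create_ta_index_sql(
    "FT_FairyTaleStore",   # inferred from $TA_FT_FairyTaleStore
    "tabFairyTaleStore",
    "content",
    "EXTRACTION_CORE",     # one of the standard configurations listed above
)
```

A custom configuration stored in the repository would be passed in place of the standard one; further options such as language or MIME type columns are omitted from this sketch.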


The text analysis index as a property of the base table


Text analysis index: table structure


Controlling the text analysis index through a configuration

The configuration is stored as an XML file in the HANA repository.

A configuration must be activated.


Configuration file hdbtextconfig

The configuration file (*.hdbtextconfig) in turn references dictionaries (*.hdbtextdict) and rule files (*.hdbtextrule).


Configuration: dictionaries

Dictionaries can and must be adapted to the requirements.

PERSON is a standard entity.

For fairy tales, this standard entity is extended with typical fairy-tale characters (entities): Hexe, Bauer, Zauberer, Königin, Stiefmutter, Wolf, …

Entities can have variants: the wolf is also called Isegrimm, Nimmersatt, Bösewicht, …

For special applications (e.g., medicine: ICD codes), dictionaries can also be delivered by partners, for example.

Dictionaries are XML files that must be activated in the HANA repository.


Configuration: dictionary results in the TA index

In the text analysis index, ta_token then holds the variants and ta_normalized holds the entities (see the example here for the entity PERSON).
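The relationship between ta_token (variant as found in the text) and ta_normalized (standard form from the dictionary) can be sketched as a simple lookup. The mapping below is hypothetical: it uses only the wolf variants named on the dictionary slide and is not the actual dictionary file.

```python
# Hypothetical dictionary content: standard form -> variants.
VARIANTS = {"Wolf": ["Isegrimm", "Nimmersatt", "Bösewicht"]}

def build_lookup(variants):
    """Map every variant (and the standard form itself) to the standard form."""
    lookup = {}
    for standard, names in variants.items():
        lookup[standard.lower()] = standard
        for name in names:
            lookup[name.lower()] = standard
    return lookup

LOOKUP = build_lookup(VARIANTS)

def annotate(token):
    """Return a (ta_token, ta_normalized) style pair; None if not in the dictionary."""
    return token, LOOKUP.get(token.lower())

result = [annotate(t) for t in ["Isegrimm", "Wolf", "Hase"]]
```

Grouping by the normalized form rather than by the token is what lets every mention of the wolf, under any of its names, contribute to one rating.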


Configuration: dictionaries for sentiments

Sentiments are also defined in dictionaries.

A standard dictionary for sentiments is also delivered with SAP HANA.

Adapting the dictionaries to special requirements: for example, terms such as schwach (“weak”) and krank (“ill”) can well carry a positive connotation in fairy tales.

Best practice: copy the standard dictionary and adapt the copy.


Standard sentiment values

Entity category names generally follow the pattern “Output Value” + “@” + “part-of-speech”.

Examples: WPS@Noun, WPS@Verb, WNS@Noun

Possible output values:

MAP (MajorProblem)

MIP (MinorProblem)

WPS (WeakPositiveSentiment)

SPS (StrongPositiveSentiment)

WNS (WeakNegativeSentiment)

SNS (StrongNegativeSentiment)

NES (NeutralSentiment)

Possible parts-of-speech: Adj, Noun, Verb, Adv

In addition to the above acronyms for the dictionary entity categories, some modules use a few more.

See the SAP HANA Text Analysis Extraction Customization Guide for details.
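The naming pattern can be made concrete with a small parser. This is only an illustration of the “Output Value” + “@” + “part-of-speech” convention using the values listed on this slide; it deliberately rejects the additional acronyms that some modules use.

```python
# Values as listed on the slide; other modules may use more.
OUTPUT_VALUES = {
    "MAP": "MajorProblem",
    "MIP": "MinorProblem",
    "WPS": "WeakPositiveSentiment",
    "SPS": "StrongPositiveSentiment",
    "WNS": "WeakNegativeSentiment",
    "SNS": "StrongNegativeSentiment",
    "NES": "NeutralSentiment",
}
PARTS_OF_SPEECH = {"Adj", "Noun", "Verb", "Adv"}

def parse_category(name):
    """Split e.g. 'WPS@Noun' into ('WeakPositiveSentiment', 'Noun')."""
    value, sep, pos = name.partition("@")
    if not sep or value not in OUTPUT_VALUES or pos not in PARTS_OF_SPEECH:
        raise ValueError(f"not a known entity category name: {name!r}")
    return OUTPUT_VALUES[value], pos

parsed = [parse_category(n) for n in ["WPS@Noun", "WPS@Verb", "WNS@Noun"]]
```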


HANA standard dictionaries

The HANA standard dictionaries can be found in the repository under sap hana ta.

For HANA SPS10, the delivery unit HANA_TA_VOC.tgz may need to be installed.

The standard dictionary for German sentiments is german-tf-voc-thesaurus.hdbdict

Best practice when adapting: make a copy and adapt the copy.


Fact extraction: rules / CGUL

Fact extraction is about extracting meaning from text.

OD defines a custom entity.

TE defines a standard entity.

In this example, the following facts are extracted:

„liebe Kinder“ (“dear children”): Kinder is rated positive, because lieb = positive

„der Fuchs stiehlt“ (“the fox steals”): Fuchs is rated negative, because stehlen = negative

„der Wolf mit gierigem Blick“ (“the wolf with a greedy look”): Wolf is rated negative, because gierig = negative

Rule files are referenced in the configuration file (*.hdbtextconfig) and must be activated.


From fact extraction to rating: TA index

The structure of the rule file also provides the bracketing automatically: Sentiment brackets the topic and the rating (e.g., StrongPositiveSentiment).

Rule file *.hdbtextrule

#group Sentiment { [OD StrongPositiveSentiment][TE SPS]<POS: Adj>?[/TE][/OD] %(Figure)

#subgroup Figure: [OD Topic][TE PERSON] <>? [/TE][/OD]


From fact extraction to rating: rating


From fact extraction to rating


What is CGUL?

CGUL = Custom Grouper User Language

• CGUL allows you to customize extraction functionality by providing the tools required to define your own custom extraction rules

• CGUL uses regular expressions and pre-defined linguistic attributes for the entities, relations, or events you need extracted

• CGUL functionality is the last processing stage in the text analysis pipeline. It occurs after linguistic analysis and entity extraction have taken place.

• CGUL is a token-based pattern matching language


CGUL example

Iteration operators:

+, *, ?, {m}, {n,m}

Examples

1. #group PROPERNOUNS: <[A-Z][a-z]+>

2. #subgroup Animals: <POS: Adj>* <STEM:animal>

3. #subgroup GADAFY: <(G|Q)adh?d?h?a+f(y|iy?)>

4. <(ab){2}> versus <(ab)>{2}

5. #define ISSN_Number: [0-9]{4}\-[0-9]{4}

6. #group NounPhrase: <POS: Det>?<POS: Adj>{0,3}<POS: Nn|Prop>+
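To make examples 3 to 5 more tangible, the sketch below replays their character-level patterns with Python's `re` module. This is not CGUL: Python's regex dialect may differ in details, and the token semantics are an assumption, namely that a pattern inside `<…>` must match one whole token and that an iterator outside `<…>` repeats tokens.

```python
import re

# Patterns copied from examples 3 and 5; Python's re is only a stand-in here.
GADAFY = r"(G|Q)adh?d?h?a+f(y|iy?)"
ISSN_NUMBER = r"[0-9]{4}\-[0-9]{4}"

def token_matches(pattern, token):
    """Assumed token semantics: the pattern must match the whole token."""
    return re.fullmatch(pattern, token) is not None

def run_matches(pattern, tokens, n):
    """True if some n consecutive tokens each match the pattern."""
    return any(
        all(token_matches(pattern, t) for t in tokens[i:i + n])
        for i in range(len(tokens) - n + 1)
    )

spellings = [
    s for s in ["Gadhafi", "Qaddafi", "Gadafy", "Kadafi"]
    if token_matches(GADAFY, s)
]
issn_ok = token_matches(ISSN_NUMBER, "1234-5678")

# Example 4: iterator inside the token versus outside the token.
one_token = token_matches(r"(ab){2}", "abab")        # like <(ab){2}>
two_tokens = run_matches(r"(ab)", ["ab", "ab"], 2)   # like <(ab)>{2}
not_two = run_matches(r"(ab)", ["abab"], 2)          # a single token is not two
```

Under these assumptions, `<(ab){2}>` describes the single token "abab", while `<(ab)>{2}` describes two consecutive tokens "ab" "ab".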


Text analysis: outlook

Results are stored in a table and can therefore be leveraged in all supported HANA scenarios:

Standard analytics

Create analytic views and calculation views on top.

E.g., companies mentioned in news articles over time

Search-based applications

Create a search model and build a search UI with Info Access. Results can be used to navigate and filter search results.

E.g., people finder, search UI for internal documents

Data mining, predictive

Use R, Predictive Analysis Library (PAL) functions, Graph, text mining, …

E.g., clustering, time series analysis, Latent Dirichlet Allocation, etc.

Thank you

Contact:

Dr. Ingo Peter

Solution Architect Data Science

SAP Österreich GmbH

0043 664 6207 391
