What's new for Text in SAP HANA SPS 11

© 2014 SAP AG or an SAP affiliate company. All rights reserved.

SAP HANA SPS 11 - What’s New? Search, Text Analysis and Text Mining

SAP HANA Product Management, December 2015 (Delta from SPS 10 to SPS 11)

Search

© 2015 SAP SE or an SAP affiliate company. All rights reserved. Public

Search Models

In a search model you define the structure of your “search object” and how it is exposed to an application:
• Tables and joins
• Columns
– Defaults for search
– Weights for ranking
– Fuzziness
– Defaults for facets

[Diagram: Table, Model, Access]

Search Models and Data Access

[Diagram comparing SAP HANA SPS10 and SAP HANA SPS11 across the Table, Model and Access layers. Labels: CDS view w/ search annotations; *any* view with search annotations; OData; SQL; JSON; Fiori.]

CALL ESH_CONFIG(configuration)
Built-in procedure to add search annotations (request/response, facets, UI areas etc.) to views

CALL ESH_SEARCH(query,?)
Built-in procedure to search on multiple search models with an “OData” query and a “JSON” response

ESH_CONFIG and ESH_SEARCH

SAP HANA SPS11 supports adding “search annotations” to existing views.

Search annotations are added in a CDS-like format, using the built-in procedure ESH_CONFIG.

ESH_SEARCH is the new search API:
• Federated search across multiple search models in a single call
• Based on OData v4, response is JSON
• Search-specific extensions, e.g. Search.score(), Search.search()

Search Model Example

CALL ESH_CONFIG('[{
  "uri": "~/$metadata/EntitySets",
  "method": "PUT",
  "content": {
    "Fullname": "DMM264/V_DOCUMENTS",
    "EntityType": {
      "@Search.searchable": true,
      "@EnterpriseSearch.enabled": true,
      "Properties": [
        {"Name": "ID", "@Search.defaultSearchElement": true, "@EnterpriseSearch.key": true, "@EnterpriseSearch.presentationMode": [ "TITLE" ]},
        {"Name": "AUTHOR", "@EnterpriseSearch.usageMode": [ "AUTO_FACET" ], "@EnterpriseSearch.presentationMode": [ "SUMMARY" ]},
        {"Name": "CATEGORY", "@EnterpriseSearch.usageMode": [ "AUTO_FACET" ], "@EnterpriseSearch.presentationMode": [ "SUMMARY" ]},
        {"Name": "TITLE", "@Search.defaultSearchElement": true, "@EnterpriseSearch.highlighted.enabled": true, "@Search.ranking": "HIGH", "@EnterpriseSearch.presentationMode": [ "TITLE" ]},
        {"Name": "CONTENT", "@Search.defaultSearchElement": true, "@EnterpriseSearch.snippets.enabled": true, "@Search.fuzzinessThreshold": 0.9, "@Search.ranking": "MEDIUM", "@EnterpriseSearch.presentationMode": [ "DETAIL" ]}
      ]
    }
  }
}]',?);

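Callouts on the example:
• "Fullname": existing view
• "AUTO_FACET": expose as facet
• "@Search.defaultSearchElement": search in this column
• "@Search.ranking": relevance ranking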
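The configuration above is a JSON array passed as a SQL string literal, which is easy to get wrong by hand. The following is a minimal sketch, in Python, of assembling that statement from a data structure. The helper name build_esh_config_call is hypothetical and not part of any SAP API; only the annotation names and the view name are taken from the slide, and the property list is shortened for brevity.

```python
import json


def build_esh_config_call(view_fullname, properties):
    """Return a CALL ESH_CONFIG statement for one view (hypothetical helper)."""
    request = [{
        "uri": "~/$metadata/EntitySets",
        "method": "PUT",
        "content": {
            "Fullname": view_fullname,
            "EntityType": {
                "@Search.searchable": True,
                "@EnterpriseSearch.enabled": True,
                "Properties": properties,
            },
        },
    }]
    payload = json.dumps(request)
    # Single quotes inside a SQL string literal are escaped by doubling them.
    return "CALL ESH_CONFIG('" + payload.replace("'", "''") + "',?);"


# Shortened version of the property list shown on the slide.
properties = [
    {"Name": "ID", "@Search.defaultSearchElement": True, "@EnterpriseSearch.key": True},
    {"Name": "AUTHOR", "@EnterpriseSearch.usageMode": ["AUTO_FACET"]},
    {"Name": "CONTENT", "@Search.defaultSearchElement": True,
     "@Search.fuzzinessThreshold": 0.9, "@Search.ranking": "MEDIUM"},
]
statement = build_esh_config_call("DMM264/V_DOCUMENTS", properties)
print(statement)
```

Generating the payload with a JSON serializer avoids the unbalanced quotes and braces that hand-written configurations are prone to.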

Data Access Example

CALL ESH_SEARCH('[
  "/$all?facets=all&$filter=Search.search(query=''scope:V_DOCUMENTS merkel'')&$top=10"
]', ?);
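The request above nests an OData-style query inside a JSON array inside a SQL string literal, so the single quotes around the search text have to be doubled. A minimal Python sketch of building the same statement follows. The helper name build_esh_search_call and its parameters are hypothetical; the query layout is copied from the slide, and search terms containing quotes are rejected rather than escaped because the slide does not show how they should be handled.

```python
import json


def build_esh_search_call(scope, terms, top=10, facets="all"):
    """Return a CALL ESH_SEARCH statement for one query (hypothetical helper)."""
    if "'" in scope or "'" in terms:
        # Escaping rules for quotes inside the search text are not covered here.
        raise ValueError("quotes in scope or terms are not supported by this sketch")
    query = f"scope:{scope} {terms}"
    path = (f"/$all?facets={facets}"
            f"&$filter=Search.search(query='{query}')"
            f"&$top={top}")
    payload = json.dumps([path])
    # Double the single quotes so the payload is a valid SQL string literal.
    return "CALL ESH_SEARCH('" + payload.replace("'", "''") + "', ?);"


statement = build_esh_search_call("V_DOCUMENTS", "merkel")
print(statement)
```

Apart from whitespace, the printed statement matches the one on the slide.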

Search Response Example

[Screenshot of the search response, with callouts marking the 1st result item, the 2nd result item and the 1st facet.]

How to find SAP HANA documentation on this topic?

SAP HANA Advanced Data Processing
– What’s New in the SAP HANA Advanced Data Processing (Release Notes)

Development
– File Loader Guide for SAP HANA
– SAP HANA Search Developer Guide

References
– SAP HANA INA Search JavaScript

• In addition to this learning material, you can find SAP HANA documentation on the SAP Help Portal knowledge center at http://help.sap.com/hana_options_adp.

• The knowledge center is structured according to the product lifecycle: installation, security, administration, development.

Text Analysis

Agenda – Text Analysis

New or Improved Features
• Grammatical Role Analysis
• Text Analysis XS API – Document Metadata
• Dictionaries – Case Sensitivity
• Language Column – SAP Language Codes
• Tolerant Stemming: Dutch, English, German, Italian
• Linguistic Analysis: Hungarian and Romanian
• Core Extraction: Korean
• Voice of Customer: English and German

New Grammatical Role Analysis (1/2)

Optional analyzer for English that identifies syntactic relationships between elements of a sentence in the form of subject–verb–object expressions, commonly known as ‘triples’.

The [SUBJECT]big brown cat[/SUBJECT] on the red couch was [VERB]eating[/VERB] a [DIRECTOBJECT]dead mouse[/DIRECTOBJECT].

The following grammatical roles, which describe arguments of verbs, are supported:
• Subject: person, place, thing, or idea that is doing or being something. Example: Oracle bought Responsys.
• DirectObject: recipient of the action. Example: Oracle bought Responsys.
• IndirectObject: affected by the action but not the primary object. Example: Oracle offered Responsys an improved contract.
• OtherObject: often a prepositional object. Example: They talked about the contract.
• Predicate: object of the verb to be. Example: This is a revised version.

One additional grammatical role is supported that does not describe a function with respect to a verb:
• PredicateSubject: subject of a predicative expression. Example: The contract is new.

New Grammatical Role Analysis (2/2)

Input: Oracle was rumored to buy marketing-software maker Responsys Inc. for $1.5 billion.

Output:

TA_RULE            TA_COUNTER  TA_TOKEN                                 TA_TYPE                  TA_PARENT  TA_OFFSET
Entity Extraction  1           Oracle                                   ORGANIZATION/COMMERCIAL  ?          0
Entity Extraction  2           marketing-software maker                 NOUN_GROUP               ?          26
Entity Extraction  3           Responsys Inc.                           ORGANIZATION/COMMERCIAL  ?          51
Entity Extraction  4           $1.5 billion                             CURRENCY                 ?          70
Grammatical Role   5           Oracle                                   Subject                  7          0
Grammatical Role   6           Oracle                                   Subject                  8          0
Grammatical Role   7           rumored                                  Root/MainVerb/Passive    ?          11
Grammatical Role   8           buy                                      MainVerb/Active          ?          22
Grammatical Role   9           marketing-software maker Responsys Inc.  DirectObject             8          26
Grammatical Role   10          $1.5 billion                             OtherObject/for          8          70

Notes:
• Core extraction is included in the configuration (1 - 4)
• Each grammatical role is either the governor (verb) or a dependent (verb argument)
• TA_TYPE holds the details about its grammatical role
• TA_PARENT holds the TA_COUNTER value of its corresponding governor
• It is possible for a single dependent to be the argument (5 and 6) of two different verbs
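The TA_PARENT link described in the notes is what turns the flat output into subject–verb–object triples. The following Python sketch shows one way to follow that link, using the six Grammatical Role rows from the output above as literal data. The function names are illustrative, "?" is modeled as None, and pairing every Subject with every DirectObject of the same verb is an assumption of this sketch rather than documented product behavior.

```python
# (TA_COUNTER, TA_TOKEN, TA_TYPE, TA_PARENT) for the Grammatical Role rows.
rows = [
    (5, "Oracle", "Subject", 7),
    (6, "Oracle", "Subject", 8),
    (7, "rumored", "Root/MainVerb/Passive", None),
    (8, "buy", "MainVerb/Active", None),
    (9, "marketing-software maker Responsys Inc.", "DirectObject", 8),
    (10, "$1.5 billion", "OtherObject/for", 8),
]


def group_by_governor(rows):
    """Attach each dependent to the verb whose TA_COUNTER equals its TA_PARENT."""
    verbs = {
        counter: {"verb": token, "type": ta_type, "arguments": []}
        for counter, token, ta_type, parent in rows
        if parent is None
    }
    for counter, token, ta_type, parent in rows:
        if parent is not None:
            verbs[parent]["arguments"].append((ta_type, token))
    return verbs


def triples(rows):
    """Build (subject, verb, direct object) triples for verbs that have both."""
    result = []
    for info in group_by_governor(rows).values():
        subjects = [tok for role, tok in info["arguments"] if role == "Subject"]
        objects = [tok for role, tok in info["arguments"] if role == "DirectObject"]
        for subject in subjects:
            for obj in objects:
                result.append((subject, info["verb"], obj))
    return result


print(triples(rows))
```

The verb "rumored" has a subject but no direct object, so only "buy" yields a triple for this sentence.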

Improved Text Analysis XS API – Document Metadata

For on-demand processing, text analysis output can be accessed via the SAP HANA Extended Application Services (XS) API:
• Alternative to persisting output data to the $TA table
• Bypasses creating the full-text index

The following metadata properties for documents can now optionally be included:
• Author
• Date
• Date Created
• Date Modified
• Description
• Keyword
• Language
• Subject
• Title
• Version
• FromEmailAddress
• FromName
• ToEmailAddress
• ToName
• CcEmailAddress
• CcName
• BccEmailAddress
• BccName

Improved Dictionaries – Case Sensitivity

In the context of extraction, dictionaries are user-defined repositories of entities.

Dictionaries are used for customized information about the entities your application must find.

Dictionaries can be used to store name variations in a structured way that is accessible through the extraction process.

Dictionary XML syntax now includes the option:

<dictionary xmlns="http://www.sap.com/ta/4.0" case-sensitive="true">
   :
   :
</dictionary>

For example, adding the attribute will ensure the dictionary entry WHO will match WHO and not who or Who.

The default behavior is case-insensitive.
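The difference between the two modes can be illustrated with a small Python sketch. It only models the matching rule described above; build_matcher is a hypothetical helper and does not reflect how the extraction engine is implemented.

```python
def build_matcher(entries, case_sensitive=False):
    """Return a function that tests whether a token matches a dictionary entry."""
    if case_sensitive:
        known = set(entries)
        return lambda token: token in known
    # Default behavior: compare case-insensitively.
    known = {entry.casefold() for entry in entries}
    return lambda token: token.casefold() in known


default_match = build_matcher(["WHO"])
strict_match = build_matcher(["WHO"], case_sensitive=True)

for token in ("WHO", "who", "Who"):
    print(token, default_match(token), strict_match(token))
```

With the case-sensitive matcher only the exact form WHO is accepted, which keeps the organization name apart from the ordinary pronoun.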

Improved Language Column – SAP Language Codes

You can specify the input language for each row.

Use LANGUAGE COLUMN to bypass automatic language detection:
• Specify ISO 639 language code or…
• SAP language codes can now be optionally utilized: English = E, French = F, German = D, etc.

This option allows configuring full-text search over existing SAP business applications without modifying the underlying database tables.

CREATE FULLTEXT INDEX PRODUCT_REVIEWS_IDX ON PRODUCT_REVIEWS(CONTENT)
  FAST PREPROCESS OFF
  LANGUAGE COLUMN LANGUAGE;
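Conceptually, the language column may now hold either kind of code. The Python sketch below shows the kind of normalization this implies, limited to the three SAP codes named on the slide. The mapping table and the function to_iso_639 are illustrative assumptions; the slide does not describe how SAP HANA resolves the codes internally.

```python
# Only the three codes named on the slide; a real table would be longer.
SAP_TO_ISO = {"E": "en", "F": "fr", "D": "de"}


def to_iso_639(code):
    """Accept a one-letter SAP language code or a two-letter ISO 639 code."""
    code = code.strip()
    if len(code) == 1:
        try:
            return SAP_TO_ISO[code.upper()]
        except KeyError:
            raise ValueError(f"unknown SAP language code: {code!r}")
    # Anything longer is assumed to already be an ISO 639 code.
    return code.lower()


for value in ("E", "D", "FR"):
    print(value, "->", to_iso_639(value))
```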

Improved Stemming – Dutch, English, German, Italian

Stemming identifies the base form referenced in a dictionary.

Tolerant stemming is introduced for Dutch, English, German and Italian. It is the default behavior and handles non-standard spellings in order to improve recall.

For example, in English the stemmer handles spelling variations between American and British English, does not require correct capitalization or accentuation, and treats otherwise required hyphens as optional.
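To make the idea of tolerance concrete, here is a minimal Python sketch that maps spelling variants to a shared lookup key by ignoring case, accents and hyphens. It is an illustration of the concept only: tolerant_key is a hypothetical function, it is not the stemmer shipped with SAP HANA, and it does not cover American versus British spelling variation.

```python
import unicodedata


def tolerant_key(word):
    """Reduce a word to a key that ignores case, accents and hyphens."""
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop combining marks (accents) left over from the decomposition.
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold().replace("-", "")


for variant in ("E-Mail", "email", "Caf\u00e9", "cafe"):
    print(variant, "->", tolerant_key(variant))
```

Variants that share a key would be looked up as the same dictionary form, which raises recall at some cost in precision.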

Improved Language Support for Hungarian and Romanian

Full linguistic analysis is now supported for Hungarian and Romanian through the addition of Part-of-Speech (POS) tagging and Noun Group (concept) extraction.

Languages listed in the support table (columns: LINGANALYSIS_BASIC, LINGANALYSIS_STEMS, LINGANALYSIS_FULL):
Arabic, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Farsi, French, German, Greek, Hebrew, Hungarian (NEW), Indonesian, Italian, Japanese, Korean, Norwegian (Bokmal), Norwegian (Nynorsk), Polish, Portuguese, Romanian (NEW), Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish

Improved Core Extraction for Korean

Higher precision and recall on existing predefined core extractions.

Examples of predefined entity types:

TITLE                       President
PERSON                      Barack Obama

LOCALITY                    Cambridge
REGION@MINOR                Napa County
REGION@MAJOR                Connecticut
COUNTRY                     Brazil
CONTINENT                   South America
GEO_FEATURE                 Mount Fuji
GEO_AREA                    Scandinavia
FACILITY                    Logan International Airport
LOCALITY                    New Delhi

ORGANIZATION@COMMERCIAL     AT&T
ORGANIZATION@EDUCATIONAL    University of Washington
ORGANIZATION@OTHER          FBI

SOCIAL_MEDIA@TWITTER_ID     @SAP
SOCIAL_MEDIA@TWITTER_TOPIC  #HANA

DATE                        2/14/2011
DAY                         Monday
MONTH                       June
YEAR                        2011

PHONE                       [email protected]@sap.com
URI@URL                     http://sap.com

Improved Voice of Customer – English and German

Set of rules to extract sentiments expressed about a product or a service:
• I [love] [my new phone]! Strong positive sentiment about my new phone
• He did [not like] [the book]. Weak negative sentiment about the book

More granular than competing systems because it can link sentiments with topics:
• [love my new phone] = Sentiment
• ‘love’ = StrongPositiveSentiment
• ‘my new phone’ = Topic

Determiners (“my” and “the” in the examples above) are no longer included in the topic classifications for English and German, so the topics above are reported as ‘new phone’ and ‘book’. This changes the previous behavior and simplifies topic aggregation.
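Why dropping determiners simplifies aggregation can be seen in a short Python sketch that counts topics with and without them. The determiner list and the sample topics are invented for illustration, and strip_determiners is a hypothetical helper that mimics the new behavior on topics extracted the old way.

```python
from collections import Counter

# Small illustrative list; not the set used by the Voice of Customer rules.
DETERMINERS = {"a", "an", "the", "my", "your", "his", "her", "our", "their"}


def strip_determiners(topic):
    """Remove leading determiners from a topic string."""
    words = topic.split()
    while words and words[0].lower() in DETERMINERS:
        words = words[1:]
    return " ".join(words)


# Topics as they would have been extracted with determiners included.
old_style_topics = ["my new phone", "the new phone", "a new phone", "the book"]

with_determiners = Counter(old_style_topics)
without_determiners = Counter(strip_determiners(t) for t in old_style_topics)

print(with_determiners)
print(without_determiners)
```

With determiners included, the three phone mentions fall into three separate buckets; without them they aggregate into a single topic.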

How to find SAP HANA documentation on this topic?

SAP HANA Advanced Data Processing
– What’s New in the SAP HANA Advanced Data Processing (Release Notes)

Development
– File Loader Guide for SAP HANA
– SAP HANA Search Developer Guide
– SAP HANA Text Analysis Developer Guide

References
– SAP HANA Text Analysis Extraction Customization Guide
– SAP HANA Text Analysis Language Reference Guide
– SAP HANA Text Analysis XS JavaScript API

• In addition to this learning material, you can find SAP HANA documentation on the SAP Help Portal knowledge center at http://help.sap.com/hana_options_adp.

• The knowledge center is structured according to the product lifecycle: installation, security, administration, development.

Text Mining

Agenda – Text Mining

New or Improved Features
• Language Support
• Automatic Stop Words

Text Mining Recap

Text mining provides statistical functions that can compare documents by examining the terms used within them.

The term-document matrix (a.k.a. text mining index) is an optional data structure that is optimized through the results of text analysis.

Text mining is bound to the full-text indexing and text analysis process.
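As a rough illustration of comparing documents through a term-document matrix, the Python sketch below builds term counts per document and compares them with cosine similarity. The sample documents are invented, and both the raw-count weighting and the cosine measure are assumptions of this sketch; the slide does not state which weighting or similarity functions the product uses.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Map each document ID to a Counter of term frequencies."""
    return {doc_id: Counter(terms) for doc_id, terms in docs.items()}


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Terms as they might come out of text analysis (invented sample data).
docs = {
    1: ["phone", "battery", "phone"],
    2: ["phone", "battery"],
    3: ["book", "author"],
}
matrix = term_document_matrix(docs)

print(cosine(matrix[1], matrix[2]))  # documents sharing terms score high
print(cosine(matrix[1], matrix[3]))  # documents with no shared terms score 0.0
```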

[Diagram: Full-text indexing with TA and TM. Labels: Insert; table with columns ID and TITLE; full-text index; text analysis results table; TM config.; term-document matrix.]

New Language Support

Text mining is now natively integrated with all 32 languages supported by text analysis.

It leverages the available text preprocessing steps:
• Tokenization
• Stemming
• Part-of-Speech tagging

Improved Stop Word Customization

Stop words are lists of literal terms to ignore in order to focus on the important content.

Text mining automatically filters terms to include only nouns based on their part-of-speech tags for 31 languages. Optionally, users can manually include additional stop words in the configuration properties.

Languages listed in the table (column: Automatic Stop Words):
Arabic, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Farsi, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian (Bokmal), Norwegian (Nynorsk), Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish
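The filtering described above, nouns only plus optional user-supplied stop words, can be sketched in a few lines of Python. The tag names, the sample sentence and the function select_terms are hypothetical and only illustrate the principle; they are not the tag set or configuration format used by SAP HANA text mining.

```python
def select_terms(tagged_tokens, extra_stop_words=()):
    """Keep nouns only, then drop any user-supplied stop words."""
    stop = {word.lower() for word in extra_stop_words}
    return [
        token
        for token, pos in tagged_tokens
        if pos == "noun" and token.lower() not in stop
    ]


# (token, part-of-speech) pairs as a tagger might produce them.
tagged = [
    ("the", "determiner"), ("phone", "noun"), ("has", "verb"),
    ("a", "determiner"), ("great", "adjective"), ("battery", "noun"),
    ("thing", "noun"),
]

print(select_terms(tagged))
print(select_terms(tagged, extra_stop_words=["thing"]))
```

The part-of-speech filter removes function words without any list, while the manual list handles nouns that carry little meaning in a given corpus.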

How to find SAP HANA documentation on this topic?

SAP HANA Advanced Data Processing
– What’s New in the SAP HANA Advanced Data Processing (Release Notes)

Development
– File Loader Guide for SAP HANA
– SAP HANA Search Developer Guide
– SAP HANA Text Mining Developer Guide

References
– SAP HANA Text Mining XS JavaScript API
– SQL Reference for Options

• In addition to this learning material, you can find SAP HANA documentation on the SAP Help Portal knowledge center at http://help.sap.com/hana_options_adp.

• The knowledge center is structured according to the product lifecycle: installation, security, administration, development.

Disclaimer

This presentation outlines our general product direction and should not be relied on in making a purchase decision. This presentation is not subject to your license agreement or any other agreement with SAP.

SAP has no obligation to pursue any course of business outlined in this presentation or to develop or release any functionality mentioned in this presentation. This presentation and SAP’s strategy and possible future developments are subject to change and may be changed by SAP at any time for any reason without notice.

This document is provided without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. SAP assumes no responsibility for errors or omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent.

© 2015 SAP SE or an SAP affiliate company. All rights reserved.

Thank you

Contact information

Anthony Waite
SAP HANA Product [email protected]