Technical Specifications 7.0 SP1

SAP InfiniteInsight® 7.0 SP1 Technical Specifications

Specifications Document Version: 1.2 – 2014-11

CUSTOMER

CUSTOMER SAP InfiniteInsight® 7.0 SP1 ii © 2014 SAP SE or an SAP affiliate company. All rights reserved- Predictive Analysis / Data Mining Process

Table of Contents

1 Predictive Analysis / Data Mining Process
1.1 Predictive Model Training
1.2 Predictive Model Apply
1.3 Analytical Data Sets and Data Connectivity

2 General Architecture
2.1 Authenticated Server
2.2 Stand-Alone Workstation
2.3 Remote Access

3 Technology
3.1 Supported Platforms
3.2 Expected Behavior on Multi-Processor Architecture
3.2.1 SAP InfiniteInsight® Threading Policy

4 Sizing Modeling Servers
4.1 Training Phase
4.1.1 RAM Sizing
4.1.2 Data Transfer
4.1.3 Temp Disk Space
4.1.4 Disk Space in a Year
4.2 Apply Phase
4.2.1 RAM Sizing
4.2.2 Data Transfer
4.2.3 Temp Disk Space
4.2.4 Disk Space in a Year
4.3 Sizing Tool

5 Network Requirements
5.1 RDBMS Connectivity
5.2 Client / Server Connectivity

6 Data Access Management
6.1 Access Rights for Files / RDBMS
6.1.1 Rights Definition
6.1.2 Data Access Processes
6.2 Unicode for RDBMS

7 Other Software Requirements
7.1 Standalone Application Mode
7.2 Client/Server Mode

8 InfiniteInsight® Modeler - Data Encoding Technical Specifications
8.1 Features

9 InfiniteInsight® Modeler - Regression/Classification Technical Specifications
9.1 Features
9.2 Notes

10 InfiniteInsight® Modeler - Segmentation/Clustering Technical Specifications
10.1 Features
10.2 Notes


11 InfiniteInsight® Modeler – Time Series Technical Specifications
11.1 Features
11.2 Notes

12 InfiniteInsight® Modeler - Association Rules Technical Specifications
12.1 Features
12.2 Notes

13 InfiniteInsight® Explorer - Event Logging Technical Specifications
13.1 Features

14 InfiniteInsight® Explorer - Sequence Coding Technical Specifications
14.1 Features

15 InfiniteInsight® Explorer - Text Coding Technical Specifications
15.1 Features

16 InfiniteInsight® Explorer - Semantic Layer Technical Specifications
16.1 Features

17 InfiniteInsight® Social Technical Specifications
17.1 Features

18 Geographic Location Support
18.1 Features

19 InfiniteInsight Recommendation
19.1 Features

20 Scorer Technical Specifications
20.1 Features
20.1.1 Without Date Variables
20.1.2 With Date Variables
20.2 Notes

21 InfiniteInsight® Access
21.1 ODBC
21.1.1 Platform: A Definition
21.1.2 Reproducibility Issue
21.1.3 List of Platforms Reproduced and Tested

22 Flat Files
22.1 Supported Data Formats
22.2 Note about Date and Datetime Variables

23 SAS Files
23.1 Supported Data Formats

24 Annex
24.1 Open Source Software Used in InfiniteInsight®
24.2 List of Available Binaries


1 Predictive Analysis / Data Mining Process

To help the information technology (IT) team understand the constraints of using predictive analytics software, this chapter gives a brief overview of the data mining process.

Here is a definition of predictive analytics from Jeff Liebl, from Multichannel Merchant.

"Predictive analytics is a process, based on statistical and data mining techniques, that models current and historical customer performance data and traits to make 'predictions' about future outcomes and customer behaviors. These predictions can be expressed as numerical values, or scores, that correspond to the likelihood of a particular occurrence or behavior taking place in the future. In corporate America, predictive scores are typically used to determine the risk or opportunity associated with a specific customer or transaction. These evaluations assess the relationships between many variables to estimate risk or response."

The numerical values or scores are generated by mathematical equations resulting from predictive models. The notion of a predictive model has nothing to do with that of a data model, a term used to describe the structure of data as seen in the table schemas of relational databases.

A simple definition of a predictive model is a mathematical model that describes the relationship between some input data (variables or attributes: for example, demographic information known about customers) and some output data (the target variable: for example, whether or not a given customer bought a product following a marketing campaign). There are many application domains for predictive analytics; if you are reading this document, your corporation has decided to use it to optimize some of its business processes.

The following extract from the Wikipedia article "Predictive analytics" (http://en.wikipedia.org/wiki/Predictive_analytics) describes a very common application of predictive analytics:

"Direct marketing: Product marketing is constantly faced with the challenge of coping with the increasing number of competing products, different consumer preferences and the variety of methods (channels) available to interact with each consumer. Efficient marketing is a process of understanding the amount of variability and tailoring the marketing strategy for greater profitability. Predictive analytics can help identify consumers with a higher likelihood of responding to a particular marketing offer. Models can be built using data from consumers’ past purchasing history and past response rates for each channel. Additional information about the consumers demographic, geographic and other characteristics can be used to make more accurate predictions. Targeting only these consumers can lead to substantial increase in response rate which can lead to a significant reduction in cost per acquisition. Apart from identifying prospects, predictive analytics can also help to identify the most effective combination of products and marketing channels that should be used to target a given consumer."

IN THIS CHAPTER

Predictive Model Training
Predictive Model Apply
Analytical Data Sets and Data Connectivity



1.1 Predictive Model Training

The first resource-intensive phase of the data mining process is the model training.

How does it work in practice? Predictive models are generally built using training samples: lines of data for which the expected value is known. This expected value may be known because it has been collected in the past. For example, the training data set contains people that were active customers at the beginning of the year and the target indicates whether or not these customers bought a particular product in the first 6 months of the year. The expected value may be known because a specific experiment has been run. For example, a first mailing has been sent to a sample of the active customers, and data has been collected to flag which of these active customers have responded favorably to this mailing in a period of three months.

How does this training phase impact the IT resource? Predictive models need access to training "Analytical Data Sets" (ADS) to be built. Using InfiniteInsight®, predictive models are built on a modeling server or workstation on which InfiniteInsight® has been installed. It is very common to have Training Analytical Data Sets that contain a sample of the customer population. These samples are in the range of 50,000 to 500,000 lines, though some clients use Training Analytical Data Sets of more than 30 million lines. Each customer in this case is represented by a line containing attribute values. It is very common to have Training Analytical Data Sets containing from 100 to 2,000 attributes to describe each customer – even if some clients have around 20,000 attributes.

Note: For the training phase, InfiniteInsight® must be installed on a computer with access to the data sources, with sufficient bandwidth to support the exchange of about 4 gigabytes (500,000 lines times 2,000 columns times 4 bytes). This bandwidth estimate is provided as a starting point and should be revised for each installation.
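The arithmetic behind this estimate can be sketched as follows. This is a minimal illustration and not part of InfiniteInsight®: the function name is hypothetical, and the fixed width of 4 bytes per value is the simplifying assumption used in the note above (actual column widths depend on the data types).

```python
def estimate_transfer_bytes(lines: int, columns: int, bytes_per_value: int = 4) -> int:
    """Rough volume of data exchanged between the data source and the modeling server."""
    return lines * columns * bytes_per_value


# Upper end of the typical training sample described above.
training_bytes = estimate_transfer_bytes(lines=500_000, columns=2_000)
print(f"{training_bytes / 1e9:.1f} GB")  # prints "4.0 GB"
```

Rerunning the function with the line and attribute counts of a given installation gives a first figure to revise, as the note recommends.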

InfiniteInsight® V5.0 has introduced the notion of cache in order to minimize the number of data transfers between the data source and the modeling server or workstation.


1.2 Predictive Model Apply

The goal of a predictive model is to apply it on data that has not been used for training, providing an estimation of the expected (and yet unknown) target value (hence the name predictive).

To follow the example developed above, once the model has been trained on the data collected through a first mailing wave done on 50,000 sampled customers, it can be applied to the 20 million customers in your customer database. The model uses the same attributes to describe these customers, in order to ‘score’ these customers and compute the probability that they will answer favorably to the mailing campaign. InfiniteInsight® provides some tools for marketing services in order to take into account probability of positive answers, costs of contacts, and expected revenues to optimize the actual list of people to be contacted for this mailing.
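The trade-off between response probability, contact cost, and expected revenue can be illustrated with a simple expected-profit rule. This is only a sketch under stated assumptions: it is not the optimization performed by the InfiniteInsight® tools, and the class name, function name, and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredCustomer:
    customer_id: str
    response_probability: float  # score produced by the predictive model


def select_contacts(customers, cost_per_contact, revenue_per_response):
    """Keep customers whose expected revenue exceeds the cost of contacting them."""
    selected = [
        c for c in customers
        if c.response_probability * revenue_per_response > cost_per_contact
    ]
    # Most promising customers first.
    return sorted(selected, key=lambda c: c.response_probability, reverse=True)


scored = [ScoredCustomer("A", 0.02), ScoredCustomer("B", 0.10), ScoredCustomer("C", 0.004)]
mailing_list = select_contacts(scored, cost_per_contact=0.5, revenue_per_response=40.0)
print([c.customer_id for c in mailing_list])  # prints ['B', 'A']
```

Customer C is dropped because its expected revenue (0.004 times 40) is lower than the cost of one contact.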

How does it work in practice? Predictive models, once trained, can be seen as ‘scoring equations’, transforming input attribute values into a score, a probability, an estimated value, or a segment. To be applied, the scoring equation needs to be fed with an Apply Analytical Data Set. In InfiniteInsight®, there are three options that can be used to apply predictive models:

The most frequent option used by our customers is called the in-database apply, which is automatically called upon when the input data and the scores are to be extracted from and generated within a relational database. In this case, InfiniteInsight® uses its InfiniteInsight® Scorer to automatically generate the scoring equation in SQL (or UDF) and execute this equation on the database. This is the preferred option when the input data is in a relational database. It is also preferred when there are millions of scores to generate.

Another possibility is to export the scoring equation through a specific module called InfiniteInsight® Scorer. The exported scoring code can then be integrated into the scoring environment (examples of export codes are: SAS, PMML, C, C++, Java, SQL, UDF, and JavaScript). This option is preferred when the input data is in a specific format such as SAS, or when the scoring environment is in real-time using the integration of a Java, C++ or PMML scoring equation.

The input data can be transferred to the modeling server or workstation, where the scoring equation is evaluated by the InfiniteInsight® software; the resulting scores can then be transferred back to the environment that consumes them. This option uses the batch apply service of InfiniteInsight®.
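To make the idea of a scoring equation expressed in SQL more concrete, the sketch below builds a SQL statement for a small linear equation and runs it inside a database. It is an illustration only: the code actually generated by InfiniteInsight® Scorer is not shown in this document and will differ, SQLite stands in for the relational database because it ships with Python, and the table, columns, and coefficients are hypothetical.

```python
import sqlite3


def scoring_sql(table, key, intercept, coefficients):
    """Build a SQL statement that evaluates a linear scoring equation in the database."""
    # Identifiers are trusted here; a real generator would quote and validate them.
    terms = [f"({weight!r} * {column})" for column, weight in coefficients.items()]
    equation = " + ".join([repr(intercept)] + terms)
    return f"SELECT {key}, {equation} AS score FROM {table}"


# Hypothetical model that retained two attributes.
sql = scoring_sql("customers", "customer_id", -1.0, {"age": 0.02, "nb_purchases": 0.5})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, age REAL, nb_purchases REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [("A", 50, 2), ("B", 25, 0)])

# Only the SQL text travels to the database; the scores are computed there.
scores = dict(conn.execute(sql).fetchall())
print(sql)
print(scores)
```

The point of the design is that only a character string is sent to the database, while the rows themselves never leave it.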

How does this apply phase impact IT resources? The three different options lead to three different operational constraint sets.

The in-database apply option requires licensing for InfiniteInsight® Scorer. The transfer between the data source and the modeling server is limited to the character string containing the scoring code. No data transfer occurs in this option. The entire bulk processing, including the computation of the scores, is done within the relational database.

The export code option requires licensing for InfiniteInsight® Scorer. The management of the generated code is then left to the integration team. The processing is done where the scoring code is compiled and executed.

The batch apply service option does not require any extra licensing but may require the transfer of large datasets between the data source and the modeling server or workstation. Coming back to the example of a dataset with 20 million lines and 2,000 attributes, this would generate a transfer of ~150 Giga-bytes. This large figure can be lowered by the InfiniteInsight® feature that automatically selects input variables for classification and regression. A model built from 2,000 attributes may in the end use only 50 without losing any predictive performance. The Apply Data Set then only needs to contain the 50 attributes used in the scoring equation, leading to a smaller transfer (in the 4 Giga-byte range).
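The two transfer figures above can be checked with a short calculation. This is a sketch under assumptions: it uses 4 bytes per value, as in the training-phase estimate, and it reads the ~150 figure as 160 billion bytes expressed in binary gigabytes, which is an interpretation rather than something the document states. The function name is hypothetical.

```python
BYTES_PER_VALUE = 4  # assumption: same fixed width as in the training-phase estimate


def apply_transfer_bytes(lines: int, attributes: int) -> int:
    """Rough volume of an Apply Data Set sent to the modeling server."""
    return lines * attributes * BYTES_PER_VALUE


full = apply_transfer_bytes(20_000_000, 2_000)   # all candidate attributes
reduced = apply_transfer_bytes(20_000_000, 50)   # only the attributes kept by the model

print(f"full:    {full / 2**30:.0f} binary gigabytes")  # prints "full:    149 binary gigabytes"
print(f"reduced: {reduced / 1e9:.0f} gigabytes")        # prints "reduced: 4 gigabytes"
print(f"ratio:   {full // reduced}x")                   # prints "ratio:   40x"
```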

Note: All connectivity with relational database systems is performed through ODBC. This requires the proper installation of ODBC drivers on the modeling server or workstation.

Note that KMX generates scoring equations that involve only the attributes actually used by the model. When used in conjunction with InfiniteInsight® Explorer - Semantic Layer (ADM), the in-database option generates Apply Data Sets containing only the required attributes, thus saving some processing power in the relational database management system.

1.3 Analytical Data Sets and Data Connectivity

The data mining process comprises four phases:


Phase 1 represents the creation of the Training Analytical Data Set, which is then used in phase 2 in order to build (or train) the predictive model. The Apply Analytical Data Set is then created in phase 3 in order to be fed to the built model for apply. Phase 4 applies the model to the Apply Analytical Data Set to generate the scores, probabilities, or estimated values.

InfiniteInsight® natively reads data from text files and most relational databases through ODBC drivers. By licensing the InfiniteInsight® Access module, InfiniteInsight® can use analytical datasets from various sources including SAS, SPSS, and Excel.

The Data Extraction process is covered by SAP InfiniteInsight® through specific modules such as InfiniteInsight® Explorer - Event Logging, InfiniteInsight® Explorer - Sequence Coding, InfiniteInsight® Explorer - Text Coding, InfiniteInsight® Social, and a specific management system for relational databases called InfiniteInsight® Explorer - Semantic Layer.

The InfiniteInsight® Explorer - Event Logging, InfiniteInsight® Explorer - Sequence Coding, InfiniteInsight® Explorer - Text Coding, and InfiniteInsight® Social features are subject to specific licenses. They are provided as services in InfiniteInsight® and therefore run on the modeling server or workstation. They create internal variables within SAP InfiniteInsight® predictive models in order to improve their predictive power:

InfiniteInsight® Explorer - Event Logging is used to create thousands of possible aggregates over multiple time periods, pivoted on categories, using relative reference dates. For example, it can be used to create the average amount bought by each customer each month for six months starting from the date at which this customer has bought a specific product, for each product brand. InfiniteInsight® Explorer - Event Logging can thus be used to explore which aggregates are useful to improve the predictive model’s performance.

InfiniteInsight® Explorer - Sequence Coding is used to create variables to collect transitions between events. For example, it is often used to compute transitions between pages on a web site for each web session, allowing predictive models to use this information to predict shopping behavior during the web session.

InfiniteInsight® Explorer - Text Coding is used to automatically transform textual fields (such as emails, or free-form text fields in marketing surveys) into ‘root words’ that can also be used by predictive models. For example, an insurance claim letter or email can be decomposed into words to detect if some words are predictive for fraud.

InfiniteInsight® Social is used to extract information from graphs such as found in social network sites or in calling patterns for telecommunications or email exchanges between individuals. InfiniteInsight® Social can generate variables to collect information on the direct first circle of customers and also to detect communities and compute variable profiles on these circles or communities. For example, the ratio of churners in the community may be used by a predictive model to predict the probability of churn for a given customer of a telecommunications operator.
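As an illustration of the kind of variable described above for InfiniteInsight® Explorer - Sequence Coding, the sketch below counts transitions between consecutive pages of one web session. It is a minimal example of the general idea only, not the module's actual algorithm; the function name and the page names are hypothetical.

```python
from collections import Counter


def transition_counts(pages):
    """Count transitions between consecutive events of one session."""
    # Pair each page with the page that follows it.
    return Counter(zip(pages, pages[1:]))


session = ["home", "search", "product", "search", "product", "cart"]
counts = transition_counts(session)
print(counts[("search", "product")])  # prints 2
```

Each distinct transition can then serve as one input variable of the session, with its count as the value.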

Note: InfiniteInsight® Explorer - Event Logging, InfiniteInsight® Explorer - Sequence Coding, and InfiniteInsight® Social build internal representations in order to compute the values of their generated variables. They require large RAM configurations. This is particularly true for InfiniteInsight® Social, which has been used to generate graphs representing links between tens of millions of telecommunications customers.

InfiniteInsight® Explorer - Event Logging, InfiniteInsight® Explorer - Sequence Coding and InfiniteInsight® Social are better suited for 64-bit architectures.


The Analytical Data Management (ADM) module follows a different design. It is based on an SQL generator technology, which is optimized for each supported relational database (such as Oracle, Teradata, SQL Server, DB2, and MySQL). It provides all functions required for data manipulation in order to create analytical data sets, plus a technique that manages the evolution of these analytical data sets through time. ADM is provided to customers if they have licensed either InfiniteInsight® Explorer - Event Logging or InfiniteInsight® Explorer - Sequence Coding.

Note: Since SAP InfiniteInsight® V5.1, in order to avoid the misuse of complex queries by data mining processes or users, we have included the use of ‘explain’ features from the major databases to estimate the complexity of the generated queries. This allows our customers to implement policies based on this complexity.
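A policy based on a query plan can be sketched as follows. This is an illustration under assumptions, not the mechanism implemented in SAP InfiniteInsight®: SQLite and its EXPLAIN QUERY PLAN statement are used only because they ship with Python, the 'explain' facilities of the supported databases differ, and the policy function and table are hypothetical.

```python
import sqlite3


def uses_full_scan(conn, query):
    """Return True if SQLite's query plan contains a full table scan."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    # The last column of each plan row is a human-readable detail string.
    return any(row[-1].startswith("SCAN") for row in plan)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")

# No index on amount: the plan is a full scan, which a policy could reject.
print(uses_full_scan(conn, "SELECT * FROM sales WHERE amount > 10"))       # prints True
# Indexed lookup: the plan is a search through the index.
print(uses_full_scan(conn, "SELECT * FROM sales WHERE customer_id = 42"))  # prints False
```

A site policy could, for example, refuse or defer any generated query for which such a check returns True on a large table.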

Other examples of features that we have implemented in order to minimize the load on the relational database include:

The optimization of the queries for each specific SQL dialect (using OLAP extensions, subqueries, or correlated queries when building aggregates).

The use of temporary tables when needed in order to minimize some computational overlap when a given variable is used in multiple expressions.

The generation of specific queries when the user wants to see the data manipulation results on the first 100 lines or to compute the descriptive statistics on the first 2,000 lines, for example.


2 General Architecture

SAP InfiniteInsight® can be deployed as a stand-alone process (which may be remotely accessed), or as a true client-server system.

SAP InfiniteInsight® provides a user interface written in Java called the InfiniteInsight® modeling assistant. The same user interface is used in all deployment architectures, and is not subject to a specific license.

IN THIS CHAPTER

Authenticated Server
Stand-Alone Workstation
Remote Access
CORBA Client/Server


2.1 Authenticated Server

The client-server architecture is based on a master process that manages authentication (checking login/password) and impersonation (creation of one process per connected client session) and controls the processes created for each client session.

Communication between the server and the data can use ODBC, a native file system, SAP InfiniteInsight® advanced Access for SAS, SPSS, Matlab files, or a customized connection depending on the type of data being accessed.

Communications between the clients (running a graphical user interface written in Java called the InfiniteInsight® modeling assistant) and the SAP InfiniteInsight® server are based on CORBA-IIOP which uses SSL encryption. Clients are aware of the server locations through access to the CORBA Yellow Pages mechanism called ‘NameServer’.

In this solution, the SAP InfiniteInsight® Modeling Assistant only performs graphical operations: heavy processing is performed on the server. SAP InfiniteInsight® modeling assistant user interfaces must be installed on these clients (and updated when a new SAP InfiniteInsight® version is installed). SAP InfiniteInsight® provides a Java Web Start infrastructure to allow IT organizations to disseminate clients when a new SAP InfiniteInsight® version is available.

A single license file (containing the list of modules purchased by the customer) for the Modeling server is required.

Besides the SAP InfiniteInsight® Modeling Assistant, SAP InfiniteInsight® provides a graphical user interface called SAP InfiniteInsight® Admin Console to provide external session management.


SAP InfiniteInsight® Authenticated server main features are:

Authentication of users: users are required to enter a valid user/password to start a modeling session.

Start SAP InfiniteInsight® Instance per session: once a user is authenticated, a SAP InfiniteInsight® instance process is started for the modeling session.

Communication between client and server is encrypted to enforce security (password and data are not sent as plain text through the network).

The advantages of this solution include:

The network configuration is light, as only two ports (one for the ‘NameServer’, and one for the Authenticated Server) must be opened since all communications from clients are directed to the Authenticated Server.

Each modeling session can use the maximum process memory size, without sharing it with other client processes.

A user can close the SAP InfiniteInsight® modeling assistant while a long-running session is still in progress (a long model training, for example), and reconnect to this session later.

Any problem occurring for one user session does not impact the other users.

The memory resources allocated to one session are released once the session is terminated (when the client GUI is closed for example).

Operating system rights can be used to check access to the different resources (for example, modeling data).

One license is required on the modeling server for any number of clients.

User activity monitoring and logging is possible and activated by default.

The constraints for this solution are related to the installation process:

Configuration of the authentication system within SAP InfiniteInsight®.

Setting specific rights for the account under which the SAP InfiniteInsight® Authenticated Server process will run.

Note: On UNIX operating systems, authentication is managed through PAM (Pluggable Authentication Module), which allows authentication to be delegated to the operating system or to any other mechanism, such as LDAP-based systems.


2.2 Stand-Alone Workstation

As a stand-alone process, SAP InfiniteInsight® is a 2-Tier architecture.

Communication between the server and the data can use ODBC, a native file system, SAP InfiniteInsight® advanced Access for SAS, SPSS, Matlab files, or a customized connection depending on the type of data being accessed.

The main advantage is the simplicity of this architecture and the absence of competition on a shared resource (such as the CPU on a server). Of course, it requires enough CPU and memory resource on the workstation. It is possible to use this architecture in conjunction with remote access file protocols (such as Samba or Windows Services for UNIX) to access remote files on a server (for example, accessing remote SAS files on a data server).

Each workstation requires its own license file (since license files are node-locked, which means they are specific for each machine on which InfiniteInsight® is installed).

The communication between the Graphical User Interface and SAP InfiniteInsight® engines is carried out through Java Native Interface (JNI) in the same process.

The advantage of this solution is:

The installation process takes only 5 minutes

The constraints for this solution include:

InfiniteInsight® will use as much CPU power as it can when a training session starts, thus making the workstation difficult to use for other applications (this is less of a problem on multi-core workstations).

One license is required for each workstation, which can be seen as a major issue when managing a large number of workstations.

SAP InfiniteInsight® upgrades must be done on each workstation.

Activity logging is spread across the workstations when several workstations are used.


2.3 Remote Access

A common alternative is to deploy the SAP InfiniteInsight® stand-alone version on a server, and allow remote access to the server through a standard remote access product such as Windows Remote Desktop, PC Anywhere, Citrix, XWin-32, Exceed, or VNC as shown in the following figure.

The advantages of this solution include:

The server CPU and memory are shared, while used resources are freed after each user session and authentication can still be performed on the server.

SAP InfiniteInsight® upgrades have to be installed only on the server.

The login accounts on the server are used to filter access to data sources.

The constraints for this solution include:

Installation of a third party remote access tool which is compliant with the security policy of the enterprise.

The third party tool must be compatible with applications developed in Java 1.6 to avoid low-level communications (bitmap transfers for example).


3 Technology

SAP InfiniteInsight® is written in C++. SAP InfiniteInsight® is provided as an API under several formats (C++ loadable library, CORBA server on all supported platforms, COM library and DCOM server for Windows).

IN THIS CHAPTER

Supported Platforms
Expected Behavior on Multi-Processor Architecture


3.1 Supported Platforms

SAP InfiniteInsight® is written in C++ and is provided as an API under several formats (C++ loadable library, CORBA server on all supported platforms, COM library and DCOM server for Windows).

The supported platforms to date are:

Platform | OS | Processors | Arch. | Access | JNI | PAM | SSL
Windows | Windows (7 SP1, 8) | Intel 32-bit (IA32) | 32-bit | x | x |  | x
Windows | Windows (7 SP1, 8, 8.1, Server 2008 R2) | Intel 64-bit (X64) | 64-bit | x | x |  | x
Sun SPARC | Solaris 11.1 and later versions | SPARC v9 | 64-bit | x | x | x | x
IBM | AIX 7.1.0 Technical Level 3 and later versions | PowerPC | 64-bit | x | x | x | x
RedHat ES6 | RedHat Enterprise Linux 6.4 and later versions | Intel 64-bit (X64) | 64-bit | x | x | x | x
SuSE 11 | SuSE Linux Enterprise Server 11 (x86_64) Patch Level 3 | Intel 64-bit (X64) | 64-bit | x | x | x | x

The last four columns of this table indicate whether the platform supports the following features:

Access InfiniteInsight® Access provides access to various external data formats such as SAS® files, SPSS®, Matlab®, Microsoft Excel files and more.

JNI Java Native Interface

SAP InfiniteInsight® provides Java wrappers on top of both the C++ library (through JNI) and the CORBA server to ease integration with J2EE environment.

SAP InfiniteInsight® supports the following Java Virtual Machine:

SUN JVM 1.6 for Client/Server installations

SUN JVM 1.6 and above for standalone installations

PAM Pluggable Authentication Module

SAP InfiniteInsight® Authenticated Server supports user authentication through the PAM service. Depending on the PAM configuration for SAP InfiniteInsight®, PAM itself can then perform authentication using various modes (UNIX passwords, Kerberos, ...).

SSL Secure Socket Layer

The communication between the Java Client and the Server can be encrypted using SSL communication instead of regular TCP networking.

Notes

See the Annex (see page 53) to find out which binary corresponds to each platform.

For all Intel 32-bit processors (IA32), we recommend large L1 / L2 cache sizes.

For Sparc processors (Sparc v8 or v9), we recommend processors with a proper Floating Point Unit. For example, we do not recommend UltraSparc T1, as it shares one FPU for 8 Processing Units.

The SAP InfiniteInsight® Connector is not available on Solaris 10 X64 platform since DataDirect is not yet supporting this platform.

SAP InfiniteInsight® is Unicode enabled, which means that data and meta-data (such as the name of columns) can be provided under native or Unicode character sets. The API is however fully Unicode based.


3.2 Expected Behavior on Multi-Processor Architecture

The behavior of an application on a multi-core architecture depends on its internal threading policy.

On a Symmetric Multi-Processing architecture (SMP, the most common architecture for multi-processor machines), the operating system runs all the threads started by an application concurrently.

At any given time, a thread runs on a single core: a single thread cannot be executed on more than one core at once. However, over the lifetime of the thread, the operating system can move it from one core to another.

3.2.1 SAP InfiniteInsight® Threading Policy

In SAP InfiniteInsight®, a predictive model is generally run in a single thread. There is one dedicated thread for each ‘model building’ process, or ‘model applying’ process.

This means that:

a single model will use only one core for learn or apply,

when running several models (learn or apply) concurrently, several cores (one for each model) can be used.

Note that in some cases, the limiting factor might be the Input/Output speed for reading data. For example, if the data are located on a remote server (file or DB access), connected through a slow network, the CPU might not be used at 100%.
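The threading policy above can be illustrated with a small calculation. The following is a minimal sketch, not part of the product: the function names are hypothetical, and the wall-time estimate assumes CPU-bound models of equal duration with no I/O bottleneck, which is a simplification.

```python
import math


def cores_in_use(concurrent_models: int, available_cores: int) -> int:
    """Each model (learn or apply) runs in one thread, so it occupies at most one core."""
    return min(concurrent_models, available_cores)


def estimated_wall_time(concurrent_models: int, available_cores: int,
                        minutes_per_model: float) -> float:
    """Rough estimate (assumption): models are processed in 'waves' of one model per core."""
    if concurrent_models < 1 or available_cores < 1:
        raise ValueError("model and core counts must be positive")
    waves = math.ceil(concurrent_models / available_cores)
    return waves * minutes_per_model


# One model on an 8-core server still uses a single core.
print(cores_in_use(1, 8))
# Twelve 30-minute models on 8 cores need two waves under this simplification.
print(estimated_wall_time(12, 8, 30))
```

Adding cores therefore helps when several models run concurrently, but does not shorten the training of one individual model.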


4 Sizing Modeling Servers

Prior to installing InfiniteInsight® on computers used for predictive analytics, it is important to understand how to size these machines according to usage patterns.

SAP InfiniteInsight® provides a sizing tool in Excel for this purpose, but it is interesting to understand the rules behind this sizing tool to be able to customize the results for each particular case.

Important Notice: Resource sizing is a difficult exercise. This tool is provided in order to present some data-based elements. Results should be analyzed and interpreted with care.

Sizing exercises provide estimates for 5 elements:

RAM size (used by all models in the modeling processes)

The temporary disk space (used when caching on disk is activated)

The disk space used to store predictive models

Transfer size between data sources and the modeling computer

Number of cores

In order to get a proper evaluation of the sizing, the team in charge of modeling needs to provide the following input:

The types of models to be built (such as classification, regression, clustering, segmentation, time series forecasting, association rules). Since SAP InfiniteInsight® models are self-contained, they contain all descriptive statistics for each variable for each internal dataset (the ‘Training’ analytical dataset is decomposed internally into partitions called ‘estimation’, ‘validation’ and an optional ‘test’ dataset). The modeling team does not have to keep any meta-data repository for their analytical objects, but, note that SAP InfiniteInsight® models take more disk space than if only the scoring equation is kept. Some models may be heavy in both disk space and memory consumption such as clustering, which keeps all cross-statistics for all input variables for all clusters.

If the modeling team policy is to build models with the optional test dataset, most models will require 1/3 more RAM and disk space to be saved, since the statistics on the test dataset are contained in the model (both in RAM and on file). In most situations, this flag should be set to 0 (the default when running model training in SAP InfiniteInsight®).

The flag called ‘Interactive’ must be set to 1 when SAP InfiniteInsight® is used from any user interfaces. This requires more RAM since, after a model training, the model is translated into a parameter tree used to transport information used in reports or any visualization panel. In most situations this flag should be set to 1.

The number of concurrent models. In simple organizations, the number of concurrent models is linked to the number of concurrent users, but some modeling teams may decide that each user may run several modeling sessions in parallel. The number of concurrent models should count not only the models managed through interactive sessions (when predictive analytics is used to discover relationships between data elements), but also through the batch sessions, which are usually run through scheduled tasks, using InfiniteInsight® Factory KxShell scripts.

The maximum use of the modeling computer taking into account the number of concurrent models, either run through interactive sessions or through batch sessions. Interactive sessions are more memory consuming than batch sessions since all information computed by the models may be used through the user interface.


Size of the analytical data set (for training and apply). The training data sets are always transferred between data sources and modeling computers. An approximate transfer size is derived using the following equation: number of rows x number of columns x 4 bytes.
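As a worked example of the transfer-size equation above, the following sketch applies it to a training data set of 500,000 rows and 1,000 columns (the common scenario quoted in the Network Requirements chapter). The function name is hypothetical; only the 4-bytes-per-value factor comes from this document.

```python
BYTES_PER_VALUE = 4  # factor used by the approximation in this document


def transfer_size_bytes(rows: int, columns: int) -> int:
    """Approximate transfer size: rows x columns x 4 bytes."""
    return rows * columns * BYTES_PER_VALUE


size = transfer_size_bytes(500_000, 1_000)
print(f"{size / 1e9:.1f} GB")  # 2.0 GB
```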

Some other inputs must be provided by the information technology team, most importantly:

Whether or not to use the cache: using the cache requires more resources on the computing server but frees a lot of bandwidth for data transfer between the data sources and the modeling computers. Customers may decide to deactivate the cache. The cache feature is activated by default on 64-bit architectures and not activated by default on 32-bit architectures. The cache impacts the RAM consumed during training since datasets are stored in memory to speed up multiple sweeps.

When activated, the cache stores data into an L1 cache (usually in memory) up to a user-specified value (set by default to 500MB). The tool considers that L1 is assigned to memory.

When a data set is larger than the L1 cache, the remaining data is stored into an L2 cache (usually in temporary disk space) up to a user-specified value (set by default to 1024 MB). The tool considers that L2 is assigned to disk.

The cache is only activated when dealing with ODBC sources.
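The split between the L1 and L2 caches can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the names are hypothetical, 1 MB is taken as 1024 x 1024 bytes, and a data set is treated as too large for the cache when it exceeds the sum of the L1 and L2 limits (the training section of this document states that the cache is then deactivated).

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: binary megabytes


@dataclass
class CachePlacement:
    cached: bool
    l1_bytes: int  # part held in memory (L1)
    l2_bytes: int  # part held in temporary disk space (L2)


def place_in_cache(dataset_bytes: int,
                   l1_limit: int = 500 * MB,
                   l2_limit: int = 1024 * MB) -> CachePlacement:
    """Fill L1 first, spill the remainder into L2, give up if both are too small."""
    if dataset_bytes > l1_limit + l2_limit:
        return CachePlacement(cached=False, l1_bytes=0, l2_bytes=0)
    l1 = min(dataset_bytes, l1_limit)
    return CachePlacement(cached=True, l1_bytes=l1, l2_bytes=dataset_bytes - l1)


print(place_in_cache(800 * MB))
```

With the default limits, an 800 MB data set would occupy 500 MB in L1 and 300 MB in L2 under this sketch.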

Future versions of the sizing tool may take into account the following flags:

32- or 64-bit architecture preferred for modeling computers: due to RAM requirements for building large models, we recommend 64-bit architecture, especially for modeling servers. 32-bit architectures should be kept for workstations. The effect on RAM sizing should be minor.

The fact that the preferred operating system flavor is Windows or UNIX: Internal customer policies will dictate if the information technology team prefers using Windows or UNIX operating systems. This said, when dealing with a modeling server for multiple clients, UNIX operating systems tend to scale better when spreading tasks and processes on multiple cores.

SAP InfiniteInsight® provides an Excel™ tool to help this sizing process. The next section describes the computations underlying this tool.

IN THIS CHAPTER

Training Phase
Apply Phase
Sizing Tool


4.1 Training Phase

4.1.1 RAM Sizing

Memory size depends on the model type (classification, regression, clustering, segmentation, time series, or association rules). Most model sizes are impacted by the size of the analytic data sets, but also by some other aspects such as the average number of categories for discrete variables. Association rules memory consumption depends heavily on the user-defined parameters, which are the support and confidence of the rules. The SAP InfiniteInsight® sizing tool makes some assumptions based on the databases available for non-regression tests.

The second important element is the activation of cache in RAM. Since version V5, SAP InfiniteInsight® provides a cache mechanism in order to minimize the data transfer between the data sources and the computing server. Besides minimizing the data transfer, data caching is also interesting when dealing with complex queries run by relational databases in order to compute on the fly the analytical data sets. When the data is cached, the complex query is run only once. This caching mechanism is useful for the training phase. Data cache sizes may be tuned from the configuration file at installation time. The default configuration depends on the modeling computer architectures:

For 32-bit architecture no data is cached in memory, and data can be cached in local temporary files when under 500 Mega-Bytes.

For 64-bit architecture, data is cached in memory up to 500 Mega-Bytes and extra data is cached in temporary files when under 2 Giga-Bytes.

These limits are provided in order not to pollute smaller configurations with data in memory or on temporary files. They can be increased for larger modeling computer configurations. When the analytical data sets used for training are larger than these configured limits, InfiniteInsight® does not cache the data and reverts to multiple sweeps over the data sets, thus increasing the network traffic, and making the training process slower. The cache is only valid when data is read from relational database sources. This is an incentive to store data in relational data sources rather than text files or SAS files.

For training models, the memory consumption can be computed as the sum of:

The model size, which depends mainly on the number of input variables (columns of the training dataset) since most of the memory is taken to hold statistics of the input variables.

The memory used by the cache in L1 (if the datasets are small enough to fit in the cache; when datasets are larger than the values provided by L1 and L2, the cache is automatically deactivated).

The memory which is taken by the parameter tree in order for the user interface to communicate with the models, and provide reports to the user (which is only true for interactive sessions).
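The sum described above can be written as a one-line formula. The sketch below is illustrative only: the function and parameter names are hypothetical, and the three component sizes must be supplied by the reader (for example from the sizing tool), since this document does not give formulas for them.

```python
def training_ram_bytes(model_bytes: int,
                       l1_cache_bytes: int,
                       parameter_tree_bytes: int,
                       interactive: bool = True) -> int:
    """Model size + L1 cache usage + parameter tree (interactive sessions only)."""
    tree = parameter_tree_bytes if interactive else 0
    return model_bytes + l1_cache_bytes + tree


# Hypothetical figures: 200 MB model, 500 MB held in L1, 100 MB parameter tree.
MB = 1024 * 1024
print(training_ram_bytes(200 * MB, 500 * MB, 100 * MB) // MB)          # interactive session
print(training_ram_bytes(200 * MB, 500 * MB, 100 * MB, False) // MB)   # batch session
```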

4.1.2 Data Transfer

When a model is trained, the data is taken from the data source and read into the SAP InfiniteInsight® modeling server. Most SAP InfiniteInsight® algorithms require several sweeps over the training data sets. This explains why the data transfer can be significant when the cache is not used. The data transfer size is equivalent to the size of the training data set when the cache is activated, but is a multiple of this value when the cache is not activated, depending on the number of sweeps of the algorithm (which is only estimated for segmentation since the number of sweeps depends upon the data itself).
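The effect of the cache on data transfer can be sketched as follows. The function name is hypothetical and the number of sweeps is an input chosen by the reader; this document does not state sweep counts per algorithm.

```python
def training_transfer_bytes(dataset_bytes: int, sweeps: int, cache_active: bool) -> int:
    """With the cache, the data set crosses the network once; without it, once per sweep."""
    if sweeps < 1:
        raise ValueError("an algorithm performs at least one sweep")
    return dataset_bytes if cache_active else dataset_bytes * sweeps


dataset = 2_000_000_000  # 500,000 rows x 1,000 columns x 4 bytes
print(training_transfer_bytes(dataset, sweeps=3, cache_active=True))
print(training_transfer_bytes(dataset, sweeps=3, cache_active=False))
```

For a hypothetical three-sweep algorithm, deactivating the cache triples the volume read from the data source.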


4.1.3 Temp Disk Space

When cache is activated and when the data set size is larger than the user specified limit for L1, the remaining part of the data set is stored into a L2 space (usually disk space).

4.1.4 Disk Space in a Year

When a large number of models need to be saved, customers must be aware of the disk space required to store all these models. The size of a model on disk is very close to its size in RAM.

4.2 Apply Phase

4.2.1 RAM Sizing

The RAM required to apply a model is the same as the RAM used by the model after training.

4.2.2 Data Transfer

In connection with ODBC sources, when InfiniteInsight® Scorer is purchased, there is no data transfer since the SQL query representing the scoring equations is sent directly to the database, which plays the role of the scoring engine. When the source is not an ODBC source or when InfiniteInsight® Scorer has not been purchased, the data corresponding to the apply data set needs to be transferred to the SAP InfiniteInsight® server.

4.2.3 Temp Disk Space

There is no temporary storage on disk during apply.

4.2.4 Disk Space in a Year

When a large number of scores need to be saved, customers must be aware of the size it takes to store all these scores. As a convention, we have taken the size of the scores equivalent to a data set with 4 columns (the identifier of the model, the identifier of the date, the identifier of the customer to be scored and the score itself).
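The four-column convention above lends itself to a quick estimate of yearly score storage. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the 4 bytes per value are borrowed from the transfer-size approximation used elsewhere in this document rather than stated for score storage.

```python
SCORE_COLUMNS = 4      # model id, date id, customer id, score
BYTES_PER_VALUE = 4    # assumption: same factor as the transfer-size approximation


def yearly_score_storage_bytes(customers: int, models: int, runs_per_year: int) -> int:
    """One 4-column row per customer, per model, per scoring run."""
    rows = customers * models * runs_per_year
    return rows * SCORE_COLUMNS * BYTES_PER_VALUE


# Hypothetical workload: 1,000,000 customers, 10 models, scored monthly.
print(yearly_score_storage_bytes(1_000_000, 10, 12))  # 1920000000
```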


4.3 Sizing Tool

SAP InfiniteInsight® provides a Microsoft Excel™ tool to help this sizing process.


5 Network Requirements

IN THIS CHAPTER

RDBMS Connectivity
Client / Server Connectivity

5.1 RDBMS Connectivity

The SAP InfiniteInsight® server has to be connected with data sources. Data sources may be files (in standard formats such as CSV, or in proprietary formats such as SAS ™), but, most often, operational data sources are implemented through data base systems (RDBMS). SAP InfiniteInsight® supports ODBC connectivity with a list of supported database vendors (Oracle, Teradata, SQL server, DB2, MySQL, Netezza, and more). As said earlier, in the preferred embodiment of SAP InfiniteInsight® installation, data transfers between such databases and SAP InfiniteInsight® servers will be for the analytical training data sets. Common scenarios involve training data sets of 500,000 lines and 1,000 columns, which are not large with respect to the bandwidth capacity of LAN.

5.2 Client / Server Connectivity

Each of the three components involved in the application (the CORBA Name Server, the SAP InfiniteInsight® Server and the Client Application) can be located on a different machine (although in most installations, both the Name Server and the SAP InfiniteInsight® Server will be located on the same Server).

The communication protocol used under the CORBA framework is TCP. The first requirements are that:

The SAP InfiniteInsight® Server must be able to access the CORBA Name Server.

The Client application must be able to access both the CORBA Name Server and the SAP InfiniteInsight® Server.

From a Network Administration point of view it means that:

The CORBA Name Server and the SAP InfiniteInsight® Server will each be assigned a specific TCP port. By default, the startup scripts set these ports to fixed values; they should be adjusted to fit the network strategy.

Both of these ports must be accessible from the Client machine (for example network firewalls should allow communication on these ports between Client machine and the servers).

In the following we will name "CORBA Name Server port" the TCP port used by the CORBA Name Server and "SAP InfiniteInsight® Server port" the TCP port used by the SAP InfiniteInsight® Server.


6 Data Access Management

InfiniteInsight® Access is a solution for accessing data in a wide variety of formats. It allows reading from and writing to SAS files, SPSS files, Minitab files, Excel files and several other file types.

ODBC

The section ODBC (on page 53) in the Data Access chapter lists the platforms that have been tested by our development team and that can be easily reproduced.

Platforms

In SAP InfiniteInsight® context, a platform is to be considered the combination of the following elements:

the client platform, where SAP InfiniteInsight® is running

the DBMS (Database Management System) platform, where the DBMS providing the data is running

the DBMS

the ODBC driver, which is the software layer exposing the data from the DBMS in a standard way

the ODBC driver manager, which is an intermediate software layer managing the ODBC drivers installed on a computer

IN THIS CHAPTER

Access Rights for Files / RDBMS
Unicode for RDBMS

6.1 Access Rights for Files / RDBMS

Since all models are built by extracting information from data, SAP InfiniteInsight® integrates several data access components. They read several file formats (SAS, SPSS, Excel, Minitab, delimited text, and fixed-length text), and they read tables and views or run SELECT statements against databases through ODBC connections (available on both Windows and UNIX platforms).

For efficiency, SAP InfiniteInsight® does not create a separate analytics data store. It reads the data from the existing data sources, and saves output and models back into the data source.


6.1.1 Rights Definition

Data source rights are defined either via the database management system or via the operating system (for flat files).

For each data source:

The right... | Allows the user to...
Read Data | read the data stored on the data store
Write Data | save data on the data store

6.1.2 Data Access Processes

When working in a client-server mode, all the data is accessed by the server. The data source (ODBC, ...) must be correctly installed (drivers, ...) and configured on the server.

Note No data access is done from the client environment, all data access is performed on the server side.

This means that the server is only able to access the data present on the server host (and not on the client machine). It is however possible to set up the client and the server on the same physical host, so that all data available on the client side is also available for the server.

When using SAP InfiniteInsight® components via a graphical interface, the data access processes are performed in a seamless manner for the user. They only have to select the data source format to be used ("flat files" or ODBC-compatible data sources) and specify the location of the source.

Note: Data access is done through the notion of a cursor (or line iterator), and SAP InfiniteInsight® provides a technique to allow integrators to write their own driver to connect SAP InfiniteInsight® to their proprietary data storage mechanism. The C Data Access API is intended for developers who want to write access code for proprietary format databases.


6.2 Unicode for RDBMS

SAP InfiniteInsight® is Unicode-enabled: data and metadata (such as the name of columns) can be provided under native or Unicode character sets. UTF-8 and UTF-16 formats are supported for input data.

A "native" file format is also supported. On UNIX platforms the native file format is the one described in the LANG environment variable. On Windows it corresponds to whatever is announced by the system or the CODEPAGE environment variable.

SAP InfiniteInsight® uses its own conversion functions to handle UTF-8/UTF-16 formats. Native character sets are handled using UNIX iconv and Windows native functions.
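The difference between the Unicode formats and the "native" format can be illustrated outside the product. The sketch below uses only the Python standard library and is not the product's conversion code; the function name and the `charset` parameter are hypothetical. It treats "native" as the encoding announced by the operating system locale, which on UNIX is derived from variables such as LANG.

```python
import locale


def decode_field(raw: bytes, charset: str = "native") -> str:
    """Decode raw bytes as UTF-8, UTF-16, or the platform's native character set."""
    if charset == "native":
        # Encoding announced by the system locale (LANG on UNIX, code page on Windows).
        encoding = locale.getpreferredencoding(False)
    elif charset in ("utf-8", "utf-16"):
        encoding = charset
    else:
        raise ValueError(f"unsupported charset: {charset}")
    return raw.decode(encoding)


column_name = "montant_dû"
print(decode_field(column_name.encode("utf-8"), "utf-8"))
print(decode_field(column_name.encode("utf-16"), "utf-16"))
```

The same column name occupies different byte sequences in each format, which is why the character set of data and metadata must be declared consistently for each data source.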

Two Unicode options can be associated to a data source:

supportUnicodeOnData describes the kind of character conversion supported by the ODBC driver/ODBC Driver Manager/DBMS in the record data.

supportUnicodeOnMeta describes the kind of character conversion supported by the ODBC driver/ODBC Driver Manager/DBMS in the object names (tables, fields, indexes).

The default values for these options follow the standard SAP InfiniteInsight® behavior related to the MultiLanguageIsDefault option:

MultiLanguageIsDefault=no (default setting): This behavior is compatible with previous versions: input/output is done in native character sets (client code page). The SAP InfiniteInsight® ODBC layer works in conjunction with mini drivers in order to read and emit characters in the client code page.

MultiLanguageIsDefault=yes: All input/output is done in Unicode (UTF-16) characters. The SAP InfiniteInsight® ODBC layer works in conjunction with mini drivers in order to read and emit characters in UTF-16 format.

The MultiLanguageIsDefault option is a global switch for the whole SAP InfiniteInsight® engine and can be overridden by the supportUnicodeOnMeta / supportUnicodeOnData options for each DSN.

To apply these settings on a standalone installation of SAP InfiniteInsight®

Update the KJWizard.cfg file located in <INSTALLATION DIR>\EXE\Clients\KJWizardJNI\.

Example

For version 6.5 SP4: C:\Program Files\SAP InfiniteInsight\II_V6.5.4\EXE\Clients\KJWizardJNI\

To apply these settings on a client/server installation of SAP InfiniteInsight®

Update the KxCORBA.cfg file located in <INSTALLATION DIR>\EXE\Servers\CORBA\.

Example

For version 6.5 SP4: C:\Program Files\SAP InfiniteInsight\II_V6.5.4\EXE\Servers\CORBA\


7 Other Software Requirements

IN THIS CHAPTER

Standalone Application Mode
Client/Server Mode

7.1 Standalone Application Mode

When deploying InfiniteInsight® as a standalone application with the Windows Client Installer – which means that InfiniteInsight® runs on the machine without having to access a server – an ODBC manager and drivers for a specific RDBMS (optional) must be installed if the data is stored in a database.

Note SAP InfiniteInsight® provides its own Java Runtime Environment (JRE 1.6) in the installation package.


7.2 Client/Server Mode

There are two ways to deploy InfiniteInsight®:

using a client installer (local installation), or

using Java Web Start (web access to the application).

According to the installation mode, the following requirements must be met:

Installation Mode | Requirements on the Server | Requirements on the Client | Note
Client Application | ODBC manager and drivers for a specific RDBMS (optional): in case the data is stored in databases. | / | SAP InfiniteInsight® provides its own Java Runtime Environment (JRE 1.6) in the installation package.
Java Web Start Client | Web Server: Windows IIS, Apache Server (http://www.apache.org), ...; ODBC manager and drivers for a specific RDBMS (optional): in case the data is stored in databases. | Java Runtime Environment (JRE) version 1.6 or above (available on the Java website (http://www.oracle.com/technetwork/java/javase/downloads/index.html)); Web browser: MS-Internet Explorer version 5.0 or above (available on the Microsoft website (http://download.microsoft.com)), ... | Users have to install a compatible version of Java Runtime Environment (JRE 1.6 or above) on the client machine.

An installation through Java Web Start may seem more demanding, however it provides additional features that will save time such as automatic updates. For more information on Java Web Start, refer to the SAP InfiniteInsight® Java Web Start Installation guide.

Note There is no need to install Java Runtime Environment (JRE) on the server side.


8 InfiniteInsight® Modeler - Data Encoding Technical Specifications

InfiniteInsight® Modeler - Data Encoding (formerly known as K2C) is a data preparation transform for building a consistent (robust) coding scheme for any attribute belonging to a training data set containing a business question (specific target attribute to analyze). For example, each possible value (category) for a nominal attribute is either discarded as not consistent, or coded as a number for later use by subsequent transforms. Each ordinal attribute is provided as its natural order or encoded with respect to the target when available. Each continuous attribute is provided as a normalized number or encoded with respect to the target when available.

InfiniteInsight® Modeler - Data Encoding brings intelligence to any OLAP system (IOLAP™) through the determination of an optimal banding and binning strategy to explain a measure of a cube structure.

8.1 Features

• Scalability: InfiniteInsight® Modeler - Data Encoding is linear in the number of lines and columns.
• Data Passes: InfiniteInsight® Modeler - Data Encoding processes the Estimation and Validation sets in a single pass for each.
• Inputs: Inputs can be ordinal, nominal or continuous.
• Targets: Targets can be ordinal, nominal or continuous.
• Results:
  • Segments for continuous values in order to build histogram and quantile information,
  • Level of under-representation below which nominal categories are discarded and collapsed into a miscellaneous class called ‘KxOther’,
  • Groups of initial segments or categories for all specified targets to achieve the best compromise between fit and robustness,
  • Quality and robustness indicators for all input variables (each input variable can be considered as a univariate model with its own quality and robustness indicators).
• Output: InfiniteInsight® Modeler - Data Encoding does not generate any specific output other than the coding of variables as requested by components such as InfiniteInsight® Modeler - Regression/Classification and InfiniteInsight® Modeler - Segmentation/Clustering.
• Parameters:
  • User Enable Compress: a Boolean flag allowing the user (when set to false) to deactivate the target-based optimal grouping performed by InfiniteInsight® Modeler - Data Encoding on this single attribute.
  • User Band Count: a number, present only for continuous attributes, allowing the user to change the number of bands (segments, bins) used to collect statistics on this attribute. The default is twenty.
  • User Enable KxOther: a Boolean flag, present only for nominal attributes, allowing the user (when set to false) to deactivate the compression into KxOther for very infrequent categories. Note that this will generally lead to non-stable data representation and coding, as well as increased memory and processor consumption.
  • Nominal Groups, Ordinal Bands, and Continuous Bands: parameters that the user can set to force a data structure. This can be used to force a drilling hierarchy, for example by segmenting age into user-defined segments to be used by SAP InfiniteInsight® modeling techniques.
  • User Modulus: a parameter that forces the band boundaries of continuous attributes to be multiples of the given value. For example, this allows the user to ensure that bands are always a multiple of 1000 when dealing with monetary values.
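
The effect of the KxOther compression and of the User Modulus parameter can be illustrated with a minimal sketch. It is not the product's algorithm, which is proprietary: the function names, the `min_count` threshold and the sample values are assumptions made for illustration only.

```python
from collections import Counter

def collapse_rare_categories(values, min_count=2, other_label="KxOther"):
    """Replace categories seen fewer than min_count times with other_label.

    min_count is a hypothetical threshold; the product derives its own.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

def snap_to_modulus(boundaries, modulus):
    """Round each band boundary to the nearest multiple of modulus, dropping duplicates."""
    return sorted({round(b / modulus) * modulus for b in boundaries})

colors = ["red", "blue", "red", "green", "blue", "mauve"]
print(collapse_rare_categories(colors))
print(snap_to_modulus([1250, 2700, 4100], 1000))
```

Here the two categories seen only once are merged into a single class, and the monetary band boundaries become multiples of 1000.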


9 InfiniteInsight® Modeler - Regression/Classification Technical Specifications

InfiniteInsight® Modeler - Regression/Classification (formerly known as K2R) trains models by implementing a mapping between a set of descriptive attributes (model inputs) and target attributes (model outputs). It belongs to the regression family of algorithms, and can be used to solve binary classification and regression mining functions. InfiniteInsight® Modeler - Regression/Classification is not a “text book” regression algorithm such as linear least squares.

It uses a proprietary algorithm, an SAP InfiniteInsight® derivation of a principle described by V. Vapnik as "Structural Risk Minimization". The returned models are expressed as a polynomial expression of the input attributes.

InfiniteInsight® Modeler - Regression/Classification also allows the specification of a weight attribute for each training row in order to adapt the cost function to the user requirements. By default, without a weight attribute, each training row is considered to be of equal value. The output model can be analyzed in terms of attribute contributions, which weigh the relative importance of each input attribute.

InfiniteInsight® Modeler - Regression/Classification can be used in any Attribute Importance function. It brings intelligence to any OLAP system (IOLAP™) through the determination of the important dimensions that can be used to explain a measure of a cube structure.

9.1 Features

• Scalability: The behavior of InfiniteInsight® Modeler - Regression/Classification is linear with the number of lines and linear with the number of input attributes of the training set. However, it is combinatorial with respect to the degree of the polynomial since higher polynomial degrees greatly increase the number of input attributes.
• Data Passes: InfiniteInsight® Modeler - Regression/Classification requires two passes on the Estimation data set, and one pass on the Validation data set.
• Inputs: Inputs can be ordinal, nominal or continuous.
• Targets: Targets can be binary nominal or continuous.
• Results:
  • Attribute importance in order to quantify the relative importance of each input in predicting the targets.
  • The quality and robustness indicators of the estimation of the targets as well as some common measures (such as classification rate for classification tasks or mean square error or Pearson coefficient for the regression case).
• Outputs:
  • Estimation of continuous targets,
  • Decision and/or probabilities associated with binary classification.
  • InfiniteInsight® Modeler - Regression/Classification can also generate an outlier indicator when the target is known and statistically far from the estimation of this target.
  • InfiniteInsight® Modeler - Regression/Classification can also generate error bars for continuous estimates.
• Parameter: Polynomial Order: User can specify a polynomial order greater than 1.
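
To give a feel for the combinatorial growth mentioned under Scalability, the sketch below counts the monomials of a full polynomial expansion. It assumes a complete expansion of all terms up to the given degree; how the product actually builds and prunes polynomial terms is not described here, and `polynomial_term_count` is a name introduced for this illustration.

```python
from math import comb

def polynomial_term_count(n_inputs, degree):
    """Number of monomials of total degree <= degree in n_inputs variables.

    The constant term is included. This is the standard count C(n + d, d).
    """
    return comb(n_inputs + degree, degree)

# Growth for a hypothetical training set with 50 input attributes.
for degree in (1, 2, 3):
    print(degree, polynomial_term_count(50, degree))
```

With 50 inputs the count goes from 51 terms at degree 1 to 1326 at degree 2 and 23426 at degree 3, which is why the polynomial order should be raised with care.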

9.2 Notes

1 Multi-nominal targets are not yet directly supported. A user can create a disjunctive coding of a multi-nominal target with as many Boolean attributes as there are categories, train a model with all these attributes as targets, and combine the probability outputs to make a final classification.

2 Ordinal targets are accepted by the algorithm but the proper debriefing in the form of a confusion matrix (leading to classification rates) is not yet directly supported.
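
The workaround in note 1 can be sketched as follows. The helper names are hypothetical, and picking the category with the highest probability is only one possible way of combining the outputs; the note does not prescribe a specific rule.

```python
def disjunctive_coding(target_values):
    """Return the sorted categories and one Boolean column per category."""
    categories = sorted(set(target_values))
    columns = {c: [v == c for v in target_values] for c in categories}
    return categories, columns

def combine_probabilities(probabilities_by_category):
    """For one row, pick the category whose binary model gave the highest probability."""
    return max(probabilities_by_category, key=probabilities_by_category.get)

categories, columns = disjunctive_coding(["gold", "silver", "gold", "bronze"])
print(categories)        # one binary target per category
print(columns["gold"])   # Boolean target column for the category "gold"

# Probabilities as they might be returned by three binary models for one row.
print(combine_probabilities({"gold": 0.2, "silver": 0.7, "bronze": 0.4}))
```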


10 InfiniteInsight® Modeler - Segmentation/Clustering Technical Specifications

InfiniteInsight® Modeler - Segmentation/Clustering (formerly known as K2S) trains models by implementing a mapping between a set of descriptive attributes (model inputs) and the ID (model output) of one of several clusters/segments computed by the system. It belongs to the family of clustering/segmentation algorithms for training descriptive models. The goal of these models is to gather similar data into groups. The question of similarity is discussed below.

The current version of InfiniteInsight® Modeler - Segmentation/Clustering first builds prototypes in order to minimize the intra-distances of cases within clusters and maximize the inter-distances between different clusters. This notion of distance can be based on the input distributions when no target is provided. Note, however, that InfiniteInsight® Modeler - Segmentation/Clustering is more powerful when used for ‘supervised segmentation’. In this case a target is used to encode all inputs and provide a notion of distance which is meaningful for the application. As with InfiniteInsight® Modeler - Regression/Classification, the target is any attribute relevant to the user's business, for example the purchase amount for a customer, the response to a marketing campaign, or the fact that an individual churned in the last two months.

InfiniteInsight® Modeler - Segmentation/Clustering uses a derivation of SRM to compute a short logical expression for each cluster through a feature called “SQL Expressions”. For example a cluster may be defined as "age <= 35 AND marital-status in ['Divorced']". This has several advantages:

• Logical expressions are generally very easy and natural to interpret.
• The segmentation process is easier to integrate in operational environments such as relational databases through SQL.
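
As an illustration of how such logical expressions can be carried into a database, the sketch below turns a list of cluster rules into a SQL CASE expression and applies the same rules in Python. The cluster definitions, the column name `marital_status` and both helper functions are assumptions for this example; they do not reproduce the SQL that the product generates.

```python
# Hypothetical cluster definitions: (cluster id, SQL condition, equivalent Python predicate).
CLUSTERS = [
    (1, "age <= 35 AND marital_status IN ('Divorced')",
     lambda r: r["age"] <= 35 and r["marital_status"] in ("Divorced",)),
    (2, "age > 35",
     lambda r: r["age"] > 35),
]

def to_sql_case(clusters, default=0):
    """Build a SQL CASE expression returning the id of the first matching cluster."""
    whens = "\n".join(f"  WHEN {cond} THEN {cid}" for cid, cond, _ in clusters)
    return f"CASE\n{whens}\n  ELSE {default}\nEND"

def assign_cluster(record, clusters, default=0):
    """Apply the same rules in Python; the first matching rule wins."""
    for cid, _, predicate in clusters:
        if predicate(record):
            return cid
    return default

print(to_sql_case(CLUSTERS))
print(assign_cluster({"age": 30, "marital_status": "Divorced"}, CLUSTERS))
```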

10.1 Features

• Scalability: The behavior of InfiniteInsight® Modeler - Segmentation/Clustering is linear with the number of lines, more than linear with the number of columns.
• Data Passes: InfiniteInsight® Modeler - Segmentation/Clustering processes data with 4 sweeps on the Estimation data set and one sweep on the entire Training data set. When the ‘segmentation’ mode is selected, the number of passes is proportional to the longest statement found in a cluster expression (rarely above 7).
• Inputs: Inputs can be ordinal, nominal or continuous.
• Targets: Targets are optional and can be binary nominal or continuous.
• Results:
  • Short logical expressions for each cluster (when ‘segmentation’ mode is chosen).
  • Global statistics for each cluster: all descriptive statistics between each cluster and each selected input.
  • Frequency: percentage of population gathered in the cluster.
• Results provided when supervised:
  • % of 'label' in classification case (binary target): percentage of label in the cluster, where label is the least frequent category of the binary target.
  • Target Mean in regression case (continuous target): mean value of the target for data assigned to the cluster.
  • The KI and KR that can be associated with the cluster ID.
• Output: Cluster or segment indices in different formats.
• Parameters:
  • Number of Clusters. The user must specify the number of clusters.
  • Type of distance internally used: city-block, Euclidean, or absolute difference.
  • The encoding strategy may be tuned in some ways.

10.2 Notes

1 Multi-nominal targets are not yet directly supported. A user can create a disjunctive coding of a multi-nominal target with as many Boolean attributes as there are categories, and train a model with all these attributes as targets.

2 Ordinal targets are accepted by the algorithm but the proper debriefing in the form of a confusion matrix (leading to classification rates) is not yet directly supported.


11 InfiniteInsight® Modeler – Time Series Technical Specifications

InfiniteInsight® Modeler – Time Series (formerly known as KTS) lets you train predictive models from data representing time series. With InfiniteInsight® Modeler – Time Series models, you can:

• Identify and understand the nature of time series through trends and cycles.
• Forecast the evolution of a time series in the short and medium term, that is, predict its future values.

InfiniteInsight® Modeler – Time Series breaks a time series into four components:

Trend: The trend represents the evolution of a time series over the period analyzed. The trend is represented either by a function of time or by signal differentiating, which is calculated in InfiniteInsight® Modeler – Time Series using the principle that a new value can be predicted based on only the previous known value. Calculating the trend allows InfiniteInsight® Modeler – Time Series to build a stationary representation of the signal (that is, the time series does not increase or decrease any more). This stationary representation is essential for the analysis of the three other components.

Cycles: The cyclicity describes the recurrence of a variation in the signal. It is important to distinguish calendar time from natural time. These two time representations are often out of phase. The former - which is referred to as seasonality - represents dates (day, month, year and so on), while the latter - which is referred to as periodicity - represents a continuous time (1, 2, 3 and so on).

Fluctuations: Fluctuations represent disturbances that affect a time series. In other words a time series does not only depend on external factors but also on its last states (memory phenomena). InfiniteInsight® Modeler – Time Series tries to explain parts of the fluctuations by modeling them on past values of the time series (ARMA or GARCH models).

Information Residue: The information residue is the information that is not part of the trend, cycles, or fluctuations. As such, predictive models generated by InfiniteInsight® Modeler – Time Series are characterized only by the first three components - trend, cycles and fluctuations.
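
The idea of removing a trend by differencing, where each value is explained from the previous known value, can be sketched as follows. This is a generic illustration and not the model selection performed by the product; `difference` and `naive_forecast` are names introduced here.

```python
def difference(series):
    """First-order differencing: y[t] - y[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]

def naive_forecast(series, steps):
    """Extend the series by repeatedly adding the mean step of the differenced signal."""
    diffs = difference(series)
    mean_step = sum(diffs) / len(diffs)
    forecasts, last = [], series[-1]
    for _ in range(steps):
        last += mean_step
        forecasts.append(last)
    return forecasts

signal = [10, 12, 14, 16, 18]
print(difference(signal))         # the differenced signal no longer increases
print(naive_forecast(signal, 2))  # two forecasts beyond the last known value
```

For this linearly increasing signal the differenced series is constant, which is the stationary representation the trend step aims for.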

11.1 Features

• Scalability: InfiniteInsight® Modeler – Time Series is usually used on small data sets.
• Data Passes: InfiniteInsight® Modeler – Time Series internally computes many models that are compared for best results. This leads to between 6 and 10 passes on Estimation and more than 12 passes on Validation, depending on the number of internal cycles found.
• Inputs: InfiniteInsight® Modeler – Time Series needs an ordered data set with values associated with a column of type date, date-time or number. Extra input variables can be specified.
• Targets: The signal to forecast must be continuous.
• Results:
  • Signal components as trends, cycles, and seasons.
  • Internal contribution of lagged variables when such a model is chosen.
• Outputs:
  • Forecasts at the selected horizon.
  • Error bars around the forecasts.
  • Forecast decomposition.
• Parameters:
  • Number of Forecasts: Users must specify how many forecasts they want to produce.
  • Explanatory Attributes: Users can specify extra inputs used as explanatory attributes.

11.2 Notes

1 When using explanatory attributes, several rows of data must be provided with non-blank input values beyond the last row used for training. The number of extra rows corresponds to the number of forecasts requested.

2 The current version of InfiniteInsight® Modeler – Time Series does not accept missing values for the signal.

3 Dates to be associated with the forecasts can either be provided in the apply data set or be generated approximately by InfiniteInsight® Modeler – Time Series.

4 There is no formal constraint requiring each line to be associated with dates (or times) with a constant period, but this is the intent of the underlying algorithms.


12 InfiniteInsight® Modeler - Association Rules Technical Specifications

InfiniteInsight® Modeler - Association Rules (formerly known as KAR) generates association rules. Association rules provide clear and useful results, especially for market basket analysis. They bring to light the relations between products or services and immediately suggest appropriate actions. Association rules are used in exploring categorical data, also called items. Items belong to Transactions.

The strengths of InfiniteInsight® Modeler - Association Rules are:

• to produce clear and understandable results,
• to support unsupervised data mining (no target attribute),
• to explore very large data sets thanks to its ability to first generate rules on parts of the data set before aggregating them (exploration by chunks),
• to generate only the more relevant rules (also called primary rules).

Once rules are generated, they can be used in apply mode in order to generate items that could be included in the transactions.

The SAP InfiniteInsight® implementation uses a third generation algorithm (the ‘Apriori’ algorithm belongs to the first generation and the ‘FP-Tree’ algorithm to the second) which can be used to generate only meaningful rules, where other techniques return many redundant rules. This allows for both scalability and a minimal number of generated rules without loss of information.

12.1 Features

• Scalability: We have developed an incremental version of our algorithm, which greatly reduces the memory consumption.

• Data Passes: One pass on the transaction data set, two passes when the incremental version is activated.

• Inputs: InfiniteInsight® Modeler - Association Rules needs an Events data set with transaction identifiers and item identifiers. The items will be used to generate the rules. InfiniteInsight® Modeler - Association Rules also needs a Training data set that contains the ticket identifiers that will be used to train the system and build the rules.

• Results:
  • Rule expression with corresponding quality indicators for each rule
  • Descriptive statistics on the items and transactions.
• Output: Recommendations which can be seen as items associated with probabilities.
• Parameters:
  • Confidence allows filtering rules by the probability associated with the consequent.
  • Support allows filtering rules by their frequency.
  • Size of rules.
  • Possibility to deactivate SAP InfiniteInsight® specific optimization techniques to compare with other environments.
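
The Support and Confidence parameters can be illustrated with the usual textbook definitions. The sketch below assumes these standard definitions; the exact normalization used by the product may differ, and the sample baskets are invented.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "jam"},
    {"milk"},
]
print(support(baskets, {"bread", "butter"}))       # frequency of the rule's items
print(confidence(baskets, {"bread"}, {"butter"}))  # rule: bread => butter
```

A rule would then be kept only if both values exceed the thresholds chosen by the user.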


12.2 Notes

When the incremental option is selected, the rows of the data set must be grouped by transaction identifier. The system exits with an error if all items of the same transaction are not provided contiguously.
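
A data set can be checked for this constraint before training. The following sketch is a hypothetical pre-check written for this document, not a product utility: it reports the first transaction identifier that reappears after its group of rows has ended.

```python
def first_non_contiguous_transaction(transaction_ids):
    """Return the first transaction id that reappears after its group ended, or None."""
    seen = set()
    previous = object()  # sentinel that never equals a real identifier
    for tid in transaction_ids:
        if tid != previous:
            if tid in seen:
                return tid
            seen.add(tid)
            previous = tid
    return None

print(first_non_contiguous_transaction(["t1", "t1", "t2", "t3", "t3"]))
print(first_non_contiguous_transaction(["t1", "t2", "t1"]))
```

The check needs a single pass and keeps only the set of identifiers already seen, so it can be run on the transaction column alone.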


13 InfiniteInsight® Explorer - Event Logging Technical Specifications

The purpose of InfiniteInsight® Explorer - Event Logging (formerly known as KEL) is to build a mineable representation of event history. For example, InfiniteInsight® Explorer - Event Logging can be used to represent RFA (Recency-Frequency-Amount) views of a customer based on purchase history.

It is not a data mining algorithm but a data preparation transform. As discussed above, the input data for the regression, classification and clustering transforms require a single data set with a fixed number of attributes. However, commonly a customer is associated with several events (purchase history for example) with a variable number for every customer. This list of events across several rows must be translated into a single row with a fixed number of attributes. These types of operations are called pivoting in data mining because they translate information contained in the same attribute of different rows into different attributes on a single row (for a given customer identifier for example). InfiniteInsight® Explorer - Event Logging can be used to represent any time-date stamped history, such as the history of a customer, or the history of a log of defects associated with a machine in a network.

This component merges static information (single value per ID) and dynamic information (multiple values per ID). The user must have these two data sets before using the component. The data set containing static information is generally called the "reference" data set, and it is associated in the models with the classical data set names or roles such as Training, Estimation, Validation, Test or ApplyIn. The data set containing the log of events (sometimes called the "transactions" table) is associated with a label beginning with the string "Events". InfiniteInsight® Explorer - Event Logging is said to build coarse grain representations as it summarizes the events into different periods of interest.

13.1 Features

• Scalability: The scalability of InfiniteInsight® Explorer - Event Logging is linked to the memory of the server because a temporary version of all aggregates is created.

• Data Passes: One pass on the Events data set.
• Inputs: InfiniteInsight® Explorer - Event Logging needs an Events data set with some continuous attributes to be aggregated, such as amounts. For example, each row could represent the fact that a customer has bought a product at this time for this amount. Possible types of aggregations are minimum, maximum, sum, average, and count. InfiniteInsight® Explorer - Event Logging also needs a Training data set that contains the identifiers that will be used to group the Events (in the previous example, it could be the customer identifier).
• Output: Aggregates as specified by the user
• Parameters:
  • Profile of the period of aggregations
  • Aggregation functions for each period and between periods.
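
The pivoting performed by this component can be sketched as follows: event rows are grouped by identifier and period, and each (period, aggregation) pair becomes one attribute of a single output row. The function, the attribute naming scheme (`count_p1`, `sum_p1`) and the fixed-length periods counted back from a reference date are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date

def aggregate_events(events, reference_date, period_days, n_periods):
    """Pivot (customer_id, event_date, amount) rows into one dict of aggregates per customer.

    Period 1 is the most recent period_days before reference_date, period 2 the one before it.
    Customers without events in the covered periods get no entry.
    """
    rows = defaultdict(dict)
    for customer_id, event_date, amount in events:
        age = (reference_date - event_date).days
        if age < 0:
            continue  # event later than the reference date
        period = age // period_days + 1
        if period > n_periods:
            continue  # event older than the periods of interest
        row = rows[customer_id]
        row[f"count_p{period}"] = row.get(f"count_p{period}", 0) + 1
        row[f"sum_p{period}"] = row.get(f"sum_p{period}", 0.0) + amount
    return dict(rows)

events = [
    ("c1", date(2014, 10, 25), 30.0),
    ("c1", date(2014, 10, 5), 20.0),
    ("c1", date(2014, 9, 20), 50.0),
    ("c2", date(2014, 10, 30), 5.0),
]
print(aggregate_events(events, date(2014, 11, 1), period_days=30, n_periods=2))
```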


14 InfiniteInsight® Explorer - Sequence Coding Technical Specifications

The purpose of InfiniteInsight® Explorer - Sequence Coding (formerly known as KSC) is to build a mineable representation of event history. For example, InfiniteInsight® Explorer - Sequence Coding can be used to represent web log sessions. It is able to represent each session as both the count of pages and the transitions between pages (or meta-information about the pages).

It is not a data mining algorithm but a data preparation transform. As discussed above, the input data for the regression, classification and clustering transforms require a single data set with a fixed number of attributes. However, commonly a customer is associated with several events (purchase history for example) with a variable number for every customer. This list of events across several rows must be translated into a single row with a fixed number of attributes. These types of operations are called pivoting in data mining because they translate information contained in the same attribute of different rows into different attributes on a single row (for a given customer identifier for example). InfiniteInsight® Explorer - Sequence Coding can be used to represent any time-date stamped history, such as the history of a customer, or the history of a log of defects associated with a machine in a network.

This component merges static information (single value per ID) and dynamic information (multiple values per ID). The user must have these two data sets before using the component. The data set containing static information is generally called the "reference" data set, and it is associated in the models with the classical data set names or roles such as Training, Estimation, Validation, Test or ApplyIn. The data set containing the log of events (sometimes called the "transactions" table) is associated with a label beginning with the string "Events". InfiniteInsight® Explorer - Sequence Coding is said to build fine-grained representations as it summarizes the count of different events or even the transitions between different events for a given reference object.

14.1 Features

• Scalability: The scalability of InfiniteInsight® Explorer - Sequence Coding is linked to the memory of the server because a temporary version of all aggregates is created.

• Data Passes: Two passes on the Events data set.
• Inputs: InfiniteInsight® Explorer - Sequence Coding needs an Events data set with some discrete attributes representing events to be counted or for which transitions will be counted. For example, each row could represent a click on a given Web page. InfiniteInsight® Explorer - Sequence Coding also needs a Training data set that contains the identifiers that will be used to group the events (in the previous example, it could be the web session identifier).
• Output: Count on each selected type of event or transitions between events.
• Parameters:
  • List of selected events used to filter
  • Specific types of counts or transition counts for events
  • Flag indicating if each transaction should be encoded or only the entire sessions.
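
The two kinds of counts can be sketched for web sessions as follows. The function name and the sample clicks are invented for the example, and the events are assumed to be already ordered in time within each session.

```python
from collections import Counter, defaultdict

def encode_sessions(events):
    """Count events and event-to-event transitions per session.

    events: iterable of (session_id, page) tuples, ordered in time within each session.
    """
    counts = defaultdict(Counter)
    transitions = defaultdict(Counter)
    last_page = {}
    for session_id, page in events:
        counts[session_id][page] += 1
        if session_id in last_page:
            transitions[session_id][(last_page[session_id], page)] += 1
        last_page[session_id] = page
    return counts, transitions

clicks = [
    ("s1", "home"), ("s1", "search"), ("s1", "home"),
    ("s2", "home"), ("s1", "cart"),
]
counts, transitions = encode_sessions(clicks)
print(dict(counts["s1"]))
print(dict(transitions["s1"]))
```

Each distinct page and each distinct transition would then become one attribute of the row describing the session.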


15 InfiniteInsight® Explorer - Text Coding Technical Specifications

InfiniteInsight® Explorer - Text Coding (formerly known as KTC) is a solution for Text Analytics. It automatically prepares and transforms unstructured text attributes into a structured representation to be used within the SAP InfiniteInsight® modeling components.

InfiniteInsight® Explorer - Text Coding automatically handles the transformation from unstructured data to structured data through a process that involves removing “stop words”, merging sequences of words declared as “concepts”, translating each word into its root through “stemming” rules, and merging synonyms. InfiniteInsight® Explorer - Text Coding allows text fields to be used “as is” in classification, regression, and clustering tasks. It comes packaged with rules for several languages such as French, German, English and Spanish, and can be easily extended to other languages.
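
The order of these steps can be made concrete with a toy sketch. The word lists and suffix rules below are invented and far simpler than the language rules shipped with the product; only the sequence of steps (stop words, concepts, stemming, synonyms) follows the description above.

```python
import re

# All lists below are hypothetical examples, not the packaged language files.
STOP_WORDS = {"the", "a", "is", "are", "to", "my", "and"}
CONCEPTS = {("credit", "card"): "credit_card"}
SYNONYMS = {"invoice": "bill"}
SUFFIXES = ("ing", "ed", "s")  # toy stemming rules

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def text_to_roots(text):
    """Apply stop word removal, concept merging, stemming and synonym merging in that order."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    merged, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in CONCEPTS:
            merged.append(CONCEPTS[pair])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    roots = [w if "_" in w else stem(w) for w in merged]  # concepts are not stemmed
    return [SYNONYMS.get(r, r) for r in roots]

print(text_to_roots("My credit card is blocked and the invoices are missing"))
```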

InfiniteInsight® Explorer - Text Coding improves the quality of predictive models by taking advantage of previously unused text attributes. For example, messages, emails sent to a support line, marketing survey results, or call center chats can be used to enhance the results of models for cross-sell or attrition.

15.1 Features

• Scalability: The behavior of InfiniteInsight® Explorer - Text Coding is linear with the number of lines of the data set. The average size of the texts contained will influence the computing time.

• Data Passes: If InfiniteInsight® Explorer - Text Coding has to recognize the language, a first pass over the data set is performed. One pass on the training data set is then made to create a dictionary for every textual variable. If the data set has more than one textual variable, the number of passes does not grow with the number of textual variables to process.

• Inputs: The data set must contain at least one variable containing text, which must be declared with storage string and value textual.

• Results:
  • Language recognized for all the textual variables (one language for all)
  • Dictionary containing all the selected roots for every textual variable
  • Statistics on Stemming Rules usage
  • Statistics on every root of every dictionary
• Outputs:
  • Language Recognition for each line: Adds a variable that indicates the recognized language for every text
  • Vectorization: Adds variables corresponding to the text representation in the dictionary. The type of the value depends on the Encoding parameter
  • Generate only Roots: Only displays variables corresponding to the text representation in the dictionary. The type of the value depends on the Encoding parameter
  • Transactional Mode: Creates a transactional file that has, for every text, X lines corresponding to the roots of the text in order of appearance.
• Parameters:
  • Repository containing the language files
  • List of excluded languages
  • Language recognition
  • Possibility for the user to specify the language
  • Language processing options
  • Encoding parameters for vectorization


16 InfiniteInsight® Explorer - Semantic Layer Technical Specifications

SAP InfiniteInsight® provides a module to edit, save, and retrieve data manipulations as described in the document Data Manipulation: Use Case Scenarios. When data stores (directories or ODBC sources) are associated with a repository containing data manipulations, these connectors appear as regular files or tables and can be used directly (like other data) to train or apply models.

One of the useful features of InfiniteInsight® Explorer - Semantic Layer is the ability to declare arguments. Arguments are symbols with associated values that can be changed before executing the data manipulations. They can be used anywhere within InfiniteInsight® Explorer - Semantic Layer.

InfiniteInsight® does not offer a special engine to execute these data manipulations, since they can all be performed by the standard SQL engines embedded in all major relational databases. Instead, InfiniteInsight® Explorer - Semantic Layer can be seen as an object-oriented layer that generates data manipulation statements in SQL, which are in turn processed by the database server.
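
The combination of arguments and SQL generation can be pictured with a small sketch. The table, the column names and the two arguments `snapshot_date` and `min_amount` are hypothetical, and the SQL text is not what the product emits; the point is only that a stored manipulation is rendered to SQL once argument values are supplied.

```python
from string import Template

# Hypothetical stored data manipulation with two arguments.
MANIPULATION = Template(
    "SELECT customer_id, SUM(amount) AS total_amount\n"
    "FROM purchases\n"
    "WHERE purchase_date < DATE '$snapshot_date' AND amount >= $min_amount\n"
    "GROUP BY customer_id"
)

def render_sql(arguments):
    """Substitute argument values into the statement.

    Template.substitute raises KeyError when an argument is missing.
    Plain text substitution is used for illustration only; production code
    should pass values as bound parameters of the database driver.
    """
    return MANIPULATION.substitute(arguments)

print(render_sql({"snapshot_date": "2014-11-01", "min_amount": 100}))
```

Changing an argument value and rendering again yields a new statement for the database server, without editing the manipulation itself.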

16.1 Features

• Results:
  • Time-stamped population (snapshots of the entities at a given time)
  • Filtered Time-Stamped Population
  • Cross Product Time-Stamped Population
  • Compound Time-Stamped Population
  • Temporal analytical data sets
• Outputs:
  • Data manipulation aggregates (based on conditions, expressions, string or date manipulations, etc.)
  • Possibility to create targets on the fly
• Parameters:
  • Entity (the object of interest targeted by an analytical task).
  • Analytical Record (logical view of all attributes corresponding to an entity)
  • Possibility to define domains in the analytical record (group of attributes describing a homogeneous section of an entity).
  • Time stamp variable prompts
  • Business performance indicators (a signal associating dates with one or several metrics).
  • Repository containing the metadata


17 InfiniteInsight® Social Technical Specifications

The purpose of InfiniteInsight® Social is to build attributes that can augment customer profiles (or any other object of interest) derived from graph structures. These graph structures can be extracted from an event history, such as contacts between customers or employees or transactions linking customers and products. For example, in the telecommunications industry, InfiniteInsight® Social can be used to build different social networks based on Call Detail Records.

In this sense, InfiniteInsight® Social should be seen as a data preparation transform used to extract information from graph structures to generate a fixed number of derived attributes. InfiniteInsight® Social can be used to build a set of graphs and to extract properties for each customer, by analyzing its connectivity and profiling its neighborhood within the graph set. InfiniteInsight® Social also has an algorithmic aspect linked to the automated discovery of communities within graphs and the computation of derived attributes linked to statistics obtained on these communities (such as the density of the community associated with each customer).

This module can merge static information (single value per identifier) and dynamic information (multiple values per identifier). The data set containing the interaction events is called the Link Data Set and it contains information used to build the links (or edges) for the graph set. The optional data set containing static information is called the Node Decoration Data Set, as it provides information on entities that will be the graph nodes. Finally, an additional Identifier Conversion Data Set can be added to join the Link and the Node Decoration data sets together, if their respective identifiers are not the same. It is a two-column data set that translates identifiers from the Link Data Set into identifiers of the Node Decoration Data Set (for example, phone line numbers and client identifiers).

InfiniteInsight® Social can be called a Social Network Analysis component as it offers a way to get a view on social interactions hidden in raw data and to derive attributes from these structures.

17.1 Features

Scalability: The scalability of InfiniteInsight® Social is linked to the memory of the server as all graphs are stored in the main memory. This explains why it is recommended to use InfiniteInsight® Social only on 64-bit servers with a fair amount of RAM.

Data Passes: One pass on the Link Data Set.

Inputs: InfiniteInsight® Social needs a Link Data Set with at least two columns (source node and target node) to build the graph set. The Node Decoration Data Set is optional and allows InfiniteInsight® Social to aggregate properties (means, mode, and profile) for a given node neighborhood. The Identifier Conversion Data Set is also optional and can be specified for convenience.

Outputs:
• Connection Analysis: information on the degree centrality (number of neighbors)
• Circle Analysis: aggregates properties of direct neighbors
• Centrality Analysis: information on a node's influence potential (computed through mathematical metrics)
• Neighbors Mode: lists the neighbors of a node


Parameters: The graph loading parameters allow building multiple graphs from a single Link Data Set by using different filtering mechanisms such as creating a graph per period of time or per type of interaction.
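The idea of building several graphs from one Link Data Set, then deriving per-node attributes from each, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the event tuples, the `ages` attribute, and the helper names are hypothetical, graphs are treated as undirected, and the two helpers only approximate the degree (Connection Analysis) and neighbor-aggregate (Circle Analysis) outputs described above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical link events: (source, target, interaction type).
events = [
    ("A", "B", "call"), ("A", "C", "call"),
    ("B", "C", "sms"), ("A", "D", "sms"),
]
ages = {"A": 30, "B": 40, "C": 50, "D": 20}  # hypothetical node decoration


def graphs_by_type(rows):
    """Build one undirected graph per interaction type from a single event list."""
    graphs = defaultdict(lambda: defaultdict(set))
    for source, target, kind in rows:
        graphs[kind][source].add(target)
        graphs[kind][target].add(source)
    return graphs


def degree(graph, node):
    """Number of neighbors of a node (degree centrality)."""
    return len(graph.get(node, ()))


def neighbor_mean(graph, node, values):
    """Mean of an attribute over the direct neighbors of a node."""
    neighbors = graph.get(node, ())
    return mean(values[n] for n in neighbors) if neighbors else None


graphs = graphs_by_type(events)
```

Filtering by period of time would follow the same pattern, with the graph key derived from an event date instead of the interaction type.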


18 Geographic Location Support

Geographic location support provides a way to associate latitude and longitude variables in input datasets as a composite variable of type position (representing a point on the geosphere) that can then be used by InfiniteInsight Modeler and InfiniteInsight Social.

Given a dataset in which events contain geographic information, InfiniteInsight computes the geographical surface areas (expressed as tiles) containing those events. The following are calculated for each tile:

• weighted count (the number of rows in the dataset that correspond to a point situated in the tile)
• frequency (the percentage of rows in the tile)
• density (the ratio between the surface of the tile and the weighted count)

When specifying a target, every tile is characterized by the density of events and also by the proportion of targeted events in the tile.
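The per-tile statistics can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the product: a tile is modeled as a fixed-size cell in degrees (`TILE_SIZE_DEG`), whereas the product expresses tile size in meters; every row has a weight of 1; and density is left out because it depends on how the tile surface is measured. The function and field names are hypothetical.

```python
import math
from collections import defaultdict

TILE_SIZE_DEG = 0.1  # hypothetical tile edge, in degrees

# Hypothetical events: (latitude, longitude, binary target).
events = [
    (48.85, 2.35, 1), (48.86, 2.34, 0), (48.87, 2.38, 1), (45.76, 4.83, 0),
]


def tile_of(lat, lon, size=TILE_SIZE_DEG):
    """Map a position to the grid cell (tile) that contains it."""
    return (math.floor(lat / size), math.floor(lon / size))


def tile_stats(rows):
    """Return count, frequency (percent of rows) and target proportion per tile."""
    counts = defaultdict(int)
    targets = defaultdict(int)
    for lat, lon, target in rows:
        key = tile_of(lat, lon)
        counts[key] += 1
        targets[key] += target
    total = len(rows)
    return {
        key: {
            "count": n,
            "frequency": 100.0 * n / total,
            "target_rate": targets[key] / n,
        }
        for key, n in counts.items()
    }


stats = tile_stats(events)
```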

Tile information can be exported in formats compatible with geographic information systems (of type GeoServer) to be visualized using software such as Google Earth.

Colocation capabilities detect transactions or events occurring in the same location during the same period of time. Frequent path analysis extracts a succession of tiles that are frequently followed by items or events. InfiniteInsight Social provides specialized workflows to easily do a colocation or frequent path analysis.
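A colocation test of this kind can be sketched in Python as a pairwise comparison: two events are colocated when they involve different entities, fall in the same tile, and are separated by no more than a time threshold. The event tuples, the 30-minute default, and the function name are assumptions made for illustration; the pairwise loop is quadratic and is not meant to reflect the product's optimized implementation.

```python
from datetime import datetime, timedelta
from itertools import combinations

# Hypothetical events: (entity, tile identifier, timestamp).
events = [
    ("card_1", "T1", datetime(2014, 11, 3, 9, 0)),
    ("card_2", "T1", datetime(2014, 11, 3, 9, 20)),
    ("card_3", "T1", datetime(2014, 11, 3, 15, 0)),
    ("card_2", "T2", datetime(2014, 11, 3, 9, 5)),
]


def colocations(rows, max_gap=timedelta(minutes=30)):
    """Return sorted (entity, entity, tile) triples for events close in space and time."""
    pairs = set()
    for (e1, t1, d1), (e2, t2, d2) in combinations(rows, 2):
        if e1 != e2 and t1 == t2 and abs(d1 - d2) <= max_gap:
            pairs.add((min(e1, e2), max(e1, e2), t1))
    return sorted(pairs)
```

With the sample data, only `card_1` and `card_2` meet in tile `T1` within the threshold; `card_3` is in the same tile but hours later.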

18.1 Features

Input: A dataset containing:
• Geographic data: longitude and latitude variables (continuous) with storage type angle and a composite variable of type position created from the latitude and longitude variables
• An entity representing the geo-located item (an event or transaction, for example)
• A date associated with the geo-located item (optional)

Targets: Targets are optional and can be binary or continuous.

Results:
• Tiles (a defined surface area) covering the events in a dataset
• Tile density (frequency of events per tile)
• In InfiniteInsight Social, a bipartite graph of type proximity. The graph links entities to the tiles in which events involving these entities took place. A link represents the event that links the entity to the tile in which that event occurred. If a date variable is selected, the date result is contained in the weight of the link.
• KML (Keyhole Markup Language) file
• Shapefile that is compatible with web feature service (WFS) to allow requests via a web server to manipulate and retrieve geographic data
• URI to send segments to a geographic information system (GIS) and to launch the visualization software. The address of the GIS is defined in the geographic system protocol in the modeling assistant options.

Parameters:
• Threshold used to define geographic proximity (tile size in meters)
• Time threshold used to define if two events occur at the same time
• Thresholds to limit the number of paths in a segment or colocations to include in the analysis


19 InfiniteInsight Recommendation

InfiniteInsight Recommendation allows you to make recommendations (similarly to Association Rules or Market Basket Analysis) by generating rules (for example, purchasing Product A leads to purchasing Product B).

Recommendation uses the link analysis technique implemented in InfiniteInsight Social. This technique is optimized to work on large volumes of transactions. Recommendation triggers all the existing rules in a projected graph whose antecedent is a neighbor of the given user in the bipartite graph. For example, a user is shown to have purchased 4 products in the bipartite graph. These 4 products are antecedents for a set of rules found in the projected graph. The items returned by the apply mode are the consequents of the rules contained in the projected graph.
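The mechanism described above can be approximated with a short Python sketch: a bipartite user-to-item structure, a projected item-to-item rule set filtered by support and confidence, and an apply step that triggers the rules whose antecedent the user already holds. The sample transactions, thresholds, and function names are hypothetical; confidence is computed here as the share of the antecedent's users who also hold the consequent, and the Predictive Power criterion is not modeled.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical transactions: (user identifier, item identifier).
transactions = [
    ("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"),
    ("u3", "A"), ("u3", "C"), ("u4", "B"),
]


def bipartite(rows):
    """Bipartite graph as a map from each user to the set of linked items."""
    baskets = defaultdict(set)
    for user, item in rows:
        baskets[user].add(item)
    return baskets


def project_rules(baskets, min_support=2, min_confidence=0.5):
    """Item-to-item rules (antecedent, consequent) -> confidence."""
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for items in baskets.values():
        for item in items:
            item_count[item] += 1
        for a, b in permutations(items, 2):
            pair_count[(a, b)] += 1
    return {
        (a, b): n / item_count[a]
        for (a, b), n in pair_count.items()
        if n >= min_support and n / item_count[a] >= min_confidence
    }


def recommend(user, baskets, rules, include_owned=False):
    """Trigger every rule whose antecedent is one of the user's items."""
    owned = baskets.get(user, set())
    scores = {}
    for (antecedent, consequent), confidence in rules.items():
        if antecedent in owned and (include_owned or consequent not in owned):
            scores[consequent] = max(confidence, scores.get(consequent, 0.0))
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


baskets = bipartite(transactions)
rules = project_rules(baskets)
```

The `include_owned` flag mirrors the option to recommend items already purchased by the customer; a bestseller filter would be a further pass over `item_count`.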

Recommendation provides a specialized workflow to make it easy to obtain a set of recommendations for a given customer.

19.1 Features

Input: Recommendation needs a transactional dataset containing user identifiers (users or sessions) and item identifiers (products). The dataset can also contain a transaction date and a weight to be associated with the transaction (for example, number of items purchased).

Results: Computed rules and details on the computation.

Outputs:
• Recommendations (for each customer) with confidence
• Node displays for visualizing relationships between items and transactions (for example, purchases by a user weighted by frequency, all users who purchased an item, recommendations between products)
• Statistical reports (rule distribution (how items are related), item connectivity (transactions by item), user connectivity (items by transaction), items with the highest number of recommendations, items with the highest number of connections)

Parameters:
• Minimum Support (how many times a rule must be found in the dataset to be considered valuable as a recommendation)
• Minimum Confidence (the confidence (as a percentage) below which a rule is not used as a recommendation)
• Minimum Predictive Power (quality threshold below which a rule is not considered valuable as a recommendation)
• Minimum number of items to define a bestseller (default 50,000)
• Whether or not to include bestsellers (megahubs)
• Whether or not to recommend items already purchased by the customer


20 Scorer Technical Specifications

InfiniteInsight® Scorer generates source code in different formats for InfiniteInsight® Modeler - Data Encoding/InfiniteInsight® Modeler - Regression/Classification and InfiniteInsight® Modeler - Data Encoding/InfiniteInsight® Modeler - Segmentation/Clustering models. The supported output codes are:

• C
• MYSQL
• SQL
• PMML2
• AWK
• HTML (JavaScript)
• SAS
• JAVA
• BASIC
• TERADATA UDF
• DB2 UDF
• Oracle UDF
• SQLServer UDF
• Score Card in HTML

20.1 Features

Scores obtained by using the generated codes should be the same as those obtained with SAP InfiniteInsight®. However, slight differences may exist, mainly due to precision issues in computation. The following tables sum up known problems.

Caption

Symbol | Meaning
++ | Syntax is correct and results are the same as SAP InfiniteInsight® engines (1).
+ | Syntax is correct and results differ by fewer than 10 errors.
! | Syntax is correct but results are different (2).
!! | Code not implemented.
X | Code is generated, but problems can occur on some systems due to system limitations.
(empty) | Not tested

(1) Results may be slightly different due to precision issues, especially with models with a lot of variables.

(2) Database types without RTrim (right trim, which automatically removes whitespace at the end of a string) treat two categories whose names differ only by a trailing whitespace as different categories.
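The effect described in note (2) can be reproduced with a few lines of Python. This is only an illustration of the string comparison issue; `category_key` and the sample values are hypothetical and do not correspond to any generated code.

```python
def category_key(value, rtrim=True):
    """Return the category name, optionally right-trimmed as RTrim would do."""
    return value.rstrip(" ") if rtrim else value


values = ["Private", "Private ", "State-gov"]

# With right trim, "Private" and "Private " collapse into one category.
with_rtrim = {category_key(v) for v in values}

# Without it, the trailing whitespace yields an extra, distinct category.
without_rtrim = {category_key(v, rtrim=False) for v in values}
```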


20.1.1 Without Date Variables

Key Code | Classification/Regression order 1 | Classification/Regression order 2 | Segmentation/Clustering | Segmentation/Clustering with SQL Expression
C | ++ | ++ | ++ | ++
JAVA | ++ | ++ | ++ | ++
PMML3.2 | ++ | !! | ++ | !!
AWK | ! | ! | ! | !
CPP | ++ | ++ | ++ | ++
SAS | ++ | ++ | ++ | ++
SQLServer | ++ | !! | !! | ++
SQLServerUDF | ++ | !! | !! | ++
HANA (1) | ++ | !! | !! | ++
ORACLE | ++ | !! | !! | ++
OracleUDF | ++ | !! | !! | ++
SQLDB2 | ++ | !! | !! | ++
DB2UDF | ++ | !! | !! | ++
DB2V9 | ++ | !! | !! | ++
SQLTeradata | ++ | !! | !! | ++
TERAUDF | ++ | !! | !! | ++
MYSQL | ++ | !! | !! | ++
MYSQLUDF | ++ | !! | !! | ++
SybaseIQ | ++ | !! | !! | ++
SybaseIQUDF | ++ | !! | !! | ++
SQLNetezza | ++ | !! | !! | ++
SQLVertica | ++ | !! | !! | ++
PostgreSQL | ++ | !! | !! | ++
Greenplum | ++ | !! | !! | ++
Hive | ++ | !! | !! | ++

Note (1) InfiniteInsight® Scorer manages SAP HANA column and row storage.

Caution Only the SQLServer key code handles trimmed data during its execution. For other codes, data that are not trimmed may generate some differences.


20.1.2 With Date Variables

Key Code | Classification/Regression order 1 | Classification/Regression order 2 | Segmentation/Clustering | Segmentation/Clustering with SQL Expression
C | ++ | ++ | ++ | ++
JAVA | ++ | ++ | ++ | ++
PMML3.2 | !! | !! | !! | !!
AWK | !! | !! | !! | !!
CPP | ++ | ++ | ++ | ++
SAS | ++ | ++ | ++ | ++
SQLServer | ++ | !! | !! | ++
SQLServerUDF | ++ | !! | !! | ++
HANA (1) | ++ | !! | !! | ++
ORACLE | ++ | !! | !! | ++
OracleUDF | ++ | !! | !! | ++
SQLDB2 | ++ | !! | !! | ++
DB2UDF | ++ | !! | !! | ++
DB2V9 | ++ | !! | !! | ++
SQLTeradata | ++ | !! | !! | ++
TERAUDF | ++ | !! | !! | ++
MYSQL | ++ | !! | !! | ++
MYSQLUDF | ++ | !! | !! | ++
SybaseIQ | ++ | !! | !! | ++
SybaseIQUDF | ++ | !! | !! | ++
SQLNetezza | ++ | !! | !! | ++
SQLVertica | ++ | !! | !! | ++
PostgreSQL | ++ | !! | !! | ++
Greenplum | ++ | !! | !! | ++
Hive | ++ | !! | !! | ++

Notes (1) InfiniteInsight® Scorer manages HANA column and row storage.


Caution Only the SQLServer key code handles trimmed data during its execution. For other codes, data that are not trimmed may generate some differences.

20.2 Notes

The limit for the number of parameters of a UDF is independent of SAP InfiniteInsight® and is determined by the DBMS limitations. The following table details the maximum number of arguments allowed for a UDF with respect to the DBMS:

DBMS | Number of parameters
SQLServer 2000 | 1024
Oracle | 128
Teradata | 128
DB2 | 90
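The limits above lend themselves to a simple pre-check before generating a UDF. The Python sketch below assumes, for illustration only, that a generated UDF takes one argument per model input variable; the dictionary copies the values from the table, and `fits_udf_limit` is a hypothetical helper, not part of InfiniteInsight® Scorer.

```python
# Maximum number of UDF arguments per DBMS, as listed in the table above.
UDF_MAX_PARAMETERS = {
    "SQLServer 2000": 1024,
    "Oracle": 128,
    "Teradata": 128,
    "DB2": 90,
}


def fits_udf_limit(dbms, n_model_inputs):
    """True if a UDF with one argument per input stays within the DBMS limit."""
    return n_model_inputs <= UDF_MAX_PARAMETERS[dbms]
```

For example, under this assumption a model with 100 input variables would fit an Oracle or Teradata UDF but not a DB2 one.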


21 InfiniteInsight® Access

InfiniteInsight® Access is a solution for accessing data in a wide variety of formats. It allows reading from and writing to SAS files, SPSS files, Minitab files, Excel files, and several other file types.

21.1 ODBC

This section lists the ODBC platforms that have been tested by our development team and that can be easily reproduced.

21.1.1 Platform: A Definition

In the context of SAP InfiniteInsight®, a platform is considered to be the combination of:

• The client platform where SAP InfiniteInsight® is running.
• The Database Management System (DBMS) platform where the DBMS providing the data is running.
• The DBMS.
• The ODBC driver. This is the software layer exposing the data from the DBMS in a standard way.
• The ODBC driver manager. This is an intermediate software layer managing the ODBC drivers installed on a computer.

21.1.2 Reproducibility Issue

Because a platform is a complex combination of components and settings, exhaustively exploring all potential platforms is a huge task.

To enhance SAP InfiniteInsight® reproducibility and test processes on platforms, we are currently working on a tool that will let us:

• Test new platforms more easily
• Integrate all test platforms into daily non-regression tests


21.1.3 List of Platforms Reproduced and Tested

All the platforms listed in the table below have been tested at least once, but not necessarily re-tested for each version of SAP InfiniteInsight®. However, we can easily reproduce them for further testing if need be. Some features (Data Manipulation or InfiniteInsight® Scorer) may be unavailable or not yet tested on some platforms.

DBMS | InfiniteInsight Engine OS | ODBC Driver | ODBC Manager | Data Manipulation | Scorer
Access | Windows 32 bits (1) | Microsoft 4.00 | Microsoft | |
db2 9.5 | Linux X86 64 bits | v91fp7 | unixODBC | Not tested | Not tested
db2 9.5 | Windows 32 bits (1) | IBM 9.05.00.808 | Microsoft | |
db2 9.5 | Windows 64 bits (1) | IBM DB2 ODBC DRIVER 9.05.00.808 | Microsoft | Not tested | Not tested
Greenplum 4.2 | Windows 32 bits (1) | DataDirect 7.1 SP3 Greenplum Wire protocol 7.10.0.72 | Microsoft | |
Greenplum 4.2 | Windows 64 bits (1) | DataDirect 7.1 SP3 Greenplum Wire protocol 7.10.0.72 | Microsoft | |
Greenplum 4.2 | Linux X86 64 bits | DataDirect 7.1 SP3 Greenplum Wire protocol 7.10.0.72 | DataDirect | |
HANA 1.0 rev 70 | Windows 32 bits (1) | HDBPDBC sp7 rev 0 (HDBODBC 1.00.7.0.58439) | Microsoft | |
 | | HDBPDBC sp7 rev 0 (HDBODBC 1.00.7.0.58439) | Microsoft | |
HANA 1.0 rev 70 | Linux 64 bits | HDBPDBC sp7 rev 0 (HDBODBC 1.00.7.0.58439) | unixODBC 2.2.14 | |
 | | HDBPDBC sp7 rev 73 (HDBODBC 1.00.73.00) | Microsoft | |
HANA 1.0 rev 73 | Windows 64 bits | HDBPDBC sp7 rev 73 (HDBODBC 1.00.73.00) | Microsoft | |
HANA 1.0 rev 73 | Linux 64 bits | HDBPDBC sp7 rev 73 (HDBODBC 1.00.73.00) | unixODBC 2.2.14 | |
HANA 1.0 rev 73 | Solaris 11 sparc 64 bits | HDBPDBC sp7 rev 73 (HDBODBC 1.00.73.00) | unixODBC 2.3 | |
HANA 1.0 rev 73 | AIX 7.1 64 bits | HDBPDBC sp7 rev 82 (HDBODBC 1.00.73.00) | unixODBC 2.2.14 | |
HANA 1.0 rev 82 | Windows 64 bits | HDBPDBC sp7 rev 82 (HDBODBC 1.00.82.00) | Microsoft | |
HANA 1.0 rev 82 | Linux 64 bits | HDBPDBC sp7 rev 82 (HDBODBC 1.00.82.00) | unixODBC 2.2.14 | |
HANA 1.0 rev 82 | Solaris 11 sparc 64 bits | HDBPDBC sp7 rev 82 (HDBODBC 1.00.82.00) | unixODBC 2.3 | |
Hive 12 – Server | Windows 32 bits (1) | DataDirect 7.1 SP3 (SAP InfiniteInsight Data Access for Hive) | Microsoft | |
Hive 12 – Server | Windows 64 bits (1) | | Microsoft | |
Hive 12 – Server | Linux X86 64 bits | DataDirect 7.1 SP3 (SAP InfiniteInsight Data Access for Hive) | DataDirect | |
Hive 12 – Server | Solaris 11 sparc 64 bits | | DataDirect | |
Hive 12 – Server | AIX 7.3 64 bits | DataDirect 7.1 SP3 (SAP InfiniteInsight Data Access for Hive) | DataDirect | |
MySQL 5.0.46 | Linux X86 64 bits | ODBC connector 5.1.5 | unixODBC 2.2.14 | |
MySQL 5.0.46 | Windows 32 bits (1) | MyODBC 3.51.25.00 | Microsoft | |
MySQL 5.0.46 | Windows 64 bits (1) | MyODBC 3.51.23.00 | Microsoft | |
Oracle 10.02.0040 | Windows 32 bits (1) | DataDirect 7.1 SP3 (SAP InfiniteInsight Data Access for Oracle) | Microsoft | |
Oracle 10.02.0040 | Windows 64 bits (1) | | Microsoft | |
Oracle 10.02.0040 | Linux 64 bits | DataDirect 7.1 SP3 (SAP InfiniteInsight Data Access for Oracle) | DataDirect | |
Oracle 10.02.00.40 | Solaris 11 sparc 64 bits | | DataDirect | |
Oracle 10.02.0040 | Windows 32 bits (1) | Oracle 10.02.00.01 | Microsoft | |
Oracle 10.02.0040 | Windows 64 bits (1) | Oracle 11.02.00.03 | Microsoft | |
Oracle 11g 11.02.0010 | Windows 64 bits (1) | Oracle 11.02.00.03 | Microsoft | |
Oracle 11g 11.02.0010 | Linux 64 bits | | DataDirect | |
Oracle 11g 11.02.0010 | Solaris 11 sparc 64 bits | | DataDirect | |
Oracle 11g 11.02.0010 | AIX 7.3 64 bits | | DataDirect | |
PostgreSQL 8.4.21 | Windows 32 bits (1) | PostgreSQL 9.00.03.10 | Microsoft | |
PostgreSQL 8.4.21 | Windows 64 bits (1) | PostgreSQL 9.00.03.10 | Microsoft | |
PostgreSQL 9.00.0200 | Linux 64 bits | postgresql90-odbc-09.00.020-1PGD.rhel5.x86_64 | unixODBC 2.2.14 | |
SQLServer 2005 | Windows 32 bits (1) | Microsoft SQL Native client 2005.90 | Microsoft | |
SQLServer 2005 | Windows 64 bits (1) | SQL Server Native Client 10.0 (2) | Microsoft | |
SQLServer 2005 | Windows Server 2003 R2 64 bits | SQL Server Native Client 10.0 (2) | Microsoft | |
SQLServer 2008 | Linux 64 bits | Microsoft ODBC Driver 11 for SQL Server | unixODBC | |
SQLServer 2012 | Linux 64 bits | Microsoft ODBC Driver 11 for SQL Server | unixODBC | |
Sybase IQ 15.4 | Windows 32 bits (1) | Sybase IQ IQ 12.00.0.6567 | Microsoft | |
Sybase IQ 15.4 | Windows 64 bits (1) | Sybase IQ IQ 12.00.0.6567 | Microsoft | |
Teradata 13.1 (3) | Linux 64 bits | DataDirect 7.1 SP3 (SAP InfiniteInsight Data Access For Teradata) | DataDirect | |
Teradata 13.1 (3) | Windows 32 bits (1) | DataDirect 7.1 SP3 (SAP InfiniteInsight Data Access For Teradata) | Microsoft | |
Teradata 13.1 | Windows 64 bits (1) | DataDirect 7.1 SP3 (SAP InfiniteInsight Data Access For Teradata) | Microsoft | |
Teradata 13.1 | Windows 32 bits (1) | Teradata TTUF 15.0 | Microsoft | |
Teradata 13.1 | Linux 64 bits | Teradata TTUF 14.1 | Teradata | |
Teradata 13.1 | Solaris 11 sparc 64 bits | Teradata TTUF 14.1 | Teradata | |
Teradata 13.1 | AIX 7.3 | Teradata TTUF 14.1 | Teradata | |
Vertica 07.00.0001 | Windows 32 bits (1) | Vertica ODBC Driver 4.01.13.00 | Microsoft | |
Vertica 07.00.0001 | Windows 64 bits (1) | Vertica ODBC Driver 4.01.13.00 | Microsoft | |
Vertica 07.00.0001 | Linux 64 bits | Vertica ODBC Driver 4.1.09 | unixODBC 2.2.14 | |

Notes

(1) DBMS connectivity on Windows 8 & 8.1 is not tested.

(2) Connectivity on SQLServer 2008/2012 needs either the SQL Server native client 10.0 ODBC driver or the SQL Server native client 11.0 ODBC driver. Alternative SQL Server ODBC drivers are not supported by SAP InfiniteInsight®.

(3) All connections to Teradata using InfiniteInsight® Data Access for Teradata need the standard Teradata client packages: Tdicu, TeraGSS, cliv2. These packages can be found on the standard Teradata Tools and Utility Files CDs. TTUF82, TTUF12, TTUF13, TTUF13.1, TTUF 14.0, TTUF 14.1, and TTUF 15.0 are compatible.


22 Flat Files

22.1 Supported Data Formats

The following standard formats are supported:

• TAB-delimited files.
• Comma Separated Values (CSV) files (English CSV). Because of localization issues, the CSV format is not always safe: on some platforms, the actual delimiter of a CSV file may depend on the machine language. For that reason, we recommend using TAB-delimited files.

The TAB-delimited files should comply with the following specifications:

• The file is an ASCII file.
• Each line contains a record of values.
• Values in a line are separated by a TAB character (tabulation, ASCII code 9).
• The first line of the file contains the names of the variables.
• Variable names should not include the special character slash (/).
• Variable names should be unique (two variables should not have the same name).
• Numbers should use the English convention for the decimal point (the decimal point is ‘.’).
• Strings may be enclosed in quote characters (single or double quotes); this is not mandatory.
• The lines should always contain the same number of TAB characters.
• Dates, if any, should be represented in the following format: “YYYY-MM-DD” (ISO 8601 format).

Note that SAP InfiniteInsight® provides a date coding feature that automatically extracts date information such as “day of week”, “number of years, or days, since this date”, and so on in order to improve the models. Example of valid date: 2000-03-24
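Several of the rules above can be checked mechanically before a file is loaded. The Python sketch below is one possible pre-flight check, not a tool delivered with the product: `check_tab_file` and its `date_columns` parameter are hypothetical, and only the ASCII, header, column count, and date format rules are covered.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_tab_file(text, date_columns=()):
    """Return a list of problems found in the content of a TAB-delimited file."""
    lines = text.splitlines()
    if not lines:
        return ["file is empty"]
    problems = []
    if not text.isascii():
        problems.append("file contains non-ASCII characters")
    header = lines[0].split("\t")
    if len(set(header)) != len(header):
        problems.append("variable names are not unique")
    problems += [f"variable name contains '/': {name}" for name in header if "/" in name]
    date_indexes = [header.index(name) for name in date_columns if name in header]
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != len(header):
            problems.append(f"line {number}: expected {len(header)} values, found {len(fields)}")
            continue
        for index in date_indexes:
            value = fields[index]
            try:
                # The regular expression enforces zero padding; strptime checks the calendar.
                if not ISO_DATE.fullmatch(value):
                    raise ValueError(value)
                datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                problems.append(f"line {number}: {header[index]} is not YYYY-MM-DD: {value}")
    return problems
```

An empty result means none of the checked rules was violated; it does not guarantee that the file will load.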

Here is a simple example of how such a data file could look (columns are aligned here, they are in fact just separated by a TAB character):

age  Workclass         Education  Educ-Level  Gain      class
39   State-gov         Bachelors  13          2174.00   0
50   Self-emp-not-inc  Bachelors  13          0.00      0
38   Private           HS-grad    9           0.00      0
53   Private           11th       7           0.00      0
28   Private           Bachelors  13          0.00      0
37   Private           Masters    14          0.00      0
49   Private           9th        5           0.00      0
52   Self-emp-not-inc  HS-grad    9           0.00      1
31   Private           Masters    14          14084.00  1
42   Private           Bachelors  13          5178.00   1


22.2 Note about Date and Datetime Variables

Internally in SAP InfiniteInsight® all dates are converted to datetime. This allows comparing and mixing dates with different formats, either date or datetime.

Duration computations also follow this behavior. When performing Event Log Aggregation or Sequence Analysis, the periods defined in the settings (such as "3 periods of 2 weeks before the reference date") are converted to the bounds of datetime ranges.

When a date is converted to datetime, the time is set by default to noon (12:00), instead of midnight (0:00). This is to avoid problems when converting back to date from datetime (as a one second delta may change the date by one day). If you look at a table containing date values that have been forced to datetime, you will see the dates with a time set to 12:00:00.
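The noon convention and the conversion of periods into datetime bounds can be illustrated with Python's standard `datetime` module. This is a sketch under stated assumptions: `to_datetime` and `period_bounds` are hypothetical helpers, and whether the product treats range bounds as inclusive or exclusive is not specified here.

```python
from datetime import date, datetime, time, timedelta


def to_datetime(value):
    """Promote a date to a datetime at noon; leave datetimes unchanged."""
    if isinstance(value, datetime):  # datetime is a subclass of date, so test it first
        return value
    return datetime.combine(value, time(12, 0))


def period_bounds(reference, count=3, length=timedelta(weeks=2)):
    """Return (start, end) datetime pairs for `count` periods before the reference, oldest first."""
    end = to_datetime(reference)
    bounds = []
    for _ in range(count):
        start = end - length
        bounds.append((start, end))
        end = start
    return list(reversed(bounds))


noon = to_datetime(date(2000, 3, 24))
midnight = datetime.combine(date(2000, 3, 24), time(0, 0))

# A one-second delta keeps the noon value on the same day but moves midnight to the previous day.
same_day = (noon - timedelta(seconds=1)).date()
previous_day = (midnight - timedelta(seconds=1)).date()
```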

Tip In the user interface, to indicate a datetime compatible with a date value, enter it with the time set to noon (12:00:00).


23 SAS Files

23.1 Supported Data Formats

The following data formats are supported by SAP InfiniteInsight®:

Format | Extension | Version
SAS for Windows and OS/2 | SD2, SAS7BDAT | 6/7/8/9
SAS for Unix | SSD*, SAS7BDAT | 6/7/8/9
SAS CPORT | STC | 6/7/8/9
SAS Transport Files | XPT, TPT | 6/7/8/9

Note Compressed versions of these formats cannot be read by SAP InfiniteInsight®.


24 Annex



24.1 Open Source Software Used in InfiniteInsight®

InfiniteInsight® uses third-party software to power its processes and to improve the handling of the GUI. For more information, refer to the delivered document Third-party Software Delivered with InfiniteInsight®.


24.2 List of Available Binaries

The following table lists the binaries available for each platform.

Platform | OS | Binary | Processors
Windows (32-bit) | Windows (7 SP1, 8) | Microsoft Visual Studio 2005 | Intel 32-bit (X86)
Windows (64-bit) | Windows (7 SP1, 8, 8.1, Server 2008 R2) | Microsoft Visual Studio 2005 | Intel 64-bit (X64)
Sun SPARC | Solaris 11.1 and later versions | Sun CC 5.10 | SPARC v9 (64-bit)
IBM | AIX 7.1.0 TL3 and later versions | xlC 12.1 | PowerPC 64-bit
RedHat ES6 | RedHat Enterprise Linux 6.4 and later versions | gcc++ 4.3 | Intel 64-bit (X64)
SuSE 11 | SuSE Linux Enterprise Server 11 (x86_64) Patch Level 3 | gcc++ 4.3 | Intel 64-bit (X64)

www.sap.com/contactsap

© 2014 SAP SE or an SAP affiliate company. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP affiliate company. The information contained herein may be changed without prior notice.

Some software products marketed by SAP SE and its distributors contain proprietary software components of other software vendors. National product specifications may vary.

These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or warranty of any kind, and SAP or its affiliated companies shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP or SAP affiliate company products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.

SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. All other product and service names mentioned are the trademarks of their respective companies.

Please see http://www.sap.com/corporate-en/legal/copyright/index.epx for additional trademark information and notices.
