SAP HANA: Business Function Library (BFL) and Predictive Analysis Library (PAL)

Ramp-Up Knowledge Transfer Customer

November 2012

© 2012 SAP AG. All rights reserved.


 Agenda

1. Overview

2. Business Function Library

3. Predictive Analysis Library


Overview


Application Function Library (AFL)

[Diagram labels: HANA Clients (App Server, Analytics Technology, etc.); SQLScript; SAP HANA; Application Functions (C++): Business Function Library, Predictive Analysis Library; RLANG; LLANG; AFLLANG; AFLLANG Generator; AFL Framework]

AFL Technology includes:

AFL Framework
• On-demand library loading
• Make process independent of the whole kernel

AFLLANG
• New language type similar to R and L
• Special implementation of L procedures

AFLLANG Generator
• Users do not create AFL procedures through AFLLANG directly, but through a generator
• The generator is a pre-defined common SQLScript procedure, which is ready when the system is up
• Users need to have proper permissions

Application Functions
• Written in C++ and delivered as AFL content
• PAL (Predictive Analysis Library) and BFL (Business Function Library) will be released in SPS05 as AFL content


Business Function Library


“Run Smarter Faster”

Business Function Library (BFL)

• Compiled analytic function library for business functionality in HANA SP5
• Supports various pre-built, parameter-driven algorithms
• Embedded into the calculation engine

Help Customers To

• Build applications quickly: reuse common business functionalities without developing them.
• Compute quickly: perform functions in real time with high-performance in-memory computation.
• Empower the business: bring decision support capabilities to the business users through a simplified experience and pre-built scenarios.


BFL in SAP HANA

[Diagram labels: SAP Business Suite; Third-party systems; Real-time replication services; Data services; SAP HANA (In-memory database, Planning and calculation engine, Business Function libraries, Predictive Analysis Library, Application services, Information Composer, Modeling Studio); Real-time analytics; Real-time apps; Microsoft Excel; SAP Business Objects Solutions; Others… (Open)]


BFL Example: Cycles Function

• Definition: This function calculates seasonal factors by using Fourier coefficients. It combines sine and cosine waves to help you determine seasonality or other cyclical business factors.

• Parameters

Parameter    Input/Output         Description
Amplitude    Input Field Item     Amplitude of sine/cosine.
Length       Input Field Item     Length (in years) over which the cycle repeats itself.
Startdate    Input Field Item     Time in years at which the cycle starts.
Function     Input Field Item     0 for a sine wave, and 1 for a cosine wave.
Time         Input Field Item     Time periods.
Result       Output Field Item    Result table that contains the expected result.


BFL Example: Cycles Function

• Syntax:

[Screenshots: Table Preparation; Execute statement]


BFL: Functions -1-

Function | Description
Annual Depreciation | Calculates annual depreciation according to three common methods: Diminishing balance depreciation, Straight line depreciation and Sum-of-year depreciation. It allows variable length of timescales for all assets/items.
Cumulate | Calculates the cumulative totals in one row based on the original numbers in another row.
Cycles | Calculates seasonal factors from Fourier coefficients. It combines sine and cosine waves to help you determine seasonality or other cyclical business factors.
Days | Returns the number of days in each period defined by each pair of From and To dates.
Days Outstanding | Calculates receipts or payments based on the level of days outstanding.
De-cumulate | Calculates the original series starting from the cumulated totals.
Delay | Calculates receivables or payables based on a delay between the time of invoice and the time of payment.
Delay Debt | Calculates cash receipts using actual sales. The closing debtor balance for each period is calculated by referring to historic sales levels for a specified number of days.
Delay Stock | Calculates purchases required to meet future demand.
Discounted Cash Flow | Converts a future stream of cash flow to constant prices. It calculates the inflated value of today's money.
Driver | Calculates the forecast for future periods using historical data and as many drivers as needed. A driver drives cost, such as headcount, floor space, units sold, and unit price.
Feed | Calculates the closing balance and "feeds" it to the opening balance of the next time period.
Feed Overflow | Calculates the closing balance and feeds it to the opening balance of the next period.
Forecast | Combines actual and forecast data to produce a rolling forecast. Eliminates scripting of feeds.
Forecast Agents | A specialized version of the Driver function focused on the entities required to meet service levels. Used primarily for labor in areas like call centers and mortgage processing based on interest rate.
Forecast Driver | A specialized version of the Driver function that calculates the forecast for future periods using historical data and one single driver.
Forecast Dual Driver | Calculates the forecast for future periods using historical data and two drivers. It also calculates the incremental effect of each driver on the historical base figure.
Forecast Mix | Mixes actual data prior to the switchover date with forecast data on and after the switchover date.
Forecast Sensitivity | Returns a calculation for the proportion of requests that will be queued because there are no agents available when the request was answered.
Funds | Calculates the use of funds or the source of funds.
Future | Calculates the closing balance of an account given the start balance and the conditions under which the account runs.
Grow | Grows a base figure by a specified percentage each period. It can be compound or linear.
Inflated Cash Flow | Calculates the amount of cash you must receive in a future period to compensate for inflation.
Internal Rate of Return | Calculates the Internal Rate of Return for a series of cash flows on specified dates.
Lag | Calculates a result in one row by lagging an input from another row by a specified number of periods.
Last | Looks back through the series of data of the input row and returns the most recent non-zero value.


BFL: Functions -2-

Function | Description
Lease | Calculates a payment schedule for a lease, loan, mortgage, annuity or savings account.
Lease Variable | Allows an account to be scheduled along a time scale representing the life of the loan.
Linear Average | Calculates a linear average that applies a larger weight to more recent periods. The weights applied decrease linearly as time goes back.
Max Value | Returns the maximum value of a range.
Minimum Value | Returns the minimum value of a specific range.
Moving Average & Moving Sum | Calculates a moving average or moving sum over specified periods. Key statistical component.
Moving Median | Takes the median value after sorting all input values into ascending sequence.
Number of Periods | Calculates the number of periods over which the account must run.
Net Present Value | Calculates the sum of a series of future cash flow values after discounting each to a present value based on the annual rate input for the period in which it is being calculated.
Outlook | The Outlook is calculated by using actuals of past months and plan figures of future months.
Payment | Calculates the regular payment to an account for each period.
Present Value | Calculates the opening value from the given target closing balance and various parameters.
Proportion | Allows you to input a start and end date, and then calculates the proportion of the period length. Important for project planning with performance-to-plan calculations.
Rate | Calculates the percentage interest rate per period for an account, given its start balance, end balance, payment amount per period and the number of periods.
Repeat | Repeats data from a single period or group of periods through the time scale of the Dimension List.
Rounding | Calculates the rounded values for a specified input item according to a chosen rounding method.
Seasonal Complex | Performs seasonal adjustments of time to determine seasonal patterns in data.
Seasonal Simple | Performs seasonal adjustments of time to determine seasonal patterns in data.
Seasonal Simulation | Provides the building blocks to simulate seasonal data using a variety of characteristics.
Stock Flow | Works out the level of supply needed to meet target forecasts for stock cover.
Stock Flow Reverse | Allows you to input stock cover and work out what purchases were needed to meet the target stock levels.
Stock Flow Batch | Lets you use batch quantities in Stock Flow calculations. Key for constraint-based models or non-discrete manufacturing units of measure.


BFL: Functions -3-

Function | Description
Year over Year Difference | Calculates the Year over Year Difference between the current and previous time periods.
Year to Date | Calculates year to date totals based on original data.
Year to Date Statistical | Calculates the original numbers in one row based on the year-to-date figures in another row.


Predictive Analysis Library


“Run Smarter Faster”

Predictive Analysis Library (PAL)

• Compiled analytic function library for predictive analysis in HANA SPS05
• Supports multiple algorithms: K-Means, Association Analysis, C4.5 Decision Tree, Multiple Linear Regression, Exponential Smoothing…

Help Customers To

• Know your business: uncover deep insights and patterns about the business, such as association rules, customer clustering, or sales prediction.
• Decide with confidence: drive more advanced analyses. Decisions are made with support from analysis numbers.
• Compute quickly: query and analyze data in real time with high-performance in-memory computation.
• Empower the business: bring decision support capabilities to the business users through a simplified experience and pre-built scenarios.


PAL in SAP HANA

[Diagram labels: SAP Business Suite; Third-party systems; Real-time replication services; Data services; SAP HANA (In-memory database, Planning and calculation engine, Business Function libraries, Predictive Analysis Library, Application services, Information Composer, Modeling Studio); Real-time analytics; Real-time apps; Microsoft Excel; SAP Business Objects Solutions; Others… (Open)]


PAL - Algorithms

Association Analysis
• Apriori
• Apriori Lite

Cluster Analysis
• K-Means
• Kohonen Self Organized Maps *

Classification Analysis
• C4.5 Decision Tree Analysis
• CHAID Decision Tree Analysis
• K Nearest Neighbour
• Multiple Linear Regression
• Polynomial Regression *
• Exponential Regression
• Bi-Variate Geometric Regression
• Bi-Variate Logarithmic Regression
• Logistic Regression

Time Series Analysis
• Single Exponential Smoothing
• Double Exponential Smoothing
• Triple Exponential Smoothing

Outlier Detection
• Inter-Quartile Range Test (Tukey’s Test)
• Variance Test *
• Anomaly Detection *

Data Preparation
• Sampling *
• Binning *
• Scaling *

Other
• ABC Classification
• Weighted Scores Table

* New in SPS05


Association Analysis

Definition: find the most frequent associations in a dataset.

Applications

• Clearly: shopping carts and supermarket shoppers
• Analysis of any product purchases… not just in shops
• Analysis of telecom service purchases
• Analysis of telephone calling patterns
• The ‘basket’ can be a household
• Identification of fraudulent medical insurance claims: consider cases where common rules are broken.
• Differential analysis: compare results between different stores, between customers in different demographic groups, between different days of the week, different seasons of the year, etc.


 Association Analysis

We use it in our everyday lives


Association Analysis

Example

Transaction ID    Records
0001              iPhone4s, Protector
0002              iPhone4s, Earphone, Protector
0003              iPhone4s, Protector
0004              Earphone
0005              iPad, iPhone4s

Item 1      Item 2       Support        Confidence
iPhone4s    Protector    3 / 5 = 60%    3 / 4 = 75%

Support – The association (iPhone4s -> Protector) can be found in 3 out of 5 = 60% of the transactions.

Confidence – When a customer buys an iPhone4s, they also buy a Protector 3 out of 4 = 75% of the time.
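The support and confidence figures in this example can be reproduced with a few lines of Python. This is only a minimal sketch of the two definitions, not the PAL Apriori implementation; the function names `support` and `confidence` are illustrative assumptions.

```python
# Transactions copied from the example table.
transactions = {
    "0001": {"iPhone4s", "Protector"},
    "0002": {"iPhone4s", "Earphone", "Protector"},
    "0003": {"iPhone4s", "Protector"},
    "0004": {"Earphone"},
    "0005": {"iPad", "iPhone4s"},
}


def support(antecedent, consequent, baskets):
    """Share of all baskets that contain both items."""
    both = sum(1 for items in baskets if antecedent in items and consequent in items)
    return both / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    with_antecedent = [items for items in baskets if antecedent in items]
    both = sum(1 for items in with_antecedent if consequent in items)
    return both / len(with_antecedent)


baskets = list(transactions.values())
print(support("iPhone4s", "Protector", baskets))     # 0.6
print(confidence("iPhone4s", "Protector", baskets))  # 0.75
```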


Cluster Analysis

Definition: Cluster analysis looks for clusters or groupings of objects.

Applications

• Customer segmentation
• Data reduction / problem refinement when faced with large, complex data sets
• Market segmentation and determining target markets
• Product positioning
• Test market selection
• Crime pattern analysis
• Medical research, social services, education, criminology, and so on
• Anomaly detection (converse of segmentation)
• …medical research, social services, psychiatry, education, archaeology, astronomy, taxonomy…


Cluster Analysis

The K-Means algorithm partitions a data set into K clusters. It is a very popular clustering algorithm.

Kohonen Self Organizing Maps are a type of neural network that performs clustering. When the network is fully trained, records that are similar should appear close together on the output map, while records that are different will appear far apart. This may give you a sense of the appropriate number of clusters.

[Figures: K-Means on the Iris data set; Kohonen Self Organizing Map]
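As a rough illustration of how K-Means partitions data, the sketch below runs the standard Lloyd iteration (assign each point to its nearest center, then move each center to the mean of its points). It is not the PAL implementation; the sample points, the choice of initial centers, and the fixed iteration count are assumptions made for the example.

```python
import math


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm on a list of coordinate tuples with given start centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

Different starting centers can lead to different partitions, which is why practical implementations usually expose the initialization method as a parameter.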


Classification Analysis

Definition: A classification is a model to define the relationships between inputs and an output. The output, in statistics referred to as the dependent variable, is a function of one or more inputs, the independent variables. We use known inputs and outputs to define a model, and then use the model to predict or ‘score’ unknown values. This is sometimes referred to as supervised learning or directed data mining.

Classification algorithms can be sub-divided into:

• Decision Tree algorithms: CNR Tree is one of the most well known. CHAID analysis and C 5.0 are also popular.
• Regression algorithms: Multiple Linear Regression is the most well known.
• Neural Network algorithms: These are defined in terms of their ‘topology’, e.g. MLP, RBF…
• Other: Support Vector Machines, K Nearest Neighbour…

[Diagram labels: Independent / Input variables; Dependent / Output variable]


Classification Analysis – Decision Tree Algorithms

A set of rules and a graphical tree-shaped representation of the relationships between a dependent variable and a set of independent variables. The tree may be binary or multi-branching, depending upon the algorithm used to segment the data. Each node represents a test of a decision.

There are many use cases for decision tree analysis:

• Determining the best targets for a mail shot campaign
• Churn analysis
• Profiling high income earners from census data
• Identifying spam
• Loan applicant creditworthiness


Classification Analysis – Decision Tree Algorithms

Example

New Customer Profile
• Income: 40000
• Age: 42
• Gender: Female
• House Loan: N
• Group: ?

[Diagram labels: Historical Data; C4.5 Decision Tree; Rules for A Class (the most important customers)]


Classification Analysis – Regression Algorithms

In statistics, regression analysis is a collective name for techniques for the modelling and analysis of numerical data consisting of values of a dependent variable (also called response or target) and of one or more independent variables (also known as explanatory variables or predictors).

The dependent variable in the regression equation is modelled as a function of the independent variables, corresponding parameters ("constants"), and an error term.

The error term is treated as a random variable. It represents unexplained variation in the dependent variable.

The parameters are estimated so as to give a "best fit" of the data. Most commonly the best fit is evaluated by using the least squares method, but other criteria can also be used.


Classification Analysis – Regression Algorithms

The PAL supports:

• Multiple Linear Regression: $Y = a + bX' + cX'' + dX''' + \dots$
• Polynomial Regression: $Y = a + bX + cX^2 + dX^3 + \dots$
• Exponential Regression: $Y = a \cdot b^X$
• Bi-Variate Geometric Regression: $Y = a \cdot X^b$
• Bi-Variate Logarithmic Regression: $Y = a + b \log(X)$
• Logistic Regression


Time Series Analysis

Exponential smoothing is a method of forecasting that uses weighted values of previous series observations to predict future values. The principle is that the older the data points, the less importance they should be given.

Single or Simple Exponential Smoothing – a weighted average of the past

$F_{t+1} = \alpha X_t + \alpha(1-\alpha) X_{t-1} + \alpha(1-\alpha)^2 X_{t-2} + \alpha(1-\alpha)^3 X_{t-3} + \dots + \alpha(1-\alpha)^N X_{t-N}$

where $\alpha$ is a smoothing constant, $0 < \alpha < 1$.

Example: if α is 0.1, then the weights are 0.1, 0.09, 0.081, 0.0729… If α is 0.5, then the weights are 0.5, 0.25, 0.125, 0.0625… If α is 0.9, then the weights are 0.9, 0.09, 0.009, 0.0009…

Now the above equation can be shown to be equal to

$F_{t+1} = \alpha X_t + (1-\alpha) F_t$

So the computation becomes very easy, but we have to start the process with the first forecast, and that is where different starting methods can lead to different fits and forecasts.
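The weights and the recursive form of single exponential smoothing can be checked with a short Python sketch. The starting rule used here (first forecast equal to the first observation) is an assumption; as noted above, other starting methods give different fits.

```python
def smoothing_weights(alpha, count):
    """Weights alpha * (1 - alpha)**k applied to X_t, X_(t-1), X_(t-2), ..."""
    return [alpha * (1 - alpha) ** k for k in range(count)]


def single_exponential_smoothing(series, alpha, first_forecast=None):
    """One-step-ahead forecasts via F_(t+1) = alpha * X_t + (1 - alpha) * F_t."""
    # Assumed start: F_1 = X_1 unless the caller supplies a first forecast.
    forecast = series[0] if first_forecast is None else first_forecast
    forecasts = []
    for x in series:
        forecast = alpha * x + (1 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts


print([round(w, 4) for w in smoothing_weights(0.1, 4)])  # [0.1, 0.09, 0.081, 0.0729]
print(single_exponential_smoothing([10, 20, 30], alpha=0.5))  # [10.0, 15.0, 22.5]
```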


Time Series Analysis

Three basic patterns: stationary, trends, seasonality. These equate to single, double and triple exponential smoothing.

Double Exponential Smoothing applies two smoothing constants, one for the stationary element ($\alpha$) and the other for the trend ($\mu$).

Holt’s Two-Parameter Model

$S_t = \alpha X_t + (1-\alpha)(S_{t-1} + b_{t-1})$ … the stationary element
$b_t = \mu (S_t - S_{t-1}) + (1-\mu) b_{t-1}$ … the trend element
$F_{t+m} = S_t + b_t m$

Triple Exponential Smoothing – for stationary and trend and seasonality, with a third smoothing constant ($\beta$) for the seasonal element and season length $L$.

Winters’ Three-Parameter Model

$S_t = \alpha X_t / I_{t-L} + (1-\alpha)(S_{t-1} + b_{t-1})$ … the stationary element
$b_t = \mu (S_t - S_{t-1}) + (1-\mu) b_{t-1}$ … the trend
$I_t = \beta X_t / S_t + (1-\beta) I_{t-L}$ … the seasonality
$F_{t+m} = (S_t + b_t m) I_{t-L+m}$

[Charts: two example series, plotted over 30 and 29 periods]
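Holt's two-parameter model can be sketched in a few lines of Python. The slides do not say how the level and trend are started, so the initial values below (level equal to the first observation, trend equal to the first difference) are an assumption, as are the function and variable names.

```python
def holt_forecast(series, alpha, mu, horizon):
    """Holt's two-parameter smoothing; returns F_(t+m) for m = 1..horizon."""
    level = series[0]              # assumed start: S_1 = X_1
    trend = series[1] - series[0]  # assumed start: b_1 = X_2 - X_1
    for x in series[1:]:
        previous_level = level
        # Stationary element: blend the new observation with the projected level.
        level = alpha * x + (1 - alpha) * (previous_level + trend)
        # Trend element: blend the latest level change with the old trend.
        trend = mu * (level - previous_level) + (1 - mu) * trend
    return [level + trend * m for m in range(1, horizon + 1)]


# A perfectly linear series: the forecasts simply continue the line.
forecasts = holt_forecast([10, 12, 14, 16], alpha=0.5, mu=0.3, horizon=2)
print([round(f, 6) for f in forecasts])  # [18.0, 20.0]
```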


Outliers

Definition: An outlier is an observation that lies an ‘abnormal’ distance from other values in a random sample from a population.

• Outliers can occur because of measurement errors and might be removed from the data set or corrected.
• They can occur naturally and therefore must be treated carefully.
• Some statistics / algorithms can be heavily biased by outliers, for example the simple mean, correlation, and linear regression. In contrast, the trimmed mean and median are not so affected.
• Outliers can be detected visually, for example with Scatter Plots and Box Plots.


Outlier Algorithms – Inter-Quartile Range Test (IQR)

Outliers can be detected using various algorithms. The best known is the Inter-Quartile Range Test, or Tukey Test, named after its author. It is the calculation behind the construction of Box Plots.

Given a time series $X_1$ to $X_N$, calculate the lower and upper quartiles (25th and 75th percentiles), denoted LQ and UQ. Calculate the mid spread as MID = UQ - LQ. An outlier is then defined to be any observation where

$X_i < LQ - n \cdot MID$ or $X_i > UQ + n \cdot MID$

The value of n is usually set to 1.5; however, for large time series, say more than 36 points, it is recommended to use a value of 2. The concept of very significant and significant outliers could be introduced by using values of n = 3 and n = 2 respectively.

The PAL supports:

• Inter-Quartile Range Test (Tukey’s Test)
• Variance Test – this is just the simple identification of values outside x standard deviations from the mean.
• Anomaly Detection – this is conceptually the ‘reverse’ of cluster analysis. We look for values furthest away from their nearest cluster centre, measure the absolute and percentage distance, and rank the largest ‘outliers’.
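The inter-quartile range test described on this slide can be sketched with Python's standard library. Quartile conventions differ between tools; the `inclusive` method of `statistics.quantiles` used here is an assumption and may not match the quartile rule PAL applies. The sample data is hypothetical.

```python
import statistics


def iqr_outliers(values, multiplier=1.5):
    """Return the values outside [LQ - n * MID, UQ + n * MID]."""
    # quantiles(..., n=4) returns the three quartile cut points.
    lq, _, uq = statistics.quantiles(values, n=4, method="inclusive")
    mid = uq - lq  # the mid spread (inter-quartile range)
    low, high = lq - multiplier * mid, uq + multiplier * mid
    return [x for x in values if x < low or x > high]


data = [10, 12, 11, 13, 12, 11, 10, 50]
print(iqr_outliers(data))  # [50]
```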


Data Preparation

The PAL supports:

Sampling
• First N
• Middle N
• Last N
• Every Nth
• Simple Random Sampling with replacement
• Simple Random Sampling without replacement
• Systematic Sampling
• Stratified sampling

Binning
• Equal widths based on the number of bins
• Equal widths based on the bin width
• Equal number of records per bin
• Mean / Standard Deviation bin boundaries

Scaling
• Standardized Variable or Z-Score or Standard Score
• Normalization
• Standardized Variable - Median / Median Absolute Deviation
• Normalization by Decimal Scaling
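Two of the scaling options and one binning option listed above can be illustrated as follows. This is a sketch under assumptions: the slides do not define "Normalization", so min-max scaling to the range 0..1 is assumed, and the Z-score uses the population standard deviation. Function names are illustrative.

```python
import statistics


def z_score(values):
    """Standardize to mean 0 using the population standard deviation (assumed)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]


def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(x - low) / (high - low) for x in values]


def equal_width_bins(values, bin_count):
    """Assign each value a bin index 0..bin_count-1 using bins of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    # The maximum value would land in bin `bin_count`, so clamp it into the last bin.
    return [min(int((x - low) / width), bin_count - 1) for x in values]


print(min_max([2, 4, 6]))                # [0.0, 0.5, 1.0]
print(equal_width_bins([0, 5, 10], 2))   # [0, 1, 1]
```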


ABC Classification

Definition: Divide the data into 3 groups: A, B, C. In the example: 60%, 30%, 10%.

It’s a form of segmentation analysis: now you can examine the A group, etc.
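One common way to compute an ABC split is to sort items by value, accumulate their share of the total, and cut at the chosen percentages. The sketch below follows that approach with the 60% / 30% / 10% split from the example; the handling of items that straddle a boundary and the sample sales figures are assumptions, and PAL may treat boundaries differently.

```python
def abc_classification(values, a_share=0.60, b_share=0.30):
    """Label items A, B, or C by cumulative share of the total, largest first."""
    total = sum(values.values())
    labels, cumulative = {}, 0.0
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    for name, value in ranked:
        cumulative += value / total
        # A small tolerance keeps items that land exactly on a boundary in the group.
        if cumulative <= a_share + 1e-9:
            labels[name] = "A"
        elif cumulative <= a_share + b_share + 1e-9:
            labels[name] = "B"
        else:
            labels[name] = "C"
    return labels


# Hypothetical sales per product.
sales = {"P1": 40, "P2": 20, "P3": 15, "P4": 15, "P5": 6, "P6": 4}
print(abc_classification(sales))
```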


Weighted Score Tables

Definition: the sum of each attribute multiplied by its weight.

Field weights: Income 0.0005, Age 2, City 1

Customer    Income    Income Score    Age        Age Score    City      City Score
Smith       18000     18000           20 – 29    6            Big       9
Lily        30000     30000           30 – 39    9            Small     3
Jorge       43000     43000           40 – 49    7            Medium    6

In the example, the calculated score for Lily is:

30000 * 0.0005 + 9 * 2 + 3 * 1 = 15 + 18 + 3 = 36

Score many records for comparison and sorting.
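The weighted score calculation for the three customers in the table can be written as a short Python sketch. The dictionary layout and the function name `weighted_score` are illustrative assumptions; the weights and scores are taken from the table.

```python
# Field weights and per-customer scores from the example table.
weights = {"income": 0.0005, "age": 2, "city": 1}
scores = {
    "Smith": {"income": 18000, "age": 6, "city": 9},
    "Lily": {"income": 30000, "age": 9, "city": 3},
    "Jorge": {"income": 43000, "age": 7, "city": 6},
}


def weighted_score(record, weights):
    """Sum of each attribute score multiplied by its weight."""
    return sum(record[field] * weight for field, weight in weights.items())


totals = {name: weighted_score(record, weights) for name, record in scores.items()}
ranking = sorted(totals, key=totals.get, reverse=True)
print(round(totals["Lily"], 6))  # 36.0
print(ranking)                   # ['Jorge', 'Lily', 'Smith']
```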


Working With PAL – 1 –

Step 1: Generate a PAL Procedure: First you need to generate a procedure by calling the AFL wrapper generator from SQLScript. The syntax is as follows:

CALL AFL_WRAPPER_GENERATOR(<procedure_name>, <area_name>, <function_name>, <signature_tab>);

<procedure_name>: user-defined procedure name.

<area_name>: 'AFLPAL'. This is used for all PAL functions and cannot be changed by users.

<function_name>: PAL built-in function name.

<signature_tab>: user-defined table variable. The table contains records to describe the input table type, parameter table type, and result table type. A typical table variable references a table with the following definition:


Working With PAL – 2 –

Step 2: Call a PAL Procedure: After generating a procedure, you can then call the procedure using the syntax below.

CALL <procedure_name> (<data_input_tab> {,…}, <parameter_tab>, <output_tab> {,…}) with overview;

<procedure_name>: the procedure name users defined when generating the procedure in Step 1.

<data_input_tab>: user-defined name(s) of the current procedure’s input table(s).

<parameter_tab>: user-defined name of the current procedure’s parameter table.

<output_tab>: user-defined name(s) of the current procedure’s output table(s).

Consult the documentation for more details.


PAL: Functions Available for GA use by Customers and

Partners in SP5 – 1-

K Means – A method of cluster analysis whereby the algorithm partitions N observations or records into K clusters in which eachobservation belongs to the cluster with the nearest center.

K Nearest Neighbor - The K-Nearest Neighbor (KNN) algorithm is a method for classifying objects based on the closest K objects

and their average classification / value.

Multiple Linear Regression (MLR) - An approach to modeling the linear relationship between a variable Y, usually referred to asthe dependent variable, and one or more other variables, usually referred to as independent variables, denoted X1, X2, X3...

C4.5 Decision Tree – A classification algorithm, C4.5 builds decision trees from a set of training data, using the concept of information entropy. The training data is a set of already classified samples. At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits it into subsets in one class or the other. Its criterion is the normalized information gain(difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalizedinformation gain is chosen to make the decision. The C4.5 algorithm then proceeds recursively until meeting some stopping criteriasuch as minimum number of cases in a leaf node.

CHAID Analysis - This model is similar to the C4.5 decision tree. CHAID stands forCHi-squared Automatic Interaction Detection,and is a classification method for building decision trees by using chi-square statistics to identify optimal splits. CHAID examinesthe cross tabulations between each of the input fields and the outcome, and tests for significance using a chi-square independence

test. If more than one of these relations is statistically significant, CHAID will select the input field that is the most significant(smallest p value). CHAID can generate non-binary trees

 Apr iori & Aprior i L ite - Popular association discovery algorithm commonly associated with market basket analysis. The algorithmlooks for rules to describe frequent product and other items associations. Apriori Lite is a subset of Apriori when only singleantecedent and single subsequent are required and is therefore faster.

ABC Classification – A dataset is divided into three groups (A, B and C) so that X% of a variable is in A, Y% in B and (100 – X – Y)% in C. It can be used for analyzing customer behavior and defining market segments.
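
One way to read the ABC split is by cumulative share of a total, ranked from the largest contributor down. The Python sketch below follows that reading, which is an assumption, as are the 70% / 20% thresholds, the function name and the revenue figures.

```python
def abc_classify(values, a_share=0.7, b_share=0.2):
    """Label each key A, B or C by its cumulative share of the total.

    `values` maps an identifier to a non-negative amount. Items are ranked
    from largest to smallest; A covers the first `a_share` of the total,
    B the next `b_share`, and C the rest.
    """
    total = sum(values.values())
    labels, running = {}, 0.0
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    for key, value in ranked:
        running += value
        share = running / total
        if share <= a_share:
            labels[key] = "A"
        elif share <= a_share + b_share:
            labels[key] = "B"
        else:
            labels[key] = "C"
    return labels

# Hypothetical revenue per customer.
revenue = {"c1": 60, "c2": 25, "c3": 8, "c4": 4, "c5": 3}
labels = abc_classify(revenue)
print(labels)
```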


PAL: Functions Available for GA use by Customers and Partners in SP5 – 2 –

Weighted Score Tables – Each column / variable in a table is allocated a score, which may vary across its range of values, and then a weight. Each record is scored and the scores are multiplied by the weights and summed. The summed scores can then be ranked to identify the highest.
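
The scoring step above can be sketched in a few lines of Python. The column names, score rules and weights are invented for the example, and the function is not part of PAL.

```python
def weighted_scores(records, score_maps, weights):
    """Return (record_id, total) pairs ranked from highest to lowest total.

    `score_maps[col]` turns a raw column value into a score;
    `weights[col]` is the weight applied to that column's score.
    """
    totals = []
    for rec_id, rec in records.items():
        total = sum(weights[col] * score_maps[col](rec[col]) for col in weights)
        totals.append((rec_id, total))
    return sorted(totals, key=lambda t: t[1], reverse=True)

# Hypothetical records, score rules and weights.
records = {"r1": {"income": 80, "age": 30}, "r2": {"income": 40, "age": 50}}
score_maps = {
    "income": lambda v: 10 if v >= 60 else 5,
    "age": lambda v: 8 if v < 40 else 4,
}
weights = {"income": 2, "age": 1}

ranked = weighted_scores(records, score_maps, weights)
print(ranked)
```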

Exponential Regression - An approach to model the relationship between a variable Y and one or more variables denoted X1, X2, X3... In exponential regression, data are modeled using an exponential function, and unknown model parameters are estimated from the data using the least squares criterion.
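
For a single predictor, a common way to fit an exponential model is to take logarithms and apply ordinary least squares to the transformed data. The Python sketch below does this for y = a * exp(b * x). The model form with one predictor, the function name and the data are assumptions, and fitting in log space minimizes squared error of ln(y) rather than of y itself.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on ln(y); requires y > 0."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxy / sxx                       # slope of ln(y) against x
    a = math.exp(mean_l - b * mean_x)   # intercept mapped back from log space
    return a, b

# Hypothetical noise-free data generated from y = 2 * exp(0.5 * x).
xs = [0, 1, 2, 3]
ys = [2 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))
```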

Logistic Regression - Predicts the outcome of a categorical variable (a variable that can take on a limited number of categories) based on one or more predictor variables. The probabilities describing the possible outcome are modeled as a function of the explanatory variables, using a logistic function. It is analogous to linear regression but takes a categorical target field instead of a numeric one.
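
The role of the logistic function can be shown with a short Python sketch for the two-category case: a linear combination of the predictors is mapped to a probability between 0 and 1. The coefficients below are made up for illustration, and estimating them from data is outside the scope of this sketch.

```python
import math

def predict_probability(features, coefficients, intercept):
    """Probability of the positive class under a fitted logistic model."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted parameters for two predictors.
coefficients = [1.5, -0.5]
intercept = 0.0

p_neutral = predict_probability([0.0, 0.0], coefficients, intercept)
p_high = predict_probability([2.0, 1.0], coefficients, intercept)
print(p_neutral, p_high)
```

A record is then assigned to the positive category when its probability exceeds a chosen threshold, commonly 0.5.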

Inter-Quartile Range Test - Given a series of numeric data, the Inter-Quartile Range is the difference between the 3rd quartile (Q3) and the 1st quartile (Q1) of that data series. Values which are several multiples of the IQR from the median are identified as outliers.
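
A minimal Python sketch of an IQR-based outlier check follows. The slide does not give the exact rule, so the sketch assumes the widely used fence convention (values below Q1 - m * IQR or above Q3 + m * IQR, with m = 1.5 by default); the function name and data are also assumptions.

```python
import statistics

def iqr_outliers(data, multiplier=1.5):
    """Return values outside [Q1 - m * IQR, Q3 + m * IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [x for x in data if x < low or x > high]

# Hypothetical series with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 14, 13, 12, 95]
print(iqr_outliers(data))
```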

Bi-Variate Geometric Regression - An approach to model the relationship between a dependent numeric variable Y and an independent numeric variable X. In geometric regression, data are modeled using a geometric function, and unknown model parameters are estimated from the data using least squares regression.

Bi-Variate Natural Logarithmic Regression – An approach to model the relationship between a dependent numeric variable Y and an independent numeric variable X. In natural logarithmic regression, data are modeled using a natural logarithmic function, and unknown model parameters are estimated from the data using least squares regression.
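
Both bi-variate regressions above can be reduced to a straight-line fit after a logarithmic transformation. The Python sketch below assumes the model forms y = a * x^b (geometric) and y = a + b * ln(x) (natural logarithmic); these forms, the function names and the data are assumptions for illustration rather than a statement of how PAL implements the functions.

```python
import math

def fit_line(us, vs):
    """Ordinary least squares for v = intercept + slope * u."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    slope = (sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
             / sum((u - mean_u) ** 2 for u in us))
    return mean_v - slope * mean_u, slope

def fit_geometric(xs, ys):
    """Fit y = a * x**b via a line through (ln x, ln y); needs x, y > 0."""
    intercept, slope = fit_line([math.log(x) for x in xs],
                                [math.log(y) for y in ys])
    return math.exp(intercept), slope

def fit_natural_log(xs, ys):
    """Fit y = a + b * ln(x) via a line through (ln x, y); needs x > 0."""
    return fit_line([math.log(x) for x in xs], ys)

# Hypothetical noise-free data.
xs = [1, 2, 4, 8]
geo_a, geo_b = fit_geometric(xs, [3 * x ** 1.5 for x in xs])
log_a, log_b = fit_natural_log(xs, [1 + 2 * math.log(x) for x in xs])
print(round(geo_a, 6), round(geo_b, 6), round(log_a, 6), round(log_b, 6))
```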

Single, Double, Triple Exponential Smoothing - Techniques that can be applied to time series data, either to produce smoothed data for presentation, or to make forecasts. Single smoothing is used when the time series is stationary, double when there is a trend and triple when there is seasonality. Older values in the time series are given less importance, with the weights forming an exponential decay.
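
The single and double variants can be sketched as follows in Python. The double variant shown is Holt's linear method, which is one common formulation and an assumption here; the initialization choices, function names and series are also illustrative. Triple smoothing adds a seasonal component and is omitted for brevity.

```python
def single_smoothing(series, alpha):
    """s_t = alpha * x_t + (1 - alpha) * s_(t-1), starting from s_0 = x_0."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

def double_smoothing_forecast(series, alpha, beta, horizon=1):
    """Holt's linear method: track a level and a trend, then extrapolate."""
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        previous_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level + horizon * trend

# Hypothetical series: one for smoothing, one with a steady upward trend.
smoothed = single_smoothing([10, 20, 30], alpha=0.5)
forecast = double_smoothing_forecast([1, 2, 3, 4], alpha=0.5, beta=0.5)
print(smoothed, forecast)
```

Expanding the single-smoothing recursion shows the exponentially decaying weights: the value k steps back contributes with weight alpha * (1 - alpha)^k.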


PAL: Functions Available for GA use by Customers and Partners in SP5 – 3 –

Polynomial Regression - An approach to model the relationship between a numeric variable Y and a numeric variable X, raised to the power of 2, 3, 4 etc., denoted X^2, X^3, X^4... In polynomial regression, data are modeled using polynomial functions, and unknown model parameters are estimated from the data using the least squares criterion.

Variance Test - Given a series of numeric data, the Variance Test simply calculates the variance. Values which are several multiples of the variance from the mean are identified as outliers.
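
A short Python sketch of this kind of test is given below. Note the assumption: the sketch measures distance from the mean in units of the standard deviation (the square root of the variance), which is the usual convention, whereas the slide states the threshold in terms of the variance. The multiplier, function name and data are also illustrative.

```python
import statistics

def variance_test_outliers(data, multiplier=2.0):
    """Return values further than `multiplier` standard deviations from the mean."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # sample standard deviation
    return [x for x in data if abs(x - mean) > multiplier * sd]

# Hypothetical series with one extreme value.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(variance_test_outliers(data))
```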

Anomaly Detection - This is conceptually the ‘reverse’ of cluster analysis. We look for values furthest away from their nearest cluster centre, measure the absolute and percentage distance and rank the largest ‘anomalies’ or outliers.
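
Given cluster centres from a prior clustering step, the ranking described above can be sketched in Python as follows. The centres, points and function name are assumptions, only the absolute (Euclidean) distance is computed, and the clustering itself is not shown.

```python
import math

def rank_anomalies(points, centres, top=1):
    """Rank points by distance to their nearest centre, largest first."""
    scored = [(min(math.dist(p, c) for c in centres), p) for p in points]
    return sorted(scored, reverse=True)[:top]

# Hypothetical cluster centres and observations.
centres = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 1.0), (10.0, 9.0), (5.0, 5.0), (1.0, 0.0)]

worst = rank_anomalies(points, centres, top=1)
print(worst)
```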

Sampling – An aspect of statistics concerned with the selection of an unbiased or random subset of individual observations within a population of individuals intended to yield some knowledge about the population of concern, especially for the purposes of making predictions based on statistical inference.

Binning – A common requirement prior to running certain predictive algorithms. It generally reduces the complexity of the model, for example the model in a decision tree can become very complex if every value of a numeric variable becomes a branch in the tree. Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it. The sorted values are distributed into a number of “buckets” or bins.
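
One simple binning scheme is equal-frequency binning with smoothing by bin means: sort the values, split them into buckets of roughly equal size, and replace each value by the mean of its bucket. The Python sketch below illustrates that scheme only; the function name and data are assumptions, and other binning and smoothing choices exist.

```python
def smooth_by_bin_means(values, n_bins):
    """Sort values, split into n_bins buckets, replace each by its bucket mean."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    smoothed, start = [], 0
    for i in range(n_bins):
        # The first `extra` buckets take one additional value.
        end = start + size + (1 if i < extra else 0)
        bucket = ordered[start:end]
        if bucket:
            mean = sum(bucket) / len(bucket)
            smoothed.extend([mean] * len(bucket))
        start = end
    return smoothed

# Hypothetical sorted prices split into three buckets.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
```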

Scaling - This function is used where the data is to be scaled to fall within a specified range, such as -1.0 to 1.0, or 0.0 to 1.0. You can normalize an attribute by scaling its values to make them fall within a specified range. Normalization is particularly useful for classification algorithms involving neural networks, or distance measurements such as nearest-neighbor classification and clustering. This PAL algorithm includes three data normalization methods: min-max, z-score, and decimal scaling.
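
The three normalization methods named above can be sketched in Python using their textbook definitions. This is not the PAL interface: the function names are assumptions, the z-score uses the population standard deviation (also an assumption), and constant input would cause a division by zero in the first two functions.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max([10, 20, 30], -1.0, 1.0))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-991, 45, 120]))
```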

Kohonen Self Organized Maps - A type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps are different to other artificial neural networks in the sense that they use a neighborhood function to preserve the topological properties of the input space.

