Statistical Analysis with Oracle

Marc Daniau – Product group [email protected]

COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 2

Introduction

ANSI SQL offers analytical capabilities that are useful for advanced data analysis, such as linear regression and correlation. Beyond the standard SQL functions, Oracle provides vendor-specific functions for statistical analysis, including descriptive statistics and hypothesis testing. This presentation puts into practice the main built-in statistical functions of Oracle 10g Release 2 and also shows an example of a user-defined function.

Topics

Descriptive Statistics
Linear Regression
Correlation
Performance measurement
Distribution analysis
Hypothesis Testing

Summarizing a data set

The built-in SQL functions involved in the following sample are avg, median, stats_mode, variance, stddev, min, max and percentile_cont.

Note that there are no built-in functions for skewness and kurtosis; we had to use custom SQL expressions.
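The deck does not show its custom expressions for skewness and kurtosis. As a rough illustration of what such expressions compute, here is a Python sketch based on population central moments. The choice of the population (rather than sample-corrected) formulas is an assumption, and the data are made up.

```python
from statistics import fmean

def skew_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Made-up sample with a longer right tail, so the skewness is positive.
skew, kurt = skew_kurtosis([2, 4, 4, 4, 5, 5, 7, 9])
```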

The BOBJ Universe

New functions can be added to the list by updating the ‘oracle.prm’ file.

Here is a sample Universe exposing the stats_* functions.

Box Plot

The quartiles are computed using the Oracle SQL function percentile_cont() within group (order by ()).

A real-life example: GE Delivery span

The delivery span focuses on the time between when a customer requested the product and when it was delivered.

The span is a measure of variation similar to the inter-quartile range but instead of looking at the middle 50% of the observations, it looks at the middle 90%.

Span = 95th Percentile – 5th Percentile

The Oracle SQL functions to compute the span are: percentile_cont(0.95) and percentile_cont(0.05).
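To make the arithmetic concrete, the Python sketch below mimics the linear interpolation that percentile_cont performs (row number 1 + p(n - 1), interpolated between the neighbouring rows) and derives the span from it. The function names and the data are illustrative assumptions, not taken from the deck.

```python
import math

def percentile_cont(values, p):
    """Interpolated percentile: row number rn = 1 + p * (n - 1),
    blended between the rows at floor(rn) and ceil(rn)."""
    data = sorted(values)
    rn = 1 + p * (len(data) - 1)
    lo, hi = math.floor(rn), math.ceil(rn)
    if lo == hi:
        return float(data[lo - 1])
    return (hi - rn) * data[lo - 1] + (rn - lo) * data[hi - 1]

def delivery_span(days):
    """Width of the middle 90% of the observations."""
    return percentile_cont(days, 0.95) - percentile_cont(days, 0.05)

# Hypothetical data: delivery date minus requested date, in days.
days = list(range(-10, 11))
span = delivery_span(days)
```

Negative values stand for early deliveries and positive values for late ones, so a shrinking span means both tails are moving toward the requested date.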

A real-life example: GE Delivery span

The goal is to squeeze the two sides of the delivery span, days early & days late, ever closer to the center: the exact day the customer desired. Reducing variation is what quality is about.

Days Late

Days Early

Simple Linear Regression - Studying the relationship of two time series

The built-in SQL functions used are regr_* and stats_*.

In this DeskI document, it appears that X and Y increase together.

Simple Linear Regression - Checking the normality of the residuals

The SQL function row_number() over() is used. A custom PL/SQL function inverse_phi() is employed for building the X axis; an alternative consists of using a lookup Z table.
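The construction of the normal plot can be sketched in Python as follows. Here NormalDist().inv_cdf stands in for the custom inverse_phi(), and enumerate() plays the part of row_number(). The plotting position (i - 0.5) / n is an assumption, since the deck does not say which one it uses, and the residuals are made up.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each sorted residual with a theoretical standard-normal quantile."""
    ordered = sorted(residuals)
    n = len(ordered)
    inv_phi = NormalDist().inv_cdf  # stand-in for the custom inverse_phi()
    # Assumed plotting position: (i - 0.5) / n for the i-th smallest residual.
    return [(inv_phi((i - 0.5) / n), r) for i, r in enumerate(ordered, start=1)]

# Made-up residuals; x is the theoretical quantile, y the observed residual.
points = normal_plot_points([0.3, -1.2, 0.8, -0.1, 0.2])
```

If the residuals are roughly normal, the points fall close to a straight line.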

User defined functions

Oracle users can extend the list of built-in functions with their own PL/SQL functions. Examples of custom functions follow.

The custom function inverse_phi().

User defined functions

Custom functions can be exposed in the BOBJ SQL editor by adding them to the ‘oracle.prm’ file.

The custom function inverse_phi().

Simple Linear Regression - Detecting Outliers (unusual values)

The variables X and Y in this example are not time series. Mortality (Y) tends to decrease as education (X) increases. The limits are computed with the regr_* SQL functions.

Outliers appear in red.

Simple Linear Regression - Detecting Outliers (unusual values)

After changing the sigma factor from 2 to 2.5, New Orleans is no longer beyond limits.
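The effect of the sigma factor can be illustrated with a small Python sketch: fit a least-squares line, then flag the points whose residual exceeds the factor times the residual standard deviation. The exact form of the limits in the deck is not shown, so the use of the residual standard deviation with n - 2 degrees of freedom is an assumption, and the data are synthetic.

```python
import math
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares; the quantities regr_slope / regr_intercept return."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def outliers(xs, ys, sigma_factor=2.0):
    """Indices of points whose residual exceeds sigma_factor * residual std dev."""
    slope, intercept = fit_line(xs, ys)
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(r * r for r in resid) / (len(xs) - 2))  # assumed: n - 2 dof
    return [i for i, r in enumerate(resid) if abs(r) > sigma_factor * s]

# Synthetic data: a straight line with one point pushed upward.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] += 10
flagged_at_2 = outliers(xs, ys, 2.0)
flagged_at_3 = outliers(xs, ys, 3.0)
```

Widening the limits empties the list of flagged points, which is the same mechanism that takes a borderline observation out of the outlier set on the slide.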

Simple Non Linear Regression

Some simple non-linear relationships can be transformed into linear relationships. Whenever a relationship can be rewritten as an equation of the form Y = a + bX, the least squares method can be used to fit the data.

This technique is used for fitting curves such as exponential, logarithmic, power, hyperbola, logistic (Pearl) and Gompertz.
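As a minimal sketch of the idea, take the exponential curve y = a·exp(b·x). Taking logarithms gives ln(y) = ln(a) + b·x, which is linear in x, so an ordinary least-squares fit on (x, ln y) recovers a and b. The Python below uses synthetic, noise-free data; the function name is illustrative.

```python
import math
from statistics import fmean

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by regressing ln(y) on x (requires y > 0)."""
    ln_y = [math.log(y) for y in ys]
    mx, my = fmean(xs), fmean(ln_y)
    b = (sum((x - mx) * (v - my) for x, v in zip(xs, ln_y))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)  # back-transform the intercept
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [3.0 * math.exp(0.5 * x) for x in xs]  # synthetic data with a = 3, b = 0.5
a, b = fit_exponential(xs, ys)
```

Note that the fit minimizes squared errors on the log scale, not on the original scale, so with noisy data the result differs slightly from a direct non-linear fit.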

Simple Non Linear Regression - Fitting an S-shaped curve (Pearl)

The sample below plots the cumulative sales by month for a given product. The X and Y variables are computed using row_number() over() and sum(sum()) over(), respectively. The ln() function is used for the transformation; regr_slope() and regr_intercept() return the regression coefficients.

Simple Non Linear Regression - Fitting an S-shaped curve (Gompertz)

The error values are computed within the BOBJ report.

The number of periods to forecast

Weighted Least Squares (local regression) - Smoothing a time series

The moving slope and intercept are calculated on the RDBMS side. The user chooses the strength of the smoother (moving window size).

Multiple Linear Regression - Forecasting

The regression coefficients are calculated on the fly by the rdbms using regr_* functions.

Checking for auto-correlation - The Lag Plot

The lag plot helps uncover seasonality and patterns from sequential data (e.g. time series) that may have been missed by looking only at a line chart. The lag plot is useful to check auto-correlation (lack of independence in the series).

It consists of a scatter diagram plotting Y(t) on the vertical axis versus Y(t-lag) on the horizontal axis.

The Oracle SQL function lag(,) over() is illustrated next.

Checking for auto-correlation - The Lag Plot

The sample line chart below tells us there is no shift of average over time. The lag plot (lag 1) on the right shows a pattern that cannot be seen in the line chart. Note that outliers (unusual values) appear more clearly on the lag plot.

Auto-correlation function (ACF)

While the lag plot displays the individual data points for a given lag, the ACF plot gives a summary picture of auto-correlation over multiple lags. The ACF plot is useful in identifying seasonal or cyclical patterns in a time series.

It shows the correlation coefficient statistic on the vertical axis over the different lags on the horizontal axis.

The Oracle SQL functions regr_* are used in the following example.

ACF Plot

The example below is the ACF plot of the monthly Australian beer production. The thin bars represent the correlation coefficients. The dotted flat lines are the significance limits.

The peak on lag 12 indicates seasonality.
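How a seasonal peak shows up in an ACF can be sketched in Python with a synthetic 12-period series (not the beer data from the slide). This sketch uses the usual sample autocorrelation estimator, whereas the deck correlates lagged pairs with the regr_* functions, so the values would differ slightly. The ±1.96/√n significance limit is also an assumption.

```python
import math
from statistics import fmean

def acf(series, max_lag):
    """Sample autocorrelation r_k for k = 1..max_lag."""
    n = len(series)
    mean = fmean(series)
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]

# Synthetic monthly series with a 12-period cycle, 10 years long.
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]
r = acf(series, 12)                     # r[k - 1] holds the lag-k coefficient
limit = 1.96 / math.sqrt(len(series))   # assumed approximate 95% limit
```

In this series the lag-12 coefficient is strongly positive and the lag-6 coefficient strongly negative, both far outside the limits, which is the signature of a yearly cycle in monthly data.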

Comparison types

Organizations measure performance of individual entities such as products, geographies, suppliers or employees by making different types of comparisons:

Over time

Against pre-determined goals

Against a comparator group

Evaluation against a group

Most of the examples presented in this section illustrate how to evaluate the performance of an individual entity against a comparator group. We will see various methods:

Percent rank
T-score
Percent of leader
Range score
Percent of total
Individual ratio versus group ratio

Percent Rank

The WebI document below ranks countries on economic performance. It involves the following Oracle SQL functions: rank() over() and percent_rank() over().
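The two functions are related by percent_rank = (rank - 1) / (n - 1), with tied values sharing the same rank. The Python sketch below reproduces that arithmetic on hypothetical scores; the names and values are illustrative only.

```python
def rank_and_percent_rank(scores, descending=True):
    """Return {name: (rank, percent_rank)}; ties share the lowest rank."""
    n = len(scores)
    ordered = sorted(scores.values(), reverse=descending)
    result = {}
    for name, value in scores.items():
        rank = ordered.index(value) + 1  # position of the first equal value
        pct = (rank - 1) / (n - 1) if n > 1 else 0.0
        result[name] = (rank, pct)
    return result

# Hypothetical scores, for illustration only.
ranks = rank_and_percent_rank({"A": 90, "B": 75, "C": 75, "D": 60, "E": 40})
```

With a descending order the best entity gets a percent rank of 0 and the worst gets 1; reversing the order flips the scale.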

T-score - Comparing against the group average

The Xcelsius radar chart here requires the avg() over() and stddev_pop() over() SQL functions. The data set consists of 7 indicators and 7 countries.

The blue line represents the group average normalized at value 50. The amber line corresponds to the country T-score (here USA).

The slicer allows the user to see the performance of the different countries in the G7 group.
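A common definition of the T-score, assumed here, is 50 + 10 × (value - group mean) / group standard deviation, so that the group average maps to 50. A minimal Python sketch with made-up values for one indicator:

```python
from statistics import fmean, pstdev

def t_scores(values):
    """T-score of each value against its own group (population std dev)."""
    mean, sd = fmean(values), pstdev(values)
    return [50 + 10 * (v - mean) / sd for v in values]

# Made-up indicator values for a group of eight entities.
scores = t_scores([2, 4, 4, 4, 5, 5, 7, 9])
```

Because every indicator ends up on the same scale, indicators with different units can share one radar chart.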

Percent of Leader - Comparing against the best in the group

The sample document below rates the UK versus other nations on educational research. It requires the max() over() SQL function. The underlying data set covers 35 nations and includes 7 indicators within a period from 1987 to 1998.

Range score - Positioning the individual in the group range

The report below evaluates company performance based on three indicators. The range score normalizes the results before computing a composite score. It involves the SQL functions min() over() and max() over().

We apply a weighted average to obtain a composite score. This is done within the BOBJ document.
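The range score is assumed here to be (value - group minimum) / (group maximum - group minimum), which places every entity between 0 and 1. The Python sketch below normalizes each indicator this way and then takes a weighted average. The weights and data are hypothetical, and it assumes that higher is better and that the values within each indicator are not all equal.

```python
def range_scores(values):
    """Scale values to [0, 1] by their position in the group range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def composite(indicators, weights):
    """Weighted average of range scores.

    indicators: one list per indicator, each holding one value per company.
    """
    scaled = [range_scores(column) for column in indicators]
    total = sum(weights)
    n_companies = len(indicators[0])
    return [sum(w * column[i] for w, column in zip(weights, scaled)) / total
            for i in range(n_companies)]

# Two hypothetical indicators for three companies, weighted 3 to 1.
result = composite([[10, 20, 30], [3, 1, 2]], [3, 1])
```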

Percent of Total

Even though it is limited to additive indicators such as revenue, cost or number of customers, the percent of group total is commonly used. The Pareto chart is an example of a graph that displays the percentage of each entity (represented as bars) relative to the total value of the group. It helps identify the largest contributors by presenting them first on the left side of the graph.

Percent of Total - Pareto chart

The Oracle built-in function ratio_to_report() over() computes the percent of total required by the Pareto chart.
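The figures behind a Pareto chart amount to a share of total per entity, sorted in descending order, with a running cumulative share. A minimal Python sketch on hypothetical values:

```python
def pareto(values):
    """Return (name, share of total, cumulative share) rows, largest first."""
    total = sum(values.values())
    rows, running = [], 0.0
    for name, value in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
        share = value / total   # what ratio_to_report yields for this row
        running += share
        rows.append((name, share, running))
    return rows

# Hypothetical revenue by entity.
rows = pareto({"Y": 30, "X": 50, "Z": 20})
```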

Cumulative Percent of Total

Here the percent of total for the two indicators is obtained with the ratio_to_report() over() SQL function. The cumulative sum is done within the BOBJ crosstab.

The report indicates that Callahan generates the largest revenue with the smallest number of customers. We also observe that roughly half of the Sales (48.7%) is generated by a quarter of the customers (22.4%) … and by a quarter of the sales force (Callahan and King).

Individual ratio versus group ratio

We just saw how we can apply a percent of total to additive indicators like Sales and Profit. Let’s assume we want to track the performance of Gross Profit rate calculated as Gross Profit divided by Gross Sales. The percent of total analysis will not work in that case since ratios are not additive. We could use the average of the Gross Profit rates in the group as the baseline but instead we will compare to the group Gross Profit rate obtained as: group total Gross Profit / group total Gross Sales.
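A small numeric illustration, with made-up figures, of why the two baselines differ: the average of the individual rates weights every category equally, while the group rate weights each category by its sales.

```python
# Hypothetical (gross sales, gross profit) per category.
data = {"A": (1000.0, 100.0), "B": (100.0, 40.0)}

rates = {name: profit / sales for name, (sales, profit) in data.items()}
mean_of_rates = sum(rates.values()) / len(rates)

total_sales = sum(sales for sales, _ in data.values())
total_profit = sum(profit for _, profit in data.values())
group_rate = total_profit / total_sales  # baseline used for the comparison

above_baseline = [name for name, rate in rates.items() if rate > group_rate]
```

Here the mean of the rates is 25%, while the group rate is about 12.7% because the large category A dominates the totals.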

Individual ratio versus group ratio

We use rollup(), an extension of the SQL GROUP BY clause, to generate on the fly a summary row that aggregates values for Sales, Profit and Profit rate.

The blue cursors on both sides of the table indicate the group row which serves as the baseline. The categories above the baseline exhibit a better profit rate than the group.

Measuring process performance

Companies that implement a quality improvement methodology measure the performance of their processes by analyzing the variation. A process is given upper and lower specification limits (USL & LSL) and a target. Variation around the target is measured to evaluate how capable the process is relative to the specifications. One process performance measurement method is presented next for illustration.

Process capability/performance - Measuring process variations

The Oracle built-in SQL function stats_one_way_anova() returns the variations.

The custom PL/SQL function phi() is required to get the ppm/defect rates.
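The deck does not spell out which indices it reports, so the Python sketch below assumes the common Pp and Ppk definitions and a normally distributed process. NormalDist().cdf stands in for the custom phi(), and the specification limits and process figures are made up.

```python
from statistics import NormalDist

def process_performance(mean, sigma, lsl, usl):
    """Return (Pp, Ppk, expected defects in parts per million)."""
    phi = NormalDist().cdf  # standard normal CDF, the role of the custom phi()
    pp = (usl - lsl) / (6 * sigma)
    ppk = min(usl - mean, mean - lsl) / (3 * sigma)
    below = phi((lsl - mean) / sigma)      # fraction under the lower limit
    above = 1 - phi((usl - mean) / sigma)  # fraction over the upper limit
    return pp, ppk, (below + above) * 1e6

# Hypothetical centered process with limits three sigmas from the mean.
pp, ppk, ppm = process_performance(mean=10.0, sigma=0.5, lsl=8.5, usl=11.5)
```

For this centered process both indices equal 1 and the expected defect rate is roughly 2,700 ppm.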

Frequency Histogram – 1/3

The bucketing of individual values is performed with the width_bucket() function. The count aggregation is done by the rdbms. Empty buckets are handled in the BOBJ report.
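The bucketing rule can be sketched in Python as follows: the range between the lower and upper bounds is split into equal-width intervals numbered from 1, with bucket 0 for values below the range and bucket n + 1 for values at or above the upper bound. The sketch assumes the lower bound is smaller than the upper bound, and the salary values are made up rather than taken from the HR schema.

```python
import math
from collections import Counter

def width_bucket(x, low, high, buckets):
    """Equal-width bucket number: 0 below low, buckets + 1 at or above high."""
    if x < low:
        return 0
    if x >= high:
        return buckets + 1
    return math.floor((x - low) * buckets / (high - low)) + 1

# Made-up salaries bucketed into 6 intervals of width 4000.
salaries = [2100, 2500, 3100, 4400, 6000, 9000, 12000, 24000]
counts = Counter(width_bucket(s, 2000, 26000, 6) for s in salaries)
```

Buckets 4 and 5 receive no values and are simply absent from the counts, which is why empty buckets have to be filled in on the reporting side.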

The data set is from the Oracle HR demo database. Salary exhibits a distribution stretched out towards the right (right-skewed).

Frequency Histogram – 2/3

Running the query.

Listing the main power-transforms.

The end-user can enter a custom value: e.g. 0.277

The user is prompted for the three bucketing arguments of the width_bucket() function: the lower bound, the upper bound and the number of buckets.

Frequency Histogram – 3/3

The square root transform brings some symmetry to the salary data. The power-transforms require the following SQL functions: power() and ln().

Automatic bucketing

The BOBJ document below requires no input from the user regarding the histogram buckets. We use Sturges' rule to determine the number of bars automatically. The histogram count aggregation is performed by the rdbms.
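Sturges' rule sets the number of bars to 1 + log2(n) for n observations. A minimal Python sketch, rounding up to the next whole bar (the rounding direction is an assumption):

```python
import math

def sturges_bins(n):
    """Number of histogram bars for n observations, rounded up."""
    return math.ceil(1 + math.log2(n))

bars_for_100 = sturges_bins(100)
```

The count grows slowly with the sample size: doubling the number of observations adds only one bar.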

The normal plot complements the histogram in assessing the normality of the data set.

User-defined intervals

In the sample below the end-user enters the histogram interval width. The count aggregation is done by the rdbms. The Gaussian curve is estimated within the BOBJ document.

The blue bars represent the differences between the observed frequencies (histogram bars) and the estimated frequencies (fitted curve). The Gaussian curve is turned into a horizontal line (zero baseline).

Weibull distribution

The Hazard plot below uses the following SQL Functions: row_number() over() and sum() over(). The regression line is computed within the BOBJ document.

t-test on a single group - Comparing the mean to a target

In this sample report we use the Oracle SQL function stats_t_test_one().
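The statistic behind a one-sample t-test is t = (sample mean - target) / (s / √n), where s is the sample standard deviation. The Python sketch below computes only this statistic on made-up data. The significance value, which requires the t distribution with n - 1 degrees of freedom, is left out.

```python
import math
from statistics import fmean, stdev

def t_statistic_one(sample, target):
    """One-sample t statistic: distance of the mean from target in standard errors."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    return (fmean(sample) - target) / standard_error

# Made-up measurements compared with a target of 4.
t = t_statistic_one([2, 4, 4, 4, 5, 5, 7, 9], 4)
```

The paired t-test reduces to the same computation applied to the before/after differences with a target of 0.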

t-test for two paired groups - Before-After comparison of mean

In this sample report we use the Oracle SQL function stats_t_test_paired().

t-test for two independent groups - Comparing means

In this sample report we use the Oracle SQL function stats_t_test_indep().

Analysis of variance - Comparing means across multiple groups

The Oracle SQL function stats_one_way_anova() is used here. The following sample involves six groups.
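One-way ANOVA compares the variation between group means with the variation inside the groups; the F ratio is the between-group mean square divided by the within-group mean square. A Python sketch on three small made-up groups (the slide's sample has six):

```python
from statistics import fmean

def one_way_anova_f(groups):
    """F ratio = between-group mean square / within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up data: the third group sits clearly above the other two.
f_ratio = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F ratio indicates that the group means differ by more than the within-group scatter would explain; the significance comes from the F distribution with the two degrees of freedom computed above.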

© 2008 SAP AG. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice. Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors. Microsoft, Windows, Excel, Outlook, and PowerPoint are registered trademarks of Microsoft Corporation. IBM, DB2, DB2 Universal Database, System i, System i5, System p, System p5, System x, System z, System z10, System z9, z10, z9, iSeries, pSeries, xSeries, zSeries, eServer, z/VM, z/OS, i5/OS, S/390, OS/390, OS/400, AS/400, S/390 Parallel Enterprise Server, PowerVM, Power Architecture, POWER6+, POWER6, POWER5+, POWER5, POWER, OpenPower, PowerPC, BatchPipes, BladeCenter, System Storage, GPFS, HACMP, RETAIN, DB2 Connect, RACF, Redbooks, OS/2, Parallel Sysplex, MVS/ESA, AIX, Intelligent Miner, WebSphere, Netfinity, Tivoli and Informix are trademarks or registered trademarks of IBM Corporation. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. Adobe, the Adobe logo, Acrobat, PostScript, and Reader are either trademarks or registered trademarks of Adobe Systems Incorporated in the United States and/or other countries. Oracle and Java are registered trademarks of Oracle and/or its affiliates. UNIX, X/Open, OSF/1, and Motif are registered trademarks of the Open Group. Citrix, ICA, Program Neighborhood, MetaFrame, WinFrame, VideoFrame, and MultiWin are trademarks or registered trademarks of Citrix Systems, Inc. HTML, XML, XHTML and W3C are trademarks or registered trademarks of W3C®, World Wide Web Consortium, Massachusetts Institute of Technology.

SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP BusinessObjects Explorer, StreamWork, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and other countries.

Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius, and other Business Objects products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Business Objects Software Ltd. Business Objects is an SAP company.

Sybase and Adaptive Server, iAnywhere, Sybase 365, SQL Anywhere, and other Sybase products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Sybase, Inc. Sybase is an SAP company.

All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary.

The information in this document is proprietary to SAP. No part of this document may be reproduced, copied, or transmitted in any form or for any purpose without the express prior written permission of SAP AG.