Upload
-
View
15
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Statistical Analysis With Oracle
Citation preview
Statistical Analysis with Oracle
Marc Daniau – Product group [email protected]
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 2
Introduction
The ANSI SQL offers analytical capabilities useful for advanced data analysis like linear regression and correlation. Beyond supporting the standard SQL functions Oracle provides additional vendor-specific functions that enable to perform statistical analysis such as descriptive statistics and hypothesis testing. In the following presentation we will put into practice the main statistical built-in functions from Oracle 10g release 2. We will also show an example of a user-defined function.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 3
Topics
Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 4
Summarizing a data set
The built-in SQL functions involved in the following sample are avg, median, stats_mode, variance, stddev, min,
max and percentile_cont.
Note there is no built-in functions for skew and kurtosis; we had to use custom SQL expressions.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 5
The BOBJ Universe
New functions can be added to the list by updating the ‘oracle.prm’ file.
Here is a sample Universe exposing the stats_* functions.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 6
Box Plot
The quartiles are computed using the Oracle SQL function percentile_cont() within group (order by ()).
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 7
A real-life example: GE Delivery span
The delivery span focuses on the time between when a customer requested the product and when it was delivered.
The span is a measure of variation similar to the inter-quartile range but instead of looking at the middle 50% of
the observations, it looks at the middle 90%. Span = 95th Percentile – 5th Percentile
The Oracle SQL functions to compute the span are: percentile_cont(0.95) and percentile_cont(0.05).
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 8
A real-life example: GE Delivery span
The goal is to squeeze the two sides of the delivery span, days early & days late, ever closer to the center: the exact day the customer desired. Reducing variation is what quality is about.
Days Late
Days Early
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 9
Topics
Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 10
Simple Linear Regression Studying the relationship of two time series
The built-in SQL functions used are regr_* and stats_*.
In this DeskI document, it appears that X and Y increase together.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 11
Simple Linear Regression Checking the normality of the residuals
The SQL function row_number() over() is used. A custom PL/SQL function inverse_phi() is employed for building the X axis; an alternative consists of using a lookup Z table.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 12
User defined functions
The Oracle user can extend the list of built-in functions with his own PL/SQL functions. Following are examples of custom functions.
The custom function inverse_phi().
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 13
User defined functions
Custom functions can be exposed in the BOBJ SQL editor by adding them in the ‘oracle.prm’ file.
The custom function inverse_phi().
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 14
Simple Linear Regression Detecting Outliers (unusual values)
The variables X and Y in this example are not time series. Mortality (Y) tends to decrease as education (X) increases. The limits are computed with the regr_* SQL functions.
Ouliers appear in red.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 15
Simple Linear Regression Detecting Outliers (unusual values)
After changing the sigma factor from 2 to 2.5, New Orleans is no longer beyond limits.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 16
Simple Non Linear Regression
Some simple non-linear relationships can be transformed into linear relationships. Every time you can transform
a relationship into an equation of the form Y = a + b X , you can use the least squares method to fit the data.
This technique is used for fitting curves such as exponential, logarithmic, power, hyperbola, logistic (Pearl) and Gompertz.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 17
Simple Non Linear Regression Fitting a S-shaped curve (Pearl)
The sample below plots the cumulative sales by month for a given product. The X and Y variables are computed using respectively row_number() over and sum(sum()) over(). The ln() function is used for transformation; regr_slope() and regr_intercept() for getting the regression coefficients.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 18
Simple Non Linear Regression Fitting a S-shaped curve (Gompertz)
The error values are computed within the BOBJ report.
The number of periods to forecast
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 19
Weighted Least Squares (local regression) Smoothing a time series
The moving slope and intercept are calculated on the RDBMS side. The user chooses the strength of the smoother (moving window size).
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 20
Multiple Linear Regression Forecasting
The regression coefficients are calculated on the fly by the rdbms using regr_* functions.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 21
Topics
Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 22
Checking for auto-correlation The Lag Plot
The lag plot helps uncover seasonality and patterns from sequential data (e.g. time series) that may have been missed by looking only at a line chart. The lag plot is useful to check auto-correlation (lack of independence in the series).
It consists of a scatter diagram plotting Yt on the vertical axis versus Yt-lag on the horizontal axis.
The Oracle SQL function lag(,) over() is illustrated next.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 23
Checking for auto-correlation The Lag Plot
The sample line chart below tells us there is no shift of average over time. The lag plot (lag 1) on the right shows a pattern that cannot be seen in the line chart. Note that outliers (unusual values) appear more clearly on the lag plot.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 24
Auto-correlation function (ACF)
While the lag plot displays the individual data points for a given lag, the ACF plot gives a summary picture of auto-correlation over multiple lags. The ACF plot is useful in identifying seasonal or cyclical patterns in a time series.
It shows the correlation coefficient statistic on the vertical axis over the different lags on the horizontal axis.
The Oracle SQL functions regr_* are used in the following example.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 25
ACF Plot
The example below is the ACF plot of the monthly Australian beer production. The thin bars represent the correlation coefficients. The dotted flat lines are the significance limits.
The peak on lag 12 indicates seasonality.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 26
Topics
Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 27
Comparison types
Organizations measure performance of individual entities such as products, geographies, suppliers or employees by making different types of comparisons :
Over time
Against pre-determined goals
Against a comparator group
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 28
Evaluation against a group
Most of the examples presented in this section illustrate how to evaluate the performance of an individual entity against a comparator group. We will see various methods:
Percent rank T-score Percent of leader Range score Percent of total Individual ratio versus group ratio
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 29
Percent Rank
The WebI document below ranks countries on economic performance. It involves the following Oracle SQL functions: rank() over() and percent_rank() over().
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 30
T-score Comparing against the group average
The Xcelsius radar chart here requires the avg() over() and stddev_pop() over() SQL functions. The data set consists of 7 indicators and 7 countries.
The blue line represents the group average normalized at value 50. The amber line corresponds to the country T-score (here USA).
The slicer allows the user to see the performance of the different countries in the G7 group.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 31
Percent of Leader Comparing against the best in the group
The sample document below rates UK versus other nations on educational research. It requires the max() over() SQL functions. The underlying data set covers 35 nations and includes 7 indicators within a period from 1987 to 1998.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 32
Range score Positioning the individual in the group range
The report below evaluates companies performance based on 3 indicators. The range score normalizes the results before computing a composite score. It involves the SQL functions min() over() and max() over().
We apply a weighted average to obtain a composite score. This is done within the BOBJ document.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 33
Percent of Total
Even though it is limited to additive indicators such as revenue, cost or number of customers, the percent of
group total is commonly used. The Pareto chart is an example of graph that displays the percentage of
each entity (represented as bars) relative to the total value of the group. It helps identify the largest contributors by presenting them first on the left side of the graph.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 34
Percent of Total - Pareto chart
The Oracle built-in function ratio_to_report() over() enable to compute the percent of total required by the Pareto chart.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 35
Cumulative Percent of Total
Here the percent of total for the two indicators is obtained with the ratio_to_report() over() SQL function. The cumulative sum is done within the BOBJ crosstab.
The report indicates that Callahan generates the largest revenue with the smallest number of customers. We also observe that roughly half of the Sales (48.7%) is generated by a quarter of the customers (22.4%) … and by a quarter of the sales force (Callahan and King).
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 36
Individual ratio versus group ratio
We just saw how we can apply a percent of total to additive indicators like Sales and Profit. Let’s assume we want to track the performance of Gross Profit rate calculated as Gross Profit divided by Gross Sales. The percent of total analysis will not work in that case since ratios are not additive. We could use the average of the Gross Profit rates in the group as the baseline but instead we will compare to the group Gross Profit rate obtained as: group total Gross Profit / group total Gross Sales.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 37
Individual ratio versus group ratio
We use the rollup() SQL function in order to generate on the fly a summary row that aggregates values for Sales, Profit and Profit rate.
The blue cursors on both sides of the table indicate the group row which serves as the baseline. The categories above the baseline exhibit a better profit rate than the group.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 38
Measuring process performance
Companies that implement a quality improvement methodology measure performance of their processes by analyzing the variation. A process is given upper and lower specification limits (USL & LSL) and a target. Variation over the target is measured to evaluate how capable the process is relative to the specifications. One process performance measurement method is presented next for illustration.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 39
Process capability/performance Measuring process variations
The Oracle built-in SQL function stats_one_way_anova() returns the variations.
The custom PL/SQL function phi() is required to get the ppm/defect rates.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 40
Topics
Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 41
Frequency Histogram – 1/3
The bucketing of individual values is performed with the width_bucket() function. The count aggregation is done by the rdbms. Empty buckets are handled in the BOBJ report.
The data set is from the Oracle HR demo database. Salary exibits a distribution stretched out towards the right.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 42
Frequency Histogram – 2/3
Running the query.
Listing the main power-transforms.
The end-user can enter a custom value: e.g. 0.277
The 3 arguments of the width_bucket() function are prompted
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 43
Frequency Histogram – 3/3
The square root transform brings some symmetry to the salary data. The power-transforms require the following SQL functions: power() and ln().
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 44
Automatic bucketing
The BOBJ document below requires no inputs from the user with regards to the histogram buckets. We use the Sturges rule to determine the number of bars automatically. The histogram count aggregation is performed by the rdbms.
The normal plot complements the histogram in assessing the normality of the data set.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 45
User-defined intervals
In the sample below the end-user enters the histogram interval width. The count aggregation is done by the rdbms. The Gaussian curve is estimated within the BOBJ document.
The blue bars represent the differences between the observed frequencies (histogram bars) and the estimated frequencies (fitted curve). The Gaussian curve is turned into a horizontal line (zero baseline).
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 46
Weibull distribution
The Hazard plot below uses the following SQL Functions: row_number() over() and sum() over(). The regression line is computed within the BOBJ document.
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 47
Topics
Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 48
t-test on a single group Comparing the mean to a target
In this sample report we use the Oracle SQL function stats_t_test_one().
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 49
t-test for two paired groups Before-After comparison of mean
In this sample report we use the Oracle SQL function stats_t_test_paired().
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 50
t-test for two independent groups Comparing means
In this sample report we use the Oracle SQL function stats_t_test_indep().
COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED. SLIDE 51
Analysis of variance Comparing means across multiple groups
The Oracle SQL function stats_one_way_anova() is used here. The following sample involves six groups.
© 2008 SAP AG. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice. Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors. Microsoft, Windows, Excel, Outlook, and PowerPoint are registered trademarks of Microsoft Corporation. IBM, DB2, DB2 Universal Database, System i, System i5, System p, System p5, System x, System z, System z10, System z9, z10, z9, iSeries, pSeries, xSeries, zSeries, eServer, z/VM, z/OS, i5/OS, S/390, OS/390, OS/400, AS/400, S/390 Parallel Enterprise Server, PowerVM, Power Architecture, POWER6+, POWER6, POWER5+, POWER5, POWER, OpenPower, PowerPC, BatchPipes, BladeCenter, System Storage, GPFS, HACMP, RETAIN, DB2 Connect, RACF, Redbooks, OS/2, Parallel Sysplex, MVS/ESA, AIX, Intelligent Miner, WebSphere, Netfinity, Tivoli and Informix are trademarks or registered trademarks of IBM Corporation. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. Adobe, the Adobe logo, Acrobat, PostScript, and Reader are either trademarks or registered trademarks of Adobe Systems Incorporated in the United States and/or other countries. Oracle and Java are registered trademarks of Oracle and/or its affiliates. UNIX, X/Open, OSF/1, and Motif are registered trademarks of the Open Group. Citrix, ICA, Program Neighborhood, MetaFrame, WinFrame, VideoFrame, and MultiWin are trademarks or registered trademarks of Citrix Systems, Inc. HTML, XML, XHTML and W3C are trademarks or registered trademarks of W3C®, World Wide Web Consortium, Massachusetts Institute of Technology.
© 2008 SAP AG. All rights reserved.
SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP BusinessObjects Explorer, StreamWork, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and other countries.
Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius, and other Business Objects products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Business Objects Software Ltd. Business Objects is an SAP company.
Sybase and Adaptive Server, iAnywhere, Sybase 365, SQL Anywhere, and other Sybase products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Sybase, Inc. Sybase is an SAP company.
All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary.
The information in this document is proprietary to SAP. No part of this document may be reproduced, copied, or transmitted in any form or for any purpose without the express prior written permission of SAP AG.