MA Statistics Tutorial


    This tutorial offers students with a limited statistics background a concise review of, and introduction to, the fundamental statistics topics used in the MA program. It also provides a refresher for students with more extensive statistics backgrounds.

    To encourage a practical understanding, topics are presented using actual air travel data and Excel screenshots of statistical results.

    There is a self-test at the end of each section to help each student evaluate their grasp of the material.

    No one will grade these self-tests; responsibility rests with the student. Students are advised to review incorrect answers and seek additional assistance in

    understanding incorrect answers if needed. Students may

    with questions.

    Additional, concise sources of information on the topics presented are available from Hyperstats and the Statsoft Electronic Textbook

    Section I

    Descriptive Statistics and

    Measures of Sampling Error

    Air Travel Data

    For 21 cities, the following data have been

    recorded or computed:

    City = city identifying code

    Fare = cheapest coach fare from Nashville to the city, in dollars, on Orbitz on a given day

    Distance = distance in miles for the route

    Fare per Mile = Fare divided by Distance

    Excel Screen Shot of Data

    Distribution of Fare per Mile

    The histogram has a normal (bell-shaped) distribution curve superimposed

    The distribution of fare per mile is similar to the normal after smoothing out the rectangles, but is slightly skewed to the right

    This graphic was produced by the statistical software package SPSS

    Table 1-- Descriptive Statistics for Fare per Mile

    Key Univariate Descriptive Statistics

    Mean = average; here, about 28 cents per mile

    Median = middle value (50th percentile); half of the values are above 28.8 cents per mile and half are below; the median is a better measure of the center of the data set when the data are highly skewed

    Standard deviation = typical distance of the observations from the mean; in this case, the 21 observations differ from the mean by an average of about 9.6 cents per mile

    Range = difference between the minimum and maximum values

    Skewness = degree of asymmetry; zero is perfectly symmetric; large positive values (1.0 or larger) indicate a skew to the right (a long right tail); large negative values indicate a skew to the left; the value of 0.559 indicates a slight rightward skew, as shown in the graph on the prior page
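    The statistics defined above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own computation: the values in `fare_per_mile` are hypothetical (the 21-city data appear only in a screenshot), and `sample_skewness` is an illustrative helper that uses the adjusted sample-skewness formula that Excel's SKEW function also uses.

```python
import statistics

# Hypothetical fare-per-mile values in dollars (NOT the tutorial's 21-city data).
fare_per_mile = [0.18, 0.22, 0.25, 0.27, 0.29, 0.31, 0.36, 0.44]


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    cubed = sum(((v - mean) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed


mean = statistics.mean(fare_per_mile)
median = statistics.median(fare_per_mile)
std_dev = statistics.stdev(fare_per_mile)
value_range = max(fare_per_mile) - min(fare_per_mile)
skew = sample_skewness(fare_per_mile)

print(f"mean={mean:.3f} median={median:.3f} sd={std_dev:.3f} "
      f"range={value_range:.3f} skewness={skew:.3f}")
```

    A positive skewness value here reflects the single large observation (0.44) stretching the right tail.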

    Sample Statistics v. Population Parameters

    The statistics reported in Table 1 are sample statistics; they summarize the 21 observations in the sample

    The full set of all possible fares between all cities of interest would represent the population of fares and fares per mile

    Population Parameter refers to a summary measure using all possible data; for example, the population mean or population standard deviation

    The sample statistics reported in Table 1 provide estimates of these population parameters

    Table 1 also provides numerical estimates of the accuracy and reliability of the sample mean in estimating the population mean (see next slide)

    Table 1-- Estimates of Sampling Error

    Key Univariate Descriptive Statistics

    Standard Error (of sample mean) = estimate of the likely sampling error between the sample mean and the population mean; 0.021 implies that repeated samples of the same size could easily find sample means 2.1 cents higher or lower

    Confidence Level (95%) = roughly two times the standard error (for 99%, it is roughly 2.6 times the standard error or a little more in small samples); as such, it provides a figure similar to the standard error, but with a wider margin for error; 0.044 with a 95% Confidence Level implies that about 95 out of 100 samples of this size would likely result in sample means within 4.4 cents of the population mean
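    As a rough check on the figures above, the standard error and 95% margin can be recomputed from the reported standard deviation and sample size. This sketch assumes the rounded values quoted in this section (standard deviation 0.096, mean 0.282, n = 21); the constant `T_CRIT_95_DF20` is the critical t-value for 20 degrees of freedom from a standard t-table, which is the "roughly two" multiplier.

```python
import math

n = 21            # number of cities in the sample
std_dev = 0.096   # sample standard deviation, dollars per mile (rounded)
mean = 0.282      # sample mean, dollars per mile (rounded)

# Critical t-value for a 95% interval with n - 1 = 20 degrees of freedom,
# taken from a standard t-table.
T_CRIT_95_DF20 = 2.086

standard_error = std_dev / math.sqrt(n)    # sampling error of the mean
margin = T_CRIT_95_DF20 * standard_error   # 95% margin for error
lower, upper = mean - margin, mean + margin

print(f"standard error = {standard_error:.3f}")
print(f"95% margin     = {margin:.3f}")
print(f"95% interval   = ({lower:.3f}, {upper:.3f})")
```

    Rounded to three decimals, these reproduce the 0.021 and 0.044 shown in Table 1.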

    How Reliable are Sample Error Estimates?

    Standard Errors and Confidence Intervals estimate sampling error

    Sampling Error is error arising because one is using less than the entire population

    To accurately estimate population parameters and sampling error, samples must be representative of the population

    Randomly selected samples are the best (though not foolproof) way of assuring this

    Error not related to sample selection (question bias, response bias, dishonest responses, data entry errors) must be small relative to the size of the sampling error

    This kind of error is called non-sampling error

    Using Sampling Error in Testing Claims

    (Hypothesis Testing)

    Estimates of sampling error permit claims or conjectures (hypotheses) concerning population parameters to be tested with sample statistics while taking into account a margin for error

    Testing a claim for the population mean:

    Suppose someone thinks that the mean fare per mile for the full population is 30 cents or higher

    Given the sample mean (0.282) and the standard error of 0.021, it is quite likely that another sample would yield an estimate of 30 cents or higher

    If we double the standard error to get a 95% confidence interval and margin for error of 0.042, we see that the claim of 30 cents or higher is quite plausible, because 0.30 lies within 0.282 plus or minus 0.042

    In contrast, if someone were to claim that the mean is 35 cents or higher, the standard error and confidence interval suggest that such a figure is not very likely

    Testing Claims with P-values

    Put briefly, a p-value shows the likelihood of obtaining a sample estimate at least as far from the claimed value as the one observed, by chance alone, if the null hypothesis were true

    Take the claim of a mean of 0.30, tested here (using SPSS software) given the sample mean of 0.282 and s.e. of 0.021

    The estimated p-value (called Sig. (2-tailed)) is 0.425

    The chance of finding such a value by chance is 42.5 percent

    Typically, we reject the null only if this p-value is below a 5 percent threshold

    Note: our test is really 1-tailed since the claim is a mean of 0.30 or higher. We should cut the p-value in half to 0.2125 (21.25 percent), but this is still well above 0.05

    One-Sample Test (Test Value = .30)

                                                                   95% Confidence Interval of the Difference
                   t       df   Sig. (2-tailed)   Mean Difference   Lower     Upper
    farepermile    -.814   20   .425              -.01714           -.0611    .0268

    Testing Claims with P-values

    Now, test a mean of 0.35 or higher:

    The estimated p-value (called Sig. (2-tailed)) is 0.005

    The chance of finding such a value by chance is 0.5 percent, which is far below the 5 percent threshold even before cutting it in half for a 1-tailed test

    The p-value indicates that there is only a 0.5 percent chance of finding a sample mean as far from 0.35 as our 0.282 if the true mean were 0.35 or higher

    The null hypothesis of a mean of 0.35 or higher is rejected

    One-Sample Test (Test Value = .35)

                                                                   95% Confidence Interval of the Difference
                   t        df   Sig. (2-tailed)   Mean Difference   Lower     Upper
    farepermile    -3.189   20   .005              -.06714           -.1111    -.0232
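    The t-statistics in the two SPSS tables can be approximated by hand. The sketch below is illustrative only: it backs out the sample mean from the reported mean difference (0.30 - 0.01714), uses the rounded standard error of 0.021, and compares each t-value with the table critical value for 20 degrees of freedom. Because of rounding, the results differ slightly from the SPSS values of -.814 and -3.189. The function name `one_sample_t` is a hypothetical helper, not an SPSS or Excel function.

```python
def one_sample_t(sample_mean, test_value, standard_error):
    """t-statistic: how many standard errors the sample mean is from the claim."""
    return (sample_mean - test_value) / standard_error


sample_mean = 0.30 - 0.01714   # implied by the SPSS mean difference
standard_error = 0.021         # rounded value from Table 1
T_CRIT_95_DF20 = 2.086         # two-tailed 5% critical value, 20 df (t-table)

t_for_30 = one_sample_t(sample_mean, 0.30, standard_error)
t_for_35 = one_sample_t(sample_mean, 0.35, standard_error)

for claim, t_value in ((0.30, t_for_30), (0.35, t_for_35)):
    decision = "reject" if abs(t_value) > T_CRIT_95_DF20 else "do not reject"
    print(f"claim mean = {claim:.2f}: t = {t_value:.3f} -> {decision} the null")
```

    The claim of 0.30 is not rejected, while the claim of 0.35 is, matching the p-value conclusions above.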

    Sidebar on Hypothesis Testing

    In the previous slide, the proposition that the population mean equals a stated value was tested using the p-value

    Any time that a p-value appears, a null hypothesis is being tested

    The proposition being examined is called the null hypothesis

    Using p-values from the output of software is the simplest way of testing a hypothesis

    With small data sets, especially with small effects being tested, a p-value may not be below 0.05. This does not mean that the null hypothesis is true

    It may indicate that the test lacks Power to reject a false null (due to lack of data); see Statsoft textbook under xxxxxxx for further information

    Sidebar on Hypothesis Testing

    In addition to p-values, t-statistics and confidence intervals (all derived from standard errors) can also test a hypothesis

    As a rule-of-thumb, t-values greater than 2 in absolute value correspond roughly to p-values below 0.05
    Self Test Section I

    The self test uses a data set on 5K running times; the raw data appear on the next slide; the variables are

    Time = 5k time in minutes (decimals are fractions of minutes)

    Age = age in years

    Intervals = 1 if hard interval workouts were used and 0 if not

    Miles Per Week = number of miles per week in training at peak of training

  • 8/21/2019 MA Statistics Tutorial


    Self-Test for Section I

    1. The measure that provides the middle or 50th percentile observation is

    a. 19.30

    b. 19.50

    c. 0.800

    d. 19.00

    2. The statistic that indicates how spread out the individual 5k times are from the average time is

    a. 3.250

    b. 0.160

    c. 0.800

    d. 0.192

    3. Based on the data, you can say that the times are

    a. Nearly symmetric

    b. Highly skewed to the right

    c. Highly skewed to the left

    d. Not enough information

    Self-Test for Section I

    4. The likely sampling error for the mean is

    a. 0.160

    b. 0.192

    c. 0.800

    d. 3.250

    5. The 95% confidence interval for the mean is computed by

    a. Multiplying the standard error by about 2.0

    b. Multiplying the standard deviation by 95%

    c. Dividing the range by about 10

    d. Dividing the mean by the sample size

    6. The value for Age for the second observation is

    a. 42

    b. 21

    c. 22

    d. 44

    Self-Test for Section I

    7. In the output, a test of the mean is provided. The null hypothesis being tested is

    a. That the population mean equals 19.3

    b. That the population mean equals -4.373

    c. That the population mean equals -0.700

    d. That the population mean equals 20

    8. The results in the table provide a 2-tailed test. To compute a 1-tailed test, you would

    a. Double the p-value

    b. Divide the t-statistic by two

    c. Divide the p-value by two

    d. Double the size of the confidence interval

    9. Which of the following indicates that the null hypothesis should be rejected?

    a. t = -4.373

    b. p-value (Sig. 2-tailed) = 0.000

    c. Both a and b

    d. Neither a nor b

    One-Sample Test (Test Value = 20)

                                                            95% Confidence Interval of the Difference
            t        df   Sig. (2-tailed)   Mean Difference   Lower      Upper
    Time    -4.373   24   .000              -.70000           -1.0304    -.3696

    Correct Answers to Self-Test Section I

    1. A

    2. C

    3. A (the skewness statistic is very small, 0.192, indicating only a

    slight amount of positive skew; 0 would be perfectly symmetric;

    above or below 1.0/-1.0 would indicate substantial asymmetry)

    4. A

    5. A

    6. C (go back to the original data sheet for this)

    7. D (this is indicated by the Test Value = 20 in the SPSS output)

    8. C (the test provided is 2-tailed because it tests whether the mean equals 20 or not; 1-tailed would test whether it was 20 or more)

    9. C (the p-value is less than the typical 0.05 threshold for rejecting the null hypothesis; the t-value's absolute value is greater than 2.0)

    Section II

    Regression Analysis

    Relationships Between Variables

    In economics, investigators are frequently interested in how one variable interacts with another; Example: sales and income

    Often, one of the variables causes changes in the other, such as higher incomes causing more sales.

    The causal variable is referred to as the X, Independent, or Explanatory Variable

    The responding variable is referred to as the Y or Dependent Variable

    Sometimes the relationship is not causal but merely one of association because of links

    to a third variable

    Example: SAT & ACT scores, which are both caused by academic ability and achievement

    The statistical technique most frequently used to examine relationships between variables is Regression Analysis or some technique that is very similar to regression

    Regression analysis can be used for all kinds of data and relationships including

    Linear relationships and Curved relationships

    Quantitative data and Qualitative data

    Cross-sectional and time series data

    The following slides present the simplest form of Regression Analysis

    A quantitative dependent variable (Air Fare) and one quantitative, independent

    variable (Distance)

    The relationship is treated as linear

    Scatterplot for Fare & Distance

    The Scatterplot presented in Figure 2 depicts the 21 Fare (Y-axis) and Distance (X-axis) combinations in the data set

    The graph shows that as distance increases, fare also tends to increase, but that the relationship is not perfect; otherwise, the points would lie on a straight line

    Fig. 2 -- Scatterplot of Fare (Y) and Distance (X)



    Regression from a Visual Standpoint

    Figure 3. Scatterplot and Regression Plot for Fare-Distance (X-axis: Distance)


    Figure 3 adds another element to the plot: a straight line of points (a line connecting the pink points)

    These points represent the regression line that Excel chose as the straight line that best fit the scatterplot points

    Software chooses the line to minimize the sum of the squared vertical distances between the blue points and the pink line; this method is called the Least Squares or Ordinary Least Squares (OLS) method and is widely used
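    To make the least squares idea concrete, the following sketch fits a line with the standard closed-form formulas for one X-variable. The `distance` and `fare` lists are hypothetical (they are not the tutorial's 21 cities), and `fit_line` and `sum_squared_errors` are illustrative helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one X: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: distance in miles, fare in dollars.
distance = [200, 600, 900, 1500, 2100]
fare = [170, 250, 220, 290, 330]

intercept, slope = fit_line(distance, fare)
best = sum_squared_errors(distance, fare, intercept, slope)
# Any other line, such as one with a shifted intercept, does worse.
shifted = sum_squared_errors(distance, fare, intercept + 10, slope)

print(f"Fare = {intercept:.1f} + {slope:.3f} * Distance")
print(f"squared error: fitted line {best:.1f}, shifted line {shifted:.1f}")
```

    Comparing the two squared-error totals shows what "best fit" means: no other straight line gives a smaller sum.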

    Fare-Distance Regression as Tabular Output

    Regression in Table Form

    Table R1 presents the same regression results as Figure 3

    The Regression Statistics and ANOVA parts of the table evaluate the overall performance of the regression in predicting Fare to different cities

    The bottom part with Coefficients for Intercept and Distance presents the regression line as numbers that can be put into an equation along with estimates of sampling error

    The following slides break down the different parts of the table

    Regression output always implies an equation written generally as

    Y = b0 + b1*X

    b0 = y-intercept

    b1 = slope (change in Y over change in X)

    b0 and b1 are referred to as regression coefficients: the intercept coefficient and the slope coefficient

    The pink line in Figure 3 can be written down as an equation

    Recall the slope-intercept form of a line (y = mx + b) from basic algebra; if you draw a line through the pink points in Figure 3 and extend it to where Distance (X) = 0, the intercept should be obvious

    The equation for this line is

    Fare = 157 + 0.084 * Distance + Error

    where 157 is the intercept and 0.084 is the slope, taken from the coefficients in the table:

    Intercept 157.614

    Distance 0.084

    Slope & Intercept Meaning

    The slope indicates that for every additional mile of Distance, the Fare increases by 0.084 dollars (or about 8 cents).

    The slope produced in regression analysis always shows the amount of increase in Y (or decrease if negative) for a 1 unit increase in X

    To correctly interpret the slope for a regression, it is critical to know the units in which X and Y are measured; here, the units are miles and dollars

    A 100 mile increase implies an $8.40 (100 x 0.084) increase in Fare

    The y-intercept indicates that if distance were 0, the fare would be about $157

    The intercept in this case is not an economically meaningful number because there are no flights of 0 miles

    The intercept merely extends the line to the Y-axis (where X = 0) for statistical purposes

    Be aware of the relevant range (min, max) of the X-variable



    Regression Line Errors (Residuals)

    Using the regression equation, Y-values for given X-values can be calculated

    Predicted Y = intercept + slope*(X-value)

    Example: Observation 1 is Dallas, with a distance of 600 miles:

    Predicted Fare = 157.6 + 0.084*(600) = 208

    (Excel's prediction, which uses unrounded coefficients, is 208.310)

    The regression Error (residual) = Actual Y value - Predicted Y value

    For Dallas (observation 1), the actual fare was $250, so we calculate

    Residual = 250 - 208.310 = 41.690

    Each observation has a predicted fare and error associated with it
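    The Dallas calculation can be reproduced with the rounded coefficients from the table. In this sketch, `predict_fare` is an illustrative helper; because the slope is rounded to 0.084, the result differs slightly from Excel's 208.310 and 41.690.

```python
INTERCEPT = 157.614   # from the coefficients table
SLOPE = 0.084         # dollars per mile, rounded as shown in the table


def predict_fare(distance_miles):
    """Predicted fare in dollars from the fitted Fare-Distance line."""
    return INTERCEPT + SLOPE * distance_miles


actual_fare = 250.0                    # Dallas, observation 1
predicted = predict_fare(600)          # Dallas is 600 miles from Nashville
residual = actual_fare - predicted     # actual minus predicted

print(f"predicted = {predicted:.3f}, residual = {residual:.3f}")
```

    A positive residual means the actual fare is above the regression line; a negative one means it is below.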

    R Square reports the percent of the variation in the Y-variable explained by the X-variable(s)

    In other words, it expresses (as a percent) how close the regression line points come to predicting the actual scatterplot points

    The maximum R-square is 1.0 (100%) and the minimum is 0.

    In this case, Distance, by itself, can account for 48.6% of the Fare differences between cities

    In a 2-variable regression like this one, the Multiple R is the same thing, in magnitude, as the Correlation Coefficient between X and Y.

    The R-square is the squared correlation coefficient in such cases.

    The correlation coefficient ranges from -1.0 to 1.0 (perfectly correlated at either extreme), with 0 indicating no linear relationship

    It can take on positive or negative values depending on the direction of the relationship between the two variables

    Multiple R 0.697

    R Square 0.486

    Adjusted R Square 0.459

    Standard Error 43.294

    Observations 21.000
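    The link between the correlation coefficient and R Square in a one-X regression can be checked numerically. The data below are hypothetical, and `correlation` and `r_squared` are illustrative helpers written from the standard formulas; the last line simply squares the Multiple R reported in the table.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient between two lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs, ys):
    """R Square of the least squares line: 1 - residual SS / total SS."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    total_ss = sum((y - mean_y) ** 2 for y in ys)
    return 1 - residual_ss / total_ss


# Hypothetical data: distance in miles, fare in dollars.
distance = [200, 600, 900, 1500, 2100]
fare = [170, 250, 220, 290, 330]

r = correlation(distance, fare)
print(f"r = {r:.3f}, r squared = {r * r:.3f}, R Square = {r_squared(distance, fare):.3f}")
print(f"tutorial check: 0.697 squared = {0.697 ** 2:.3f}")
```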

    Regression Coefficient Accuracy

    Just like the sample mean, the regression coefficients are sample statistics

    that are usually used to estimate what the true relationship would be if all

    possible data were used

    Regression coefficients, therefore, also have standard errors that estimate their sampling error

    The slope coefficient for distance (0.08) has a standard error of 0.02

    This implies that the population parameter (regression coefficient using all possible data) may easily be 2 cents higher or lower than the 0.08 coefficient estimated by this sample

    For a wider (approximately 95%) margin for error, this standard error can be multiplied by about 2.0



    More on Regression Coefficient Accuracy

    The t-stat and p-value are also ways of assessing the reliability of the coefficient estimates

    They test whether the coefficient is significantly different from zero

    As a rule of thumb, if the t-statistic is greater than 2.0 or less than -2.0, the coefficient is viewed as significantly different from zero

    The t-Stat on Distance is 4.239, so it is statistically significant

    The p-value estimates the likelihood of finding the coefficient of 0.084

    by mere chance if the true value were zero

    The p-value of 0.000 indicates that this would be very unlikely, also

    showing a statistically significant result

    In scientific research, p-values below 5 percent (0.05) are taken as

    statistically significant

    In other settings, the cutoff level for the p-value may vary

    Expanded Regression Analysis

    In most situations in economics, investigators look at the effects of multiple variables on a dependent variable when using regression analysis

    Example: price and income effects on sales

    Such regressions are sometimes called multiple regression analysis and

    involve only slight modifications of the earlier points

    Also, economists widely use qualitative variables as independent variables.

    When these take on only two values (male, female) they are usually coded as

    (1,0) and called binary or dummy variables

    In the Air Travel data, we have such a variable, Direct SWA, that indicates whether Southwest Airlines flies this route directly (1) or not (0). This variable is added to the regression analysis, resulting in the following Excel output
    Fare Regression with Distance and Direct SWA



    The regression equation is now

    Fare = 193 + 0.08*Distance - 66*Direct SWA + Residual

    The slope coefficient for Distance is still about 0.08

    The y-intercept coefficient was 157; it is now 193

    The Direct SWA variable has these effects:

    When SWA = 0 (when SWA does not fly that route), the regression equation is

    Fare = 193 + 0.081*Distance, because -66*(0) = 0

    When SWA = 1 (when SWA flies the route), the regression equation is

    Fare = 193 + 0.081*Distance - 66*(1) = 127 + 0.081*Distance

    Note that the SWA dummy variable only influences the y-intercept

    The SWA variable does not influence the slope for distance (see next slide)

                  Coefficients   Standard Error   t Stat    P-value
    Intercept     193.032        14.411           13.395    0.000
    Distance      0.081          0.012            6.698     0.000
    Direct SWA    -66.779        11.446           -5.834    0.000
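    The effect of the dummy variable can be seen by evaluating the equation with Direct SWA set to 0 and to 1. This sketch uses the coefficients from the table above; `predict_fare` is an illustrative helper and the 1,000-mile route is a made-up example.

```python
INTERCEPT = 193.032
DISTANCE_SLOPE = 0.081      # dollars per mile
DIRECT_SWA_COEF = -66.779   # shift in dollars when Southwest flies the route


def predict_fare(distance_miles, direct_swa):
    """Predicted fare; direct_swa is 1 if Southwest flies the route, else 0."""
    return INTERCEPT + DISTANCE_SLOPE * distance_miles + DIRECT_SWA_COEF * direct_swa


without_swa = predict_fare(1000, 0)   # hypothetical 1,000-mile route
with_swa = predict_fare(1000, 1)

print(f"no direct SWA: {without_swa:.2f}, direct SWA: {with_swa:.2f}")
print(f"gap at any distance: {without_swa - with_swa:.3f}")
print(f"intercept when SWA = 1: {predict_fare(0, 1):.3f}")
```

    The gap between the two predictions is the same at every distance, which is why the dummy variable shifts the intercept but leaves the slope unchanged.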



    Distance Line Fit Plot



    The line connecting the upper pink dots

    shows the regression line when

    SWA= 0

    The line connecting the lower pink dots shows the regression line when SWA = 1

    The Fare-Distance slope for both lines

    is 0.08

    Table R2 Regression with Multiple X-



    Another important difference that results from adding the SWA variable is the

    increase in the R-Square value

    It is now 82.2% (it was about 48.6% when using only Distance)

    The combination of Distance and Direct SWA accounts for 82.2% of the differences in Fares across cities.

    Adding SWA increased this value by about 34 percentage points

    Table R2. Regression with Multiple X-

    Regression Statistics

    Multiple R 0.907

    R Square 0.822

    Adjusted R Square 0.802

    Standard Error 26.161

    Observations 21.000

    From the regression predictions and errors, Excel (and other software) computes an Analysis of Variance or ANOVA

    The F-Statistic is the most important number here; it is the ratio of the mean regression sum of squares to the mean residual sum of squares

    Unlike the R-Square value, the F-statistic adjusts for the number of variables used

    The Significance F is simply a p-value testing the null hypothesis that the X-variables, as a group, have no effect (all slope coefficients equal zero); with these data, this null hypothesis is rejected because the p-value is very low

    In effect, the F-statistic tests whether the X-variables, as a group, matter in explaining the Y-variable

    The SS in the ANOVA table refers to Sum of Squares.

    The Residual SS simply squares the individual errors and adds them up.

    MS refers to mean sum of squares, which divides each SS by its degrees of freedom (df): the number of X-variables for the Regression row, and the number of observations minus the number of X-variables minus one for the Residual row.

    The Regression (predicted) sum of squares squares the differences between the predicted values for Fare and the mean Fare and then adds them up

    The Total sum of squares adds the Regression and Residual sums of squares together

    The R-Square is simply the regression sum of squares divided by the total

    The Adjusted R-squared, like the F-statistic, adjusts for the number of variables used


                  df       SS          MS          F        Significance F
    Regression    2.000    56977.083   28488.541   41.627   0.000
    Residual      18.000   12318.727   684.374
    Total         20.000   69295.810
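    The numbers in the ANOVA table and the Regression Statistics table are linked by simple arithmetic. The sketch below starts from the two sums of squares reported by Excel and recomputes the remaining figures; the variable names are illustrative.

```python
import math

n_obs = 21                 # observations
n_x = 2                    # X-variables: Distance and Direct SWA
regression_ss = 56977.083  # from the ANOVA table
residual_ss = 12318.727    # from the ANOVA table

total_ss = regression_ss + residual_ss
regression_df = n_x
residual_df = n_obs - n_x - 1

regression_ms = regression_ss / regression_df   # mean sum of squares
residual_ms = residual_ss / residual_df
f_stat = regression_ms / residual_ms

r_square = regression_ss / total_ss
adjusted_r_square = 1 - (1 - r_square) * (n_obs - 1) / residual_df
standard_error = math.sqrt(residual_ms)   # "Standard Error" of the regression

print(f"F = {f_stat:.3f}")
print(f"R Square = {r_square:.3f}, Adjusted R Square = {adjusted_r_square:.3f}")
print(f"Standard Error = {standard_error:.3f}")
```

    These reproduce the F of 41.627, the R Square of 0.822, the Adjusted R Square of 0.802, and the Standard Error of 26.161 shown in Table R2.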

    Regression Pointers

    Regressions that are well done have residuals that have no obvious patterns and are roughly bell shaped; checking the residuals for these and other characteristics is called Residual Analysis

    Regressions that leave out key explanatory (X) variables can yield misleading slopes; this is called the Omitted Variables Bias

    Regressions leaving out key variables should be viewed as exploratory or preliminary in nature

    There is no magical R-squared value to be obtained; if a model is put together well, then a low R-squared is fine; if a model has key flaws in it, then a high R-Squared value does not make it good

    Only humans can determine if a regression is causal (Income-Sales) or merely associative (SAT-ACT); the software treats both cases the same

    Self Test Section II

    The self test again uses a data set on 5K running times shown on the next slide

    Time = 5k time in minutes (decimals are fractions of minutes)

    Age = age in years

    Intervals = 1 if hard interval workouts were used and 0 if not

    Miles Per Week = number of miles per week in training at peak of training



    For These Questions, Refer to this Output

    1. The regression equation depicted by the table is

    a. 5k Time = 0.731 + Age + Intervals + Residual

    b. 5k Time = 17.554 + Age*Intervals + Residual

    c. 5k Time = 17.554 + 0.071*(-0.863)*Age*Intervals + Residual

    d. 5k Time = 17.554 + 0.071*Age - 0.863*Intervals + Residual

    2. The percent of 5k time differences accounted for by Age and Intervals in the regression model is

    a. 0.731

    b. 17.554

    c. 12.660

    d. 0.535

    3. The slope coefficient for Age is

    a. 0.071

    b. 0.731

    c. 17.554

    d. 0.016


    4. The likely sampling error in the slope coefficient for Age is

    a. 0.071

    b. 0.731

    c. 17.554

    d. 0.016

    5. The slope coefficient for Age implies that

    a. For each 1 minute increase in Time, Age increases by 0.071 years

    b. For each 1 year increase in Age, Time increases by 1 minute

    c. For each 1 year increase in Age, Time increases by 0.071 minutes

    d. For each 1 year increase in Time, Age increases by about 53%

    6. The regression results imply that if Age were 0, then Time would be

    a. 0.731

    b. 12.660

    c. 24.000

    d. 17.554


    7. The value in the preceding question

    a. Means that a newborn baby would be predicted to run this time in a 5k

    b. Means that the value is really only a hypothetical extension of the regression line because none of the actual data go back to zero years of Age

    c. Means that the regression is not reliable at any values

    d. Means that babies should compete in the Olympics

    8. The coefficient for Intervals implies that

    a. When interval equals 1, the Age slope is reduced by 0.863

    b. When interval equals 0, the y-intercept value is reduced by 0.863 minutes

    c. When interval equals 1, the Age slope is the same but the entire regression line shifts down by 0.863 minutes

    d. When interval equals 0, the Age slope is the same but the entire regression line shifts down by 0.863 minutes

    9. If you wanted to compute the effects of 10 more years of Age on the predicted 5k Time, you should multiply

    a. 0.10 x 0.071

    b. 10 x 0.071

    c. 10 x 1.0

    d. 100 x 0.071


    10. The predicted value for 5k Time when a person is 47 and using intervals in

    training would be found by which of the following equations?

    a. Predicted 5k Time = 17.554 + 0.071*(47)

    b. Predicted 5k Time = 0.731 + 0.072*(47) - 0.863*(1)

    c. Predicted 5k Time = 17.554 + 0.071*(47) - 0.863*(1)

    d. Predicted 5k Time = 0.071*(47) - 0.863*(1)

    11. Using the data sheet provided earlier, compute the residual for the first

    observation. (Note: you will first have to compute the predicted time)

    a. -0.545

    b. 0.631

    c. -.034

    d. 1.232

    12. The data provided on the accuracy of the coefficients indicates that

    a. All are not significantly different from zero

    b. Age is significantly different from zero but not Intervals

    c. Intervals is significantly different from zero but not Age

    d. All are significantly different from zero

    Correct Answers Section II Self Test

    1. D

    2. D

    3. A

    4. D

    5. C

    6. D

    7. B

    8. C

    9. B (the slope for a 1 unit (year) change in Age is 0.071; a 10 year change is simply 10 x slope)

    10. C (the prediction uses the intercept, the Age slope, and the Intervals coefficient)

    11. A (Predicted Time = 17.554 + 0.071*(21) - 0.863*(0) = 19.045; Residual = Actual - Predicted = 18.50 - 19.045 = -0.545)

    12. D (All of the p-values for the coefficients are below the 0.05 threshold for significance; all of the t-statistics are above 2.0 in absolute value, the rule-of-thumb value for significance)

    Section III

    Statistical Software

    Personal computers and software make it possible for almost anyone to complete the complicated or lengthy computations needed for statistics; knowing what to do with them is the hard part

    Excel contains many useful statistical and graphing capabilities; these are introduced in the next few slides

    Software dedicated to statistical operations vastly expands the breadth of procedures possible and makes some much easier than in Excel. Some commonly used statistical software includes

    SAS: the company offers many varieties; JMP is a point-click product; SAS is available in some places at WKU

    SPSS: this software is available in most computer labs on campus; it is not as widely used by economists as SAS but contains most of the same features, especially for basic purposes

    Stata is widely used by economists and contains broad and very powerful tools; Eviews is also very powerful and especially useful for time series and forecasting applications; both provide point-click functionality
    Excel Stat Introduction 1

    Making Application

    While there is no self-test with this section, you are strongly encouraged to practice in Excel; even if you use other software in later classes, the practice in Excel will be helpful

    One of the main differences between Excel and the spreadsheets in statistical software is that Excel is address driven (each cell has an address), whereas the stat software is variable driven; once a column of data exists for a variable, the entire column can be manipulated simply by referring to the name

    Excel Stat Introduction 2

    Click the Tools menu in Excel; if Data Analysis appears as an option you may skip to the next slide; if not, then

    Select the Add-Ins option under the Tools menu

    Check the box for Analysis ToolPak

    The Data Analysis option should now appear under the Tools menu

    (Note: If you opened Excel from your desktop, the procedures above

    should work; if you happened to open Excel by opening an Excel-

    based spreadsheet while browsing on the internet, it may not work)

    Excel Stat Introduction 3

    Take one of the data sheets, Air Travel or 5k Times, used in this tutorial and enter the data into Excel. The instructions here proceed using the Air Travel data.

    To compute descriptive statistics for a variable

    Select the Tools menu

    Select the Data Analysis option

    Select the Descriptive Statistics option

    Click on the icon next to the blank for Input Range

    Highlight the column for Fare including the label

    Check the Labels in the First Row box

    Check the Summary Statistics box

    Check the Confidence Interval for the Mean box

    Click the OK button

    Excel Stat Introduction 4

    You should now have an output table on a new sheet

    One disadvantage of Excel is that statistical output tables like this one tend to be collapsed or condensed and need to be formatted

    Formatting the output table (this is something you should always do in Excel)

    Highlight the columns with the table

    Select the Format menu

    Select the Column and AutoFit Selection options

    Again, select the Format menu

    Select the Cells option

    In the Number menu, choose the Number option

    Pick a number for the Decimal Places box (the number of decimal places depends somewhat on the data; 3 will be fine here)

    Make sure to do this step in Excel; tables with a lot of insignificant decimal

    places are very messy to read

    Excel Stat Introduction 5

    Return to the original data sheet

    Create a regression analysis:

    Select the Tools menu and the Data Analysis option

    Select the Regression Analysis option in the window

    Select the icon next to the Input Y Range blank and highlight the data containing Fare including the label

    Select the icon next to the Input X Range and highlight the data containing Distance and Direct SWA including the labels

    (Note: if you try to highlight the whole columns you may get an error)

    Check the Labels box, the Residuals box, and the Line Fit box

    Select the OK button and reformat the output tables as before

    You will also need to reformat the Line Fit plot (another small hassle in Excel); just expand it using the mouse

    Excel Stat Introduction 6

    Return to the original data sheet

    Charts in Excel

    Excel can also be used to create scatterplots, histograms, and other types of plots

    This is an area where statistical software is much easier to use

    If you want to tinker some, click on the Chart Wizard icon that should appear below the top level menus

    The icon has the appearance of a bar chart

    Also, under the Data menu, there is a Pivot Table and Pivot Chart

    option that provides further capabilities

    If you would like a hands-on introduction to other statistical software,

    please contact Brian Goff. Also, several other economics professors can provide assistance in becoming

    acquainted with software.
    Probability Distributions

    A final topic briefly introduced here is that of probability distributions (PDs)

    A PD is a formula (often presented as a graphic or table) that links values of a variable with the probability of those values

    PDs are used in many ways; for statistics, one of the key uses is to assess hypotheses, including the use of t-statistics and p-values

    Statistical software makes an extensive knowledge of PDs unnecessary because the relevant information about the PD is stored by the computer and used as needed; however, a few basic points are worthwhile even for basic statistics users

    Probability Distributions 2

    PDs have a center, dispersion, and symmetry or skew (asymmetry)

    measures of location of center include mean & median

    measures of dispersion include the standard deviation and range

    PDs have tails (the ends), measured by the amount of kurtosis

    Normal (Probability) Distribution

    Most widely known due to its bell shape

    Many real life situations are approximately (though not perfectly) distributed Normal

    The mother of PDs in that many other distributions are related to it or converge to it with large samples or other conditions

    t-Distribution

    Also bell-shaped

    Is wider in its tails than the normal but converges to it with large samples

    Binomial Distribution: deals with 2 outcome situations

    F-Distribution, Chi-Square Distribution: commonly used distributions when the topic is variability

    Excel permits PDs to be used directly if desired


    Click on the function icon (the script f) just below the top menus

    Select Statistical in the window and scroll to the desired distribution such as

    NORMDIST for normal

    We can now produce probabilities for a variable assumed to be normal or near normal

    Example: Let's assume that male height is approximately Normal with a mean of 70 inches and a standard deviation of 2 inches. What is the probability of finding someone taller than 74 inches?

    In the NORMDIST window, plug in 74 for X, 70 for Mean, and 2 for Standard Deviation

    In the Cumulative box, put True.

    Excel will produce a number that is the probability of being 74 or less (that is, the

    cumulative probability)

    This number is 0.977

    The probability of being taller than 74 is 1-.977 = .023 or 2.3%

    The same or similar procedures can be used for 2 outcome (binomial) problems

    or many others and opens up a wide array of uses
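    The height example above can also be checked outside Excel. This sketch uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later) with the mean of 70 inches and standard deviation of 2 inches assumed in the example; the variable names are illustrative.

```python
from statistics import NormalDist

# Male height assumed approximately Normal: mean 70 inches, sd 2 inches.
height = NormalDist(mu=70, sigma=2)

p_at_most_74 = height.cdf(74)      # cumulative probability, like NORMDIST(..., TRUE)
p_taller_than_74 = 1 - p_at_most_74

print(f"P(height <= 74) = {p_at_most_74:.3f}")
print(f"P(height > 74)  = {p_taller_than_74:.3f}")
```

    A height of 74 inches is two standard deviations above the mean, which is why only about 2.3% of the distribution lies beyond it.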

    Clockwise from Left Corner:


    Normal, t-, F-, and Chi-Square Distributions

    A gallery of PDs and more background is

    offered at the Engineering Statistics


