correlation and percentages

• association between variables can be explored using counts
  – are high counts of bone needles associated with high counts of end scrapers?

• similar questions can be asked using percent-standardized data
  – are high proportions of decorated pottery associated with high proportions of copper bells?

but…

• these are different questions with different implications for formal regression

• percents will show some correlation even if underlying counts do not…
  – ‘spurious’ correlation (negative)
  – “closed-sum” effect

case  C_v1  C_v2  C_v3  C_v4  C_v5  C_v6  C_v7  C_v8  C_v9  C_v10
   1    15    14    94    59    76    13     8    97    10     95
   2    35     1    89    95    23    77    14     9    27     43
   3    20    96    73    31    90    65    74    60    85     27
   4    23    59     7    52    33    83    71    35    57     90
   5    36    90    86    15    97    54    52    41    34      3
   6    79     2    26     5    11    68    74    44    13     87
   7    40    99    28    66    77    23    69    22    63     36
   8    95    36    22    75    21    48    95    58    74     68
   9    27     0    58    99    32    30     5     5   100     75
  10    67    93    98    61    62    94     3    16    43     48

[Bracket labels on the table: 10 vars., 5 vars., 3 vars., 2 vars.]

matrix(round(rnorm(100, 50, 15)), nrow=10)
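The closed-sum effect can be checked by simulation. The sketch below is illustrative only: it uses simulated uniform counts (not the table above), 200 cases instead of 10 so the averages are stable, and a hypothetical helper `mean_r`. It converts the first k columns to row percentages and reports the mean pairwise correlation.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# 200 cases x 10 independent "count" variables (simulated, not the slide data)
counts <- matrix(sample(1:100, 200 * 10, replace = TRUE), nrow = 200)

# Mean pairwise correlation after percent-standardizing the first k columns
mean_r <- function(m, k) {
  sub <- m[, 1:k, drop = FALSE]
  pct <- 100 * sub / rowSums(sub)   # each row now sums to 100
  r   <- cor(pct)
  mean(r[upper.tri(r)])
}

raw_r    <- cor(counts)
mean_raw <- mean(raw_r[upper.tri(raw_r)])   # independent counts: close to 0

closed <- sapply(c(10, 5, 3, 2), function(k) mean_r(counts, k))
names(closed) <- c("10 vars", "5 vars", "3 vars", "2 vars")

print(round(c(raw = mean_raw, closed), 2))
```

With only two variables the percentages are exact complements, so r is -1; with more variables the induced negative correlation weakens but does not vanish.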

[Figure: distributions of r (axis from -1.0 to 1.0) for original counts and for %s based on 10, 5, 3, and 2 vars.]

[Figure: distributions of C_V1 (original counts), P10_V1 (%s, 10 vars.), T5_V1 (%s, 5 vars.), T3_V1 (%s, 3 vars.), and T2_V1 (%s, 2 vars.)]

outliers

• including outliers in regression analyses is usually a bad idea…

• Tukey-line / least squares discrepancies are good red-flag signals
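A minimal sketch of the red-flag check, using simulated data with one planted outlier (all values here are hypothetical). It compares the least squares slope from `lm()` with the resistant line from `stats::line()`.

```r
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)
y[20] <- 40                      # one gross outlier (the trend predicts about 12)

ls_fit    <- lm(y ~ x)           # least squares line
tukey_fit <- line(x, y)          # Tukey's resistant line

slopes <- c(least_squares = unname(coef(ls_fit)[2]),
            tukey         = unname(coef(tukey_fit)[2]))
print(round(slopes, 2))

plot(x, y)
abline(ls_fit, lty = 2)              # dashed: least squares
abline(coef = coef(tukey_fit))       # solid: Tukey line
```

A large gap between the two slopes suggests that a few points are driving the least squares fit.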

[Figure: scatterplots with soMort$SO2 and log(soMort$SO2) on the x-axis; the logged plot is labeled “convex hull trimming”]

“convex hull trimming”

> hull1 <- chull(x, y)                 # indices of the points on the convex hull
> plot(x, y)
> polygon(x[hull1], y[hull1])          # draw the hull
> abline(lm(y[-hull1] ~ x[-hull1]))    # fit without the hull points
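For a self-contained illustration, the same steps can be run on simulated data with one planted outlier (the data and the true slope of 2 are assumptions made for this sketch):

```r
set.seed(2)
x <- runif(50, 0, 5)
y <- 1 + 2 * x + rnorm(50)
x <- c(x, 4.8)
y <- c(y, -15)                         # planted outlier, hypothetical

hull <- chull(x, y)                    # indices of points on the convex hull

full_fit    <- lm(y ~ x)               # all points
trimmed_fit <- lm(y[-hull] ~ x[-hull]) # hull points removed

print(round(c(full    = unname(coef(full_fit)[2]),
              trimmed = unname(coef(trimmed_fit)[2])), 2))
```

Trimming removes every hull point, not just the outlier, so it also discards some legitimate extreme cases.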

[Figure: plot with log(soMort$SO2) on the x-axis]

transformation

• at least two major motivations in regression analysis:
  – create/improve a linear relationship
  – correct skewed distribution(s)

• ex: density of obsidian vs. distance from the quarry:

[Figures: plots with DIST on the x-axis; “Plot of Residuals against Predicted Values” with ESTIMATE on the x-axis]

LG_DENS <- log(DENSITY)

old.par <- par(no.readonly = TRUE)   # save the current graphics settings
plot(DIST, DENSITY, log="y")         # log-scaled y axis
par(old.par)                         # restore the settings
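To see why logging helps, here is a sketch with simulated fall-off data. The exponential decay, its rate of 0.06, and the noise level are assumptions for illustration, not the obsidian data from the slides.

```r
set.seed(3)
DIST    <- seq(5, 80, by = 5)   # distance from quarry, hypothetical units
DENSITY <- 50 * exp(-0.06 * DIST) * exp(rnorm(length(DIST), sd = 0.2))

raw_fit <- lm(DENSITY ~ DIST)        # straight line through a curved pattern
log_fit <- lm(log(DENSITY) ~ DIST)   # linear after the log transformation

r2 <- c(raw    = summary(raw_fit)$r.squared,
        logged = summary(log_fit)$r.squared)
print(round(r2, 2))
print(round(coef(log_fit), 3))       # slope should be near the simulated -0.06
```

A residual plot for `raw_fit` would show the curved pattern that the transformation removes.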

[Figure: plots with VAR1 (0 to 200) and the transformed VAR1T (0 to 15) on the x-axis]

> VAR1T <- sqrt(VAR1)
> plot(VAR1T, VAR2)

transformation summary

• correcting left skew:
    x^4      stronger
    x^3      strong
    x^2      mild

• correcting right skew:
    sqrt(x)  weak
    log(x)   mild
    -1/x     strong
    -1/x^2   stronger
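The right-skew half of the ladder can be tried on simulated data. This sketch assumes log-normal values and uses a simple moment-based skewness function (`skew` is a hypothetical helper, not a base R function).

```r
# Simple moment-based skewness: positive = right skew, negative = left skew
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

set.seed(4)
x <- rlnorm(500, meanlog = 3, sdlog = 0.6)   # right-skewed, strictly positive

ladder <- c(raw          = skew(x),
            sqrt         = skew(sqrt(x)),
            log          = skew(log(x)),
            neg_recip    = skew(-1 / x),
            neg_recip_sq = skew(-1 / x^2))
print(round(ladder, 2))
```

Each step down the ladder pulls the right tail in further. For these simulated data the log is about right, and the stronger transformations over-correct into left skew.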

“coefficient of determination”

• regression/correlation
  – the strength of a relationship can be assessed by seeing how knowledge of one variable improves the ability to predict the other

• if you ignore x, the best predictor of y will be the mean of all y values (y-bar)
  – if the y measurements are widely scattered, prediction errors will be greater than if they are close together

• we can assess the dispersion of y values around their mean by:

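$\sum (y_i - \bar{y})^2$

• and their dispersion around the regression line by:

$\sum (y_i - \hat{y}_i)^2$

$r^2 = \dfrac{\sum (y_i - \bar{y})^2 - \sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$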

• “coefficient of determination” (r²)

• describes the proportion of variation that is “explained” or accounted for by the regression line…

• r² = .5 → half of the variation is explained by the regression…
  – half of the variation in y is explained by variation in x…
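The coefficient of determination can be computed directly from the two sums of squares. A minimal sketch on simulated data (the coefficients and noise level are illustrative only):

```r
set.seed(5)
x <- runif(40, 0, 100)
y <- 35 + 0.66 * x + rnorm(40, sd = 12)   # illustrative values only

fit <- lm(y ~ x)

ss_tot <- sum((y - mean(y))^2)       # dispersion of y around its mean
ss_res <- sum((y - fitted(fit))^2)   # dispersion of y around the regression line
r2     <- (ss_tot - ss_res) / ss_tot

print(c(by_hand = r2,
        from_lm = summary(fit)$r.squared,
        cor_sq  = cor(x, y)^2))
```

For a regression with one predictor, all three values agree.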

“explaining variance”

multiple regression

residuals

• vertical deviations of points around the regression line
  – for case i, residual = y_i − ŷ_i = y_i − (a + b·x_i)

• residuals in y should not show patterned variation either with x or y-hat

• should be normally distributed around the regression line

• residual error should not be autocorrelated (errors/residuals in y are independent…)

• residuals may show patterning with respect to other variables…

• explore this with a residual scatterplot
  – residuals (y − ŷ) vs. other variables…

• are there suggestions of linear or other kinds of relationships?

• if r² < 1, some of the remaining variation may be explainable with reference to other variables

• paying close attention to outliers in a residual plot may lead to important insights

• e.g.: outlying residuals from quantities of exotic flint ~ distance from quarries
  – sites with special access through transport routes, political alliances…

• residuals from regressions are often the main payoff

Middle Formative, Basin of Mexico

Formative Basin of Mexico

• settlement survey

• 3 variables recorded from sites:
  – site size (proxy for population)
  – amount of arable land in standard “catchment”
  – productivity index for soils

How are these variables related?

Do any make sense as dependent or independent variables?

1. SIZE (ha)

2. AGLAND (km²)

3. PROD (index)

[Figure: plot with AGLAND on the x-axis]

SIZE ~ AGLAND

r² = .75
y = 35.4 + .66x
SIZE (ha) = 35.38 + .66*AGLAND (km²)

residuals??

> resSize <- frmdat$size - (35.4 + .66 * frmdat$agland)

residual SIZE = SIZE – SIZE-hat
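The residual calculation can be sketched on stand-in data. The data frame `sim` below is simulated and hypothetical; it only mimics the three variable names and is not the survey data in `frmdat`.

```r
set.seed(6)
n      <- 40
agland <- runif(n, 10, 100)
prod   <- 0.7 + 0.004 * agland + rnorm(n, sd = 0.08)        # hypothetical
size   <- -2 + 0.4 * agland + 50 * prod + rnorm(n, sd = 3)  # hypothetical
sim    <- data.frame(size, agland, prod)

fit     <- lm(size ~ agland, data = sim)
resSize <- sim$size - fitted(fit)    # SIZE minus SIZE-hat, same as resid(fit)

# Residuals are uncorrelated with the predictor already in the model...
print(round(cor(sim$agland, resSize), 10))
# ...but may still pattern with a variable left out of it
print(round(cor(sim$prod, resSize), 2))

plot(sim$prod, resSize)
abline(h = 0, lty = 2)
```

Using `fitted(fit)` instead of typing rounded coefficients avoids rounding error in the residuals.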

[Figure: plot with frmdat$prod on the x-axis]

PROD & SIZE

[Figure: plot with PROD on the x-axis]

r² = .69
SIZE = -29 + 98 * PROD

[Figure: plots with AGLAND (r² = .75) and PROD (r² = .69) on the x-axis]

What have we “explained” about site size??

[Figure: scatterplot matrix including agland; a plot with AGLAND on the x-axis is labeled r² = .55]

multiple regression…

1 = total variance observed in the dependent variable (x0)

$r_{01}^2$ = variance in x0 explained by x1, by itself…

$1 - r_{01}^2$ = variance in x0 unexplained by x1…

$r_{02}^2$ = variance in x0 explained by x2, by itself…

$1 - r_{02}^2$ = variance in x0 unexplained by x2…

partial correlation coefficient:

$r_{01.2} = \dfrac{r_{01} - r_{02}\,r_{12}}{\sqrt{(1 - r_{02}^2)(1 - r_{12}^2)}}$

$r_{01.2}^2$ = proportion of the variance in x0 left unexplained by x2 that is explained by x1…

$r_{01.2}^2\,(1 - r_{02}^2)$ = total variance in x0 explained by x1, that is not explained by x2…

multiple coefficient of determination:

$R_{0.12}^2 = r_{02}^2 + r_{01.2}^2\,(1 - r_{02}^2)$

variance in x0 explained by x1 and x2, both separately, and together…
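As a worked check, the r² values quoted on the slides can be combined using the standard partial correlation formula. This sketch assumes that the r² of .55 refers to AGLAND vs. PROD and that all three correlations are positive; the variable names are hypothetical.

```r
r2_01 <- 0.75   # SIZE ~ AGLAND (x0 ~ x1), from the slides
r2_02 <- 0.69   # SIZE ~ PROD   (x0 ~ x2), from the slides
r2_12 <- 0.55   # AGLAND ~ PROD (x1 ~ x2), assumed reading of the slides

# Assumption: all correlations are positive, so take the positive roots
r01 <- sqrt(r2_01)
r02 <- sqrt(r2_02)
r12 <- sqrt(r2_12)

# Partial correlation of x0 and x1, controlling for x2
r01_2 <- (r01 - r02 * r12) / sqrt((1 - r02^2) * (1 - r12^2))

# Multiple coefficient of determination
R2 <- r2_02 + r01_2^2 * (1 - r2_02)

print(round(c(partial_r = r01_2, partial_r2 = r01_2^2, R2 = R2), 2))
```

Under these assumptions R² rounds to .83. Adding AGLAND to PROD accounts for roughly another 14% of the variance in site size, well below the 75% that AGLAND explains by itself.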

[Figure: diagram relating SITE-SIZE to AGLAND (agricultural land) and PROD (productivity)]

SIZE = -1.8 + .42*AGLAND + 50*PROD

y = -1.8 + .42x1 + 50x2

size = -1.8 + .42*agland + 50*prod

• various scales are involved:
    size    hectares
    agland  km²
    prod    productivity index

• increasing available agricultural land by 1 km² increases site-size by about .4 hectares

• a 1-unit increase of soil productivity increases site-size by about 50 hectares

• which of these two factors is more important??

• calculate “beta” coefficients to eliminate the effect of differing scales…

• convert the variables to Z-scores
  – mean of 0
  – standard deviation of 1

• repeat the multiple regression analysis…

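# within() keeps the new columns in frmdat; with() would discard them
frmdat <- within(frmdat, {
  Bsize   <- (size - mean(size)) / sd(size)
  Bagland <- (agland - mean(agland)) / sd(agland)
  Bprod   <- (prod - mean(prod)) / sd(prod)
})

lmBeta <- lm(Bsize ~ Bagland + Bprod, data = frmdat)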

size = .55*agland + .43*prod

[Annotations on the regression output: the intercept should be zero…; the fit (r²) doesn’t change…]
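The effect of standardizing can be sketched on stand-in data. The data frame `sim` is simulated and hypothetical, not the survey data. Here `scale()` is used as a shortcut for the Z-score arithmetic.

```r
set.seed(7)
n      <- 40
agland <- runif(n, 10, 100)
prod   <- 0.7 + 0.004 * agland + rnorm(n, sd = 0.08)        # hypothetical
size   <- -2 + 0.4 * agland + 50 * prod + rnorm(n, sd = 5)  # hypothetical
sim    <- data.frame(size, agland, prod)

raw_fit  <- lm(size ~ agland + prod, data = sim)

z        <- as.data.frame(scale(sim))   # Z-scores: mean 0, sd 1 in every column
beta_fit <- lm(size ~ agland + prod, data = z)

# Beta coefficients are the raw slopes rescaled by sd(x) / sd(y)
b            <- coef(raw_fit)[-1]
beta_by_hand <- b * c(sd(sim$agland), sd(sim$prod)) / sd(sim$size)

print(round(coef(beta_fit), 3))    # intercept is zero up to rounding error
print(round(beta_by_hand, 3))
print(c(raw = summary(raw_fit)$r.squared, beta = summary(beta_fit)$r.squared))
```

Because the beta coefficients are on a common scale, they can be compared directly to judge which predictor carries more weight.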

[Figure: diagram of site size, productivity, and agricultural land; labels r² = .83 and r² = .55]
