Lecturer: James Degnan O ce: SMLC 342 O ce hours: MW 1:00

Lecturer: James DegnanOffice: SMLC 342Office hours: MW 1:00-3:00 or by appointmentE-mail: [email protected] include STAT476 or STAT576 in the subject line of the emailto make sure I don’t overlook your email.

Textbook: Methods of Multivariate Analysis 2nd edition, by Alvin C.Rencher

Assessment: Grading will be based homework (roughly 4 assignments inthe semester) (30%), two in-class tests (20% each), and a final exam(30%).

SAS Programming January 16, 2015 1 / 43

We will mostly use R for computing, but may also consider SAS dependingon student interest. Solutions to homework can be done in any softwarepackage, but it will be easier for me to grade and give partial credit forhomework done in R and SAS.

HomeworkFor turning in computer-based homework, turn in all computer codeused as an appendix only. Do not include computer code as part ofyour solutions. Figures and tables can be generated from computeroutput, but solutions must be discussed separately from the output, andthe results in the Figures and Tables should be cited in the homeworksolutions. This will be discussed further in class.

Late homework will be penalized 10% per day. All homework must beprinted (not emailed) and turned in either in class or to my office. Slidinghomework under my door is fine.


Topics

We’ll cover mostly topics from the listed book. Some examples include

I Reviews of matrix algebra mostly as needed

I Multivariate distributions, joint PDFs and CDFs, multivariate normal

I Hotelling’s T 2

I MANOVA (multivariate ANOVA)

I MANCOVA (multivariate ANCOVA)

I Multivariate regression

I Correlation analysis

I Cluster Analysis

I Principal Components Analysis (PCA)

I Multdimensional Scaling (MDS)

I Factor Analysis


What does multivariate statistics refer to?

How is multivariate analysis different from say, multiple regression?

Many statistical models are of the form

Response = Predictors + error

where the Response is a single (i.e., univariate) random variable and thePredictors consist of multiple variables. Typically the predictors are treatedas non-random. This is especially clear in designed experiements, wherethe predictors are controlled by the experimenter, and the only thingrandom is the response.



Often, however, there are covariates among the predictors which are notdirectly in the experimenter’s control. In analysis of variance, you mighthave multiple treatment groups, such as patients being assigned placeboversus various drugs. The response might be, say total cholesterol level.Covariates might include things like sex and age of the patients in thestudy, which are not under the experimenter’s control, and would lead toan analysis of covariance (ANCOVA). Usually in a model like this, eventhough the covariates are not controlled, a linear model treats the responseas random, but not the predictors, including sex and age of the patients.



In the previous example, the total cholesterol level was the variable ofinterest, and understanding variation in cholesterol requires accounting foreffects in the covariates, but otherwise the covariates are not of interest.

In multivariate techniques, we are usually interested two or more randomvariables and understanding how they co-vary, their joint distribution, andpossibly how they are related to predictors and covariates that are notdirectly of interest.

To take the cholesterol example, we might distinguish between “good”cholesterol (high density lipoprotein, or HDL), and “bad” cholesterol (lowdensity lipoprotein, or LDL). If both of these are measured simultaneously,we might want a model that predicts both values as a function of thepredictors.



To continue the cholesterol example, we might have LDL and HDL bothas response variables, so that the model looks like this:

(HDL,LDL) = Drug + Dose + Age + Sex + error

Alternatively, consider this model

LDL = HDL + Drug + Dose + Age + Sex + error

The first model is multivariate, while the second is not using thisterminology, even though both models use the same variables.



In many statistics problems, multiple measuresments are made onindividuals and a somewhat arbitrary decision is made regarding which isthe response (and therefore treated randomly) and which are thepredictors (treated nonrandomly). In simple linear regression, we think ofthe x values as nonrandom and the y values as random, but for manyexamples, particularly in observational data as opposed to designedexperiments, both variables are equally random.

Historically, regression was invented by Francis Galton (cousin of CharlesDarwin of Origin of Species fame) in England. He coined the term“regression” in analyzing data on heights of parents compared to sons.


Regression of Son’s height’s on parent’s heights



In the case of parent’s heights versus son’s heights, it seems silly to thinkof one person’s height as random while another person’s height isnonrandom. On the other hand, it might make more sense to think ofparent’s heights as predicting their children’s heights rather than children’sheights as predicting their parents’ heights.

Other examples might be more symmetric, such as heights of husbandsversus heights of wives (Galton also analyzes this case, using a 3x3contingency table with heights classified as Tall, Medium, and Short). Aregression here needs to make an arbitrary decision about which of thehusband or the wife is considered the random response. Generally, aregression of the husbands’ heights on the wive’s heights will lead to adifferent trendline than a regression of the wives’ heights on the husbands’heights.

Considering the two heights in a multivariate way, we instead think aboutthe joint distribution of the two variables and consider them both random.


Why use multivariate statistics?

Apart from the example just given, a justifcation for multivariate techniques instatistics is that we can sometimes get more information from the jointdistribution of random variables than by looking at marginal distributions (i.e.,the distribution of Y given different levels of X ).

Multivariate techniques can be more computationally intensive than univariatetechniques, and often matrix methods are used. Techniques such as computingthe inverse of a matrix, eigenvalues and eigenvectors, the determinant of amatrix, and singular value decompositions often arise. If data is high-dimensional(lots of variables), it might be particularly difficult to use multivariate techniques.We will often look at cases with fairly low dimensionality. Later in the course, wewill also consider methods of dimension reduction.

While much of statistics is oriented around statistical inference, a lot of

multivariate techniques are concerned with visualization and gaining qualitative

insights into data. Consequently, a number of multivariate techniques are

descriptive and exploratory rather than inferential.


Matrix review: notation

A note on notation. A common convention in statistics is to use uppercase Roman letters for random variables and lower case Roman letters forvalues of random variables (i.e., X versus x in the expression P(X = x)).A very strong convention in mathematics is to use upper case letters formatrices. Depending on the book, the two most common conventions areto either use upper case, (mostly) Roman, bold, nonitalic letters (e.g., A)or to use upper case, (mostly) Roman, italic, nonbold letters (e.g., A). Ifbold characters are used for matrices, then lower case, bold, nonitalicletters are typically used for vectors (e.g., y). The mathematicalconvention seems to be the stronger, and in multivariate settings, there isoften no notational difference between random and nonrandom objects.


Matrix review: notation

Our textbook uses bold, nonitalic letters for matrices and vectors, whetherthey are random or not, and that is the convention I will follow also. Notethat Christensen’s books use nonbold, italic characters for matrices andvectors. Greek letters are also sometimes used for vectors to indicate aparameter that is also a vector, such as µ versus µ.


Matrix review

Note that chapter 2 of the book has a fairly thorough matrix review whichI am following. As an undergraduate, I took Math 321 at UNM ratherthan Math 314. I found that Math 321 didn’t prepare me very well for thematrix algebra needed in statistics in graduate school, although it wasgood preparation for some other concepts like linear spaces and vectorspaces needed for Advanced Linear Models.

At some point in grad school, I picked up the textbook being used forMath 314 to give me more practice with more matrix manipuation (asopposed to proofs) and for certain concepts like eigenvectors andeigenvalues.


Matrix review

A matrix is a rectangular array of numbers with n rows and p columns.These variables are often used in statistics with n representing the samplesize and p the number of parameters (for example, in a regressionproblem).

We index a matrix A by elements aij where 1 ≤ i ≤ n and 1 ≤ j ≤ p.Often we write A = (aij) to emphasize the notation.

We can think of a column vector as an n × 1 matrix and a row vector as a1× p matrix. If something is written as a vector without specifyingwhether it is a column vector or row vector, the default assumption is thatit is a column vector unless the context requires it to be a row vector.


Matrix review: Transpose

A column vector would be written as (for example)

x =

x1x2x3

Just as matrix elements can be indexed by subscripts (and using a nonboldcharacter), so can vectors, such as if x3 referring to the third element of avector.The transpose of a matrix or a vector is written with an apostraphe, suchas

x = (x1 x2 x3)′, x′ = (x1 x2 x3)

The transpose of a matrix changes an n × p matrix into a p × n matrix. IfB = A′, then bji = aij for all 1 ≤ i ≤ n and 1 ≤ j ≤ p.


Matrix review: special matrices

The transpose of a matrix changes the first column of the old matrix to bethe first row of the new matrix, and generally the ith row of the firstmatrix to be the ith column of the second matrix. Note that (A′)′ = A.

A matrix is square if n = p.

A square matrix A is symmetric if and only if aij = aji for all i and j .Similarly, A is symmetric if and only if A = A′.


Matrix review

The diagonal of a square matrix refers to the elements aii , 1 ≤ i ≤ n.

A diagonal matrix, often written D has 0s for all nondiagonal entries. Thatis, if D is diagonal, then dij = 0 for i 6= j .A special function called diag() is used to deal with diagonal matrices. Forexample,

diag(1, 2,−4) =

1 0 00 2 00 0 −4

You can also use diag to set all nondiagonal entries to 0. E.g.,

diag

5 4 90 3 41 2 1

=

5 0 00 3 00 0 4


Matrix review: special matrices

The identiy matrix I plays a role similar to the number 1 in arithmetic andis a square, diagonal matrix with 1s on the diagonal. For any n × p matrixA, we have In×nA = A and AIp×p.

A vector of 1s is denote by j and a matrix of all 1s is denoted by J. Thedimension might be clear from context or you can write things like In×nand J3×3 to help keep track of the dimension.

A matrix of 0s is denoted by O.

A square matrix is upper triangular if all entries below the diagonal are 0,i.e. if aij = 0 for i > j . Zeroes on the diagonal are ok. A square matrix islower triangular if all entries above the diagonal are 0.


Matrix review: matrix operations

The most important operations are addition (and substraction), matrixmultiplication, and scalar multiplication.If two matrices A and B have the same dimension, then C = A + B isdefined by cij = aij + bij .

Note that if two matrices, A and B are identical, then aij = bij for each iand j , and therefore A− B = 0, rather than 0.



To multiply two matrices An×p and Bp×q, the number of columns of thefirst matrix must equal the number of rows of the second matrix (they arecalled conformable). The resulting matrix C = A× B is defined by

cij =

p∑k=1

aikbkj

You can think of this as the dot product of the ith row of A with the jthcolumn of B. The resulting matrix has dimension n × q. This takes a bitof getting used to and can be slow to perform by hand.

Even if the product AB is defined, then the product BA is not definedunless the number of columns of B matches the number of rows of A. Forexample, if A is 2× 3 and B is 3× 2, then AB is 2× 2 and BA is 3× 3.

If A and B are both n × n, then so is their product; however, typically ABis still not equal to BA.



In statistics, it is common to work with a design matrix X. Often insteadof working with this matrix directly, the matrix X′X is used. A usefulfeature of this matrix is that it is symmetric. A useful review is to give anexample and try to prove the general case.As an example, in a regression problem you might have the following:4

52

=

1 31 21 3

(β0β1

)+

e1e2e3

In this problem the design matrix X is the 3× 2 matrix. Part of thesolution of the regression problem involves X′X. We’ll compute thisproduct for this example



We have

C = X′X =

(1 1 13 2 3

)1 31 21 3



Thus,

c11 = 1 · 1 + 1 · 1 + 1 · 1 = 3

c12 = 1 · 3 + 1 · 2 + 1 · 3 = 8

c21 = 3 · 1 + 2 · 1 + 3 · 1 = 8

c22 = 3 · 3 + 2 · 2 + 3 · 3 = 22

Finally,

C = X′X =

(3 88 22

)



Something to notice about this answer is that the matrix is square andsymmetric (since x12 = x21) even though X was not square. Can you showthat this is true for any matrix X with any number of dimensions?



To show that X′X is symmetric, if X is n × p, then X′ is p × n, andtherefore X′X is p × p., so it is square.

We’d like to also show that for C = X′X, C = C′, meaning that C issymmetric.If we look at the elements of C, it is symmetric if cij = cji for an arbitrarychoice of i and j . Let B = X′ so that bij = xji . Applying the rules formatrix multiplication,

cij =n∑

k=1

bikxkj =n∑

k=1

xkixkj

Also,

cji =n∑

k=1

bjkxki =n∑

k=1

xkjxki = cij

Thus, C = C′, so X′X is symmetric.SAS Programming January 16, 2015 26 / 43


A more general property of transposes is that for two conformablematrices, A and B,

(AB)′ = B′A′

Thus,(X′X)′ = X′(X′)′ = X′X


Properties of matrix multiplication

Most properties of addition and multiplication carry over to matrices, witha few differences. Here are some properties:

I A(B + C) = AB + AC (Distributive laws)

I (A + B)C = AC + BC

I (A + B)(C + D) = AC + AD + BC + BD

I A(BC) = (AB)C (associativity)

I for vectors x, y, z

(x− y)′(x− y) = x′x− 2x′y + y′y

The last property relies on the fact that for n × 1 vectors, x′y = y′x, sincethese products results in single numbers (scalars).


Matrix review

Note that if a and b are n× 1 vectors, then a′b is a scalar (single number):

a′b =n∑

i=1

aibi

Consequently,a′b = (a′b)′ = b′a

while C = ab′ is n × n. An example is that jj′ = Jn×n

The length of a vector is√

a′a =√∑n

i=1 a2i .


The inverse of a matrix

Another important concept is the inverse of a matrix. If you have a matrixequation such as

A = BC

You might need to solve for C in terms A and C. For matrices, divisionisn’t defined the way it is for numbers, so we have to do something else.For the equation

x = yz

where the values are numbers, we normally devide both sides by y . Wecan think of this as multiplying both sides by the multiplicative inverse ofy , which is y−1. This is the approach to take with matrices.


The inverse of a matrix

Suppose there is a matrix X such that XB = I, the identity matrix. Then

A = BC⇒ XA = XBC

⇒ XA = IC = C

We say that X is an inverse of B and write X = B−1.Similar to transposes, (AB)−1 = B−1A−1.

The general idea is that the product of a matrix and it’s inverse is theidentity matrix with 1s on the diagonal.

We won’t review general methods of computing inverses by hand, which istedious except for 2× 2 matrices (which we might use later). But youshould be able to compute inverses in software.


Matrix operations in R

You should be able to do some matrix operations in R, but you should alsobe to do them in some computer program. We’ll go over how to do themin R.

First you need to have access to R. This can be downloaded and installedfor free on a personal computer, or used in some of the computer pods.You can also use R online (without downloading) athttp://www.tutorialspoint.com/execute r online.php



We’ll practice using the regression example from before with y = (452)′

and predictor x = (3, 2, 3)′. The design matrix X also has a column of 1s.



Operator or Function Description

A * B Element-wise multiplication

A %*% B Matrix multiplication

A %o% B Outer product. AB’

crossprod(A,B)

crossprod(A) A’B and A’A respectively.

t(A) Transpose

diag(x) Creates diagonal matrix with elements of x in the principal diagonal

diag(A) Returns a vector containing the elements of the principal diagonal

diag(k) If k is a scalar, this creates a k x k identity matrix. Go figure.

solve(A, b) Returns vector x in the equation b = Ax (i.e., A-1b)

solve(A) Inverse of A where A is a square matrix.

ginv(A) Moore-Penrose Generalized Inverse of A.

ginv(A) requires loading the MASS package.

y<-eigen(A) y$val are the eigenvalues of A

y$vec are the eigenvectors of A

y<-svd(A) Single value decomposition of A.


Back to matrix review

Another bit of notation that is sometimes conventient is to use a′i todenote the ith row of matrix A and bj to denote the jth column of B.Then we can write the ijth element of C = AB as a′ibj .

The jth column of C is Abj , and the ith row of C is a′iB.


Matrix review: scalar multiplication

Multiplying by a scalar results in multiplying each element by the sameamount. For cA, the ijth element is caij . Since scalar multiplication iscommutative, cA = Ac .


Quadratic form

A quadratic form is a product of the form

y′Ay =∑i

aiiy2i +

∑i 6=j

aijyiyj

A bilinear form is a product of the form

x′Ay =∑ij

aijxiyj

Since these are both scalars, you can have expressions such as

1

x′Ay

even though an expression like 1/A is undefined.


Partitioned matrices

It’s frequently convenient to partitition matrices, like this a11 a12 b11a21 a22 b21c11 c12 d11

This can also be represented (

A11 A12

A21 A22

)where A11 is 2× 2, A12 is 2× 1, A21 is 1× 2, and A22 is 1× 1.


Partitioned matrices

If A and B are conformable and are both partitioned so that theirpartitions are conformable (i.e., Aij and Bij are conformable), then theproduct can be performed on the submatrices. That is C = AB will have asubmatrix Cij which is defined analogously to cij in the usual matrixmultiplication. For example,(Example on board....)


Rank of a matrix

The rank of a matrix is the number of linearly independent columns orlinearly independent rows. A set of vectors y1, . . . , yk is linearlyindependent if

c1y1 + · · ·+ ckyk = 0

if and only if c1 = · · · ck = 0. In other words, if you can find constantsc1, . . . , ck where at least one of them is nonzero, yet the linear combinationis 0, then the vectors are not linearly independent. This occurs for exampleif two vectors are identical, one vector is the sum of the other two, etc.

In statistics, the design matrix X might not have linearly independentcolumns, in which case we say that X is “not of full rank”. This can alsolead to X′X not being invertible, which leads to the need for generalizedinverses linear model theory.


Positive definite

A matrix A is positive definite if y′Ay > 0 for all y 6= 0 and positivesemi-definite if y′Ay ≥ 0 for all y 6= 0.

A nice thing about matrices of the form X′X is that they are positivedefinite. To show this, let A = X′X. Assume that X is n × p and of fullrank p < n. Then

y′Ay = y′X′Xy = (Xy)′(Xy) = z′z

for z = Xy. (Note that if X is n× p and y is p × 1, then z is n× 1.) HereXy 6= 0 because Xy is a linear combination of the columns of y, and sinceX is full rank, then linear combination cannot equal 0.

If X is not of full rank, then X′X is positive semidefinite instead since youcould have Xy = 0.


Cholesky decomposition

Matrices like X′X are often analogous to squares in algebra. If you go inthe other direction, you can think of factoring a square matrix into aproduct of two matrices, such as A = T′T, in which case T is analogousto a square root of A.

This can be done using the Cholesky decomposition.


Cholesky Decomposition

t11 = a11, t1j =a1jt11

2 ≤ j ≤ n

tii =

√√√√aii −i−1∑k=1

t2ki

tij =

{aij−

∑i−1k=1 tki tkjtii

i < j

0 otherwise


Documents

Lecturer: James Degnan O ce: SMLC 342 O ce hours: MW 1:00