Dimension Reduction Methods
• statistical methods that provide information about point scatters in multivariate space
• “factor analytic methods”
  – simplify complex relationships between cases and/or variables
  – make it easier to recognize patterns
• identify and describe ‘dimensions’ that underlie the input data
  – these may be more fundamental than those directly measured, yet hidden from view
• reduce the dimensionality of the research problem
  – benefit = simplification; reduces the number of variables you have to worry about
• identify sets of variables with similar “behaviour”
How?
Basic Ideas
• imagine a point scatter in multivariate space:
  – the specific values of the numbers used to describe the variables don’t matter
  – we can do anything we want to the numbers, provided we don’t distort the spatial relationships that exist among cases
• some kinds of manipulations help us think about the shape of the scatter in more productive ways
• imagine a two-dimensional scatter of points that show a high degree of correlation…
[Figure: correlated scatter of x vs. y, with the means x̄ and ȳ marked and an orthogonal regression line through the point cloud]
Why bother?
• a more “efficient” description
  – the 1st variable captures the maximum variance
  – the 2nd variable captures the maximum amount of residual variance, at right angles (orthogonal) to the first
• the 1st variable may capture so much of the information content in the original data set that we can ignore the remaining axis
• other advantages…
• you can score original cases (and variables) in the new space, and plot them…
• spatial arrangements may reveal relationships that were hidden in higher-dimensional space
• may reveal subsets of variables based on correlations with the new axes…
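The rotation described above can be sketched in a few lines of numpy. This is an illustrative example on synthetic data (all names are hypothetical, not from the slides): the eigenvectors of the covariance matrix serve as the new orthogonal axes, and with a strongly correlated scatter the first axis captures nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D scatter with a strong correlation between x and y.
x = rng.normal(0, 1, 200)
y = 0.9 * x + rng.normal(0, 0.3, 200)
data = np.column_stack([x, y])

# Center, then diagonalize the covariance matrix: the eigenvectors are
# the new orthogonal axes; the eigenvalues are the variance each captures.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # returned in ascending order
order = np.argsort(eigvals)[::-1]         # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = centered @ eigvecs               # cases scored in the new space

share = eigvals / eigvals.sum()
print(f"1st axis captures {share[0]:.1%} of the variance")
```

Because the axes diagonalize the covariance matrix, the scores on the two new axes are uncorrelated, which is exactly the “orthogonal” property claimed above.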
[Figure: length–width scatter re-expressed along new axes labelled “size” and “shape”]
[Figure: plot of functional categories on two axes, Public–Private and Domestic–Ritual; labelled points include Storage/Cooking, Cooking, Ritual, candelero, and Service?]
Principal Components Analysis (PCA)
why:
• clarify relationships among variables
• clarify relationships among cases

when:
• significant correlations exist among variables

how:
• define new axes (components)
• examine correlations between axes and variables
• find scores of cases on the new axes
[Figure: variables x1–x4 and components pc1, pc2, illustrating component loadings as correlations (r = 1, r = 0, r = −1) between variables and components]

component loading: the correlation between a variable and a component
eigenvalue: sum of all squared loadings on one component
eigenvalues
• the sum of all eigenvalues = 100% of the variance in the original data
• proportion accounted for by each eigenvalue = ev/n (n = # of vars.)
• with a correlation matrix, the variance in each variable = 1
  – if an eigenvalue < 1, it explains less variance than one of the original variables
  – but 0.7 may be a better threshold…
• ‘scree plots’ show the trade-off between loss of information and simplification
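The eigenvalue bookkeeping above can be made concrete with a small numpy sketch (synthetic data; the variables and seed are invented for illustration). Four variables built as two correlated pairs should yield roughly two eigenvalues above the “> 1” threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

# Four variables built as two correlated pairs, so ~2 "real" dimensions.
a = rng.normal(size=300)
b = rng.normal(size=300)
data = np.column_stack([a, a + 0.2 * rng.normal(size=300),
                        b, b + 0.2 * rng.normal(size=300)])

# PCA on the correlation matrix: each variable contributes variance 1,
# so the eigenvalues sum to the number of variables (here, 4).
corr = np.corrcoef(data, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]   # sorted descending

n_vars = data.shape[1]
print("eigenvalues:", np.round(eigvals, 2))
print("proportions (ev/n):", np.round(eigvals / n_vars, 2))

# Components with eigenvalue > 1 each explain more variance than
# one of the original variables.
kept = (eigvals > 1).sum()
```

Plotting `eigvals` against component number gives the scree plot; the “elbow” after the second eigenvalue is the trade-off point between information loss and simplification.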
Mandara Region knife morphology
J. Yellen – San ethnoarchaeology (1977)
• CAMP: the camp identification number (1-16).
• LENGTH: the total number of days the camp was occupied.
• INDIVID: the number of individuals in the principal period of occupation of the camp. Note that not all individuals were at the camp for the entire LENGTH of occupation.
• FAMILY: the number of families occupying the site.
• ALS: the absolute limit of scatter; the total area (m²) over which debris was scattered.
• BONE: the number of animal bone fragments recovered from the site.
• PERS_DAY: the actual number of person-days of occupation (not the product of INDIVID*LENGTH, since not all individuals were at the camp for the entire time).
Correspondence Analysis (CA)
• like a special case of PCA — transforms a table of numerical data into a graphic summary
• ideally a simplified, more interpretable display → deeper understanding of the fundamental relationships/structure inherent in the data
• a map of basic relationships, with much of the “noise” eliminated
• usually reduces the dimensionality of the data…
• derived from methods of contingency table analysis
  – most suited for analysis of categorical data: counts, presence/absence data
• possibly better to use PCA for continuous (i.e., ratio) data
• but CA makes no assumptions about the distribution of the input variables…
CA – basic ideas
• simultaneously an R-mode and Q-mode analysis
• derives two sets of eigenvalues and eigenvectors (→ CA axes, analogous to PCA components)
• input data are scaled so that both sets of eigenvectors occupy comparable spaces
• can reasonably compare both variables and cases in the same plots
CA output
• CA (factor) scores
  – for both cases and variables
• percentage of total inertia per axis
  – like variance in PCA; relates to the dispersal of points around an average value
  – inertia accounted for by each axis → amount of distortion in a graphic display
• loadings
  – correlations between rows/columns and axes
  – which of the original entities are best accounted for by which axis?
“mass”
• as in PCA, new axes maximize the spread of observations in rows/columns
  – spread is measured as inertia, not variance
  – based on a “chi-squared” distance, assessed separately for cases and variables (rows and columns)
• contributions to the definition of CA axes are weighted on the basis of row/column totals
  – e.g., pottery counts from different assemblages: larger collections will have more influence than smaller ones
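A minimal CA sketch in numpy, on a made-up contingency table (the counts are invented for illustration, not data from the slides): a single SVD of the matrix of chi-squared residuals yields both row and column coordinates at once, with row/column masses doing the weighting described above.

```python
import numpy as np

# Hypothetical contingency table: rows = pottery types, columns = assemblages.
N = np.array([[30,  5,  2],
              [10, 20,  5],
              [ 2,  8, 25],
              [ 6, 12,  7]], dtype=float)

P = N / N.sum()                  # correspondence matrix
r = P.sum(axis=1)                # row masses
c = P.sum(axis=0)                # column masses

# Matrix of standardized (chi-squared) residuals.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# One SVD yields BOTH sets of coordinates (simultaneous R- and Q-mode).
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = U * sv / np.sqrt(r)[:, None]    # principal row coordinates
col_coords = Vt.T * sv / np.sqrt(c)[:, None] # principal column coordinates

# Squared singular values are the inertia ("variance") of each CA axis.
inertia = sv**2
print("share of inertia per axis:", np.round(inertia / inertia.sum(), 3))
```

Because rows and columns are scaled by their masses, both sets of coordinates occupy comparable spaces and can reasonably be shown on the same plot; a larger row or column total pulls the axes more strongly, which is the weighting-by-mass behaviour noted above.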
“Israeli political economic concerns”
residential codes:
As/Af (Asia or Africa)
Eu/Am (Europe or America)
Is/AA (Israel, dad lives in Asia or Africa)
Is/EA (Israel, dad lives in Europe or America)
Is/Is (Israel, dad lives in Israel)
“Israeli political economic concerns”
“worry” codes:
ENR  Enlisted relative
SAB  Sabotage
MIL  Military situation
POL  Political situation
ECO  Economic situation
OTH  Other
MTO  More than one worry
PER  Personal economics
       As/Af  Eu/Am  Is/AA  Is/EA  Is/Is
ENR       61    104      8     22      5
SAB       70    117      9     24      7
MIL       97    218     12     28     14
POL       32    118      6     28      7
ECO        4     11      1      2      1
OTH       81    128     14     52     12
MTO       20     42      2      6      0
PER      104     48     14     16      9
Ksar Akil – Up. Pal., Lebanon
Data > Frequency > COUNT
Statistics > Data Reduction > CA
[Correspondence plot: Dim(1) vs. Dim(2), each from −1.0 to 1.0; levels 1–10 plotted together with artifact classes PC, BT, NC, BL, FB]
[Correspondence plot: Dim(1) vs. Dim(2); levels 1–10 plotted with the classes labelled partCor, nonCor, flakeBd, blade, bladelet]
[CA plot of levels 1–10: Dimension 1, eigenvalue .07609 (59.41% of inertia) vs. Dimension 2, eigenvalue .04095 (31.97% of inertia)]
[2D plot of row and column coordinates, dimensions 1 × 2; input table (rows × columns): 10 × 5; standardization: row and column profiles. Levels 1–10 plotted with partCor, nonCor, flakeBd, blade, bladelet. Dimension 1: eigenvalue .07609 (59.41% of inertia); Dimension 2: eigenvalue .04095 (31.97% of inertia)]
Multidimensional Scaling (MDS)
• aim: define a low-dimension space that preserves the distances between cases in the original high-dimension space…
• closely related to CA/PCA, but with an iterative location-shifting procedure…
  – may produce a lower-dimension solution than CA/PCA
  – not simultaneously Q- and R-mode…
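For the metric case, the classical (Torgerson) solution can be sketched without iteration; the non-metric variant the slides mention would instead iteratively shift locations to preserve only the rank order of distances. This is a hypothetical numpy sketch, not any particular package’s implementation:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (metric) MDS: recover k-dimensional coordinates from a
    distance matrix by double-centering and eigendecomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D**2) @ J             # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k], 0))

# Four hypothetical sites on a plane; their pairwise distances are fully
# two-dimensional, so a 2D solution preserves them exactly.
pts = np.array([[0, 0], [10, 0], [10, 20], [50, 30]], dtype=float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)

coords = classical_mds(D, k=2)
D_new = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
print("max distance distortion:", np.abs(D - D_new).max())
```

When the input distances are genuinely higher-dimensional, the discarded eigenvalues measure how much distortion the low-dimension map introduces, which is what a Shepard diagram (below) visualizes.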
[Figure: four sites A–D mapped by EAST and NORTH coordinates, with their inter-site spacing represented under ‘non-metric’ MDS and ‘metric’ MDS]
[Tree diagram for 21 cases (Ward’s method, Euclidean distances; linkage distance 0–300): Atsinna, LPescadoSpr, Cienega, PuebloMuerto, PescadoW, RainbowSpr, Tinaja, NAtsinna, MirabalRuin, Gigantes, DayRanch, JacksLake, UpperSoldado, BoxS, ScribeS, MillerCanyon, Spier61, UPescadoSpr, Spier81, YellowHouse, HeshYalawa]
[2D plot of row coordinates, dimensions 1 × 2; input table (rows × columns): 21 × 10; standardization: row and column profiles. The 21 sites plotted on Dimension 1, eigenvalue .43072 (45.49% of inertia) vs. Dimension 2, eigenvalue .23744 (25.08% of inertia)]
[Scatterplot 2D, final configuration, dimension 1 vs. dimension 2: the same 21 sites plotted in the two-dimensional MDS solution]
[Shepard diagram: input distances (Data) plotted against the reproduced distances (D-hats) of the MDS solution]
Discriminant Analysis (DFA)
• aims:
  – calculate a function that maximizes the ability to discriminate among 2 or more groups, based on a set of descriptive variables
  – assess variables in terms of their relative importance and relevance to the discrimination
  – classify new cases not included in the original analysis
[Figure: two groups of cases plotted on var A vs. var B]
DFA
• # of DFs = # of groups − 1
  – each subsequent function is orthogonal to the last
  – each function is associated with an eigenvalue that reflects how much ‘work’ it does in discriminating between groups
• stepwise vs. complete DFA
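The simplest case of the above, a two-group Fisher discriminant, can be sketched with numpy (synthetic groups and invented numbers, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two groups of cases described by two variables, with different means.
g1 = rng.normal([0, 0], 1.0, size=(50, 2))
g2 = rng.normal([3, 2], 1.0, size=(50, 2))

# Fisher's linear discriminant: w = Sw^-1 (m1 - m2), where Sw is the
# pooled within-group scatter matrix. Scores = cases projected onto w.
m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
Sw = (np.cov(g1, rowvar=False) * (len(g1) - 1)
      + np.cov(g2, rowvar=False) * (len(g2) - 1))
w = np.linalg.solve(Sw, m1 - m2)

def classify(case):
    """Assign a new case to whichever group mean its score lies closer to."""
    score = case @ w
    return 1 if abs(score - m1 @ w) < abs(score - m2 @ w) else 2

print(classify(np.array([0.2, -0.1])))   # a case near group 1's mean
```

With more than two groups, each successive discriminant function is extracted orthogonal to the previous ones, up to groups − 1 functions, exactly as the bullet above states.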
Figure 6.5: Factor structure coefficients: These values show the correlation between Miccaotli ceramic categories and the first two discriminant functions. Categories exhibiting high positive or negative values are the most important for discriminating among A-clusters.
[Figure: case scores on Function 1 vs. Function 2 for clusters MC:acls4/1 to MC:acls4/4, with annotations: + outcurving bowl, + cazuela/crater (comales, other fine-wares); + olla; + outcurving bowl (ollas, other fine-wares); + cazuela/crater]
Figure 6.4: Case scores calculated for the first two functions generated by discriminant analysis, using Miccaotli A-cluster membership as the grouping variable and posterior estimates of ceramic category proportions as discriminating variables.
Figure 6.6: Factor structure coefficients generated by four separate DFA analyses using binary grouping variables derived from Miccaotli A-cluster memberships. A single discriminant function is associated with each A-cluster.