
Who’s at the Top? It Depends . . . Identification of Highest Performing Schools for Newsweek’s 2014 High School Rankings

Authors

Matthew Finster

Jackson Miller

2014


Who’s at the Top? It Depends . . .

Identification of Highest Performing Schools for Newsweek’s 2014 High School Rankings

2014

Prepared for:
Newsweek LLC
7 Hanover Square, 5th Floor
New York, New York 10004

Prepared by:
Westat
An Employee-Owned Research Corporation®
1600 Research Boulevard
Rockville, Maryland 20850-3129
(301) 251-1500


Executive Summary

This report describes the 2014 methodology and analysis for Newsweek’s “Top Public High School” rankings. The 2014 methodology differs from traditional high school ranking methods in that it produced two sets of rankings. We created two lists, an absolute list and a relative list, to demonstrate the consequence of accounting for non-school factors on school rankings. The absolute list ranks schools based solely on the achievement and college readiness indicators. The relative list ranks the highest performing schools after accounting for student poverty. These two lists reveal how the rankings vary when non-school factors (poverty, in this case) are considered.

For both the absolute and relative rankings, we conducted multiple analyses to identify the “top” schools. Using data from the National Center for Education Statistics (NCES), specifically data available through EDFacts and the Common Core of Data (CCD), we conducted a threshold analysis to identify schools with the highest levels of academic achievement based on average student proficiency as measured by scores on state standardized assessments. The absolute list identified the top 20 percent of schools in each state, that is, schools with performance above the 80th percentile. The relative list identified schools that were 0.5 standard deviation (SD) or more above the line of best fit after controlling for the percentage of students eligible for free or reduced-price lunch.

We surveyed schools identified in the threshold analysis to collect data related to college readiness. The web-based survey asked about basic demographic information, graduation rates, college enrollment rates, the number of full-time equivalent (FTE) counselors, the number of students who take the SAT and/or ACT, the average SAT and/or ACT score, the percentage of students taking at least one Advanced Placement (AP)/International Baccalaureate (IB)/Advanced International Certificate of Education (AICE) course, and the school’s average AP/IB/AICE score. Using these data, we created a college readiness index score based on six indicators, namely, counselor FTE (weighted at 10 percent), changes in 9th-grade to 12th-grade enrollment rates (10 percent), a composite SAT/ACT score (17.5 percent), a composite AP/IB score (17.5 percent), high school graduation rates (20 percent), and college enrollment rates (25 percent). We conducted a second analysis to construct the absolute ranking, which ranked schools based on their college readiness index score, and the relative list, which ranked schools based on their distance from the line of best fit when controlling for student poverty.

To highlight schools with similar achievement levels for all students, we noted schools on both lists in which economically disadvantaged students scored at or above the state average in reading and mathematics. This analysis also used NCES data obtained from EDFacts and the CCD.

We conducted several sensitivity tests to assess the extent to which the rankings depend on the judgmental weights and to examine the degree to which interrelationships among the variables may have influenced the final rankings. For this analysis, we produced rankings using a weighting scheme that assigns equal importance to each of the six college readiness indicators. Next, we conducted a Principal Component Analysis (PCA) to generate weights and control for intercorrelations among the variables.

Based on the results of our sensitivity tests, we concluded that school rankings were less sensitive to changes in the weighting scheme than they were to controlling for student poverty levels. However, these results are unique to these data. We encourage others to assess the sensitivity of their rankings to their weighting schemes and to address issues stemming from multicollinearity when ranking high schools.


Acknowledgments

This report was written by Westat for Newsweek’s “Top Public High School” ranking list. The authors, Matthew Finster and Jackson Miller, would like to acknowledge the following people who contributed to this work.

First, we thank the Newsweek staff for their patience and understanding throughout the process, and especially Maria Bromberg for her dedication to the project.

We also thank our project director, Allison Henderson, and our Westat colleagues Jill Feldman and Shen Lee for providing feedback and support. In particular, we would like to thank Anthony Milanowski for his counsel throughout the process, especially regarding the sensitivity tests, and for reviewing the report. Thanks also to Edward Mann and his team for conducting the web-based survey.

We would like to acknowledge the contributions made by the advisory panel: Russell Rumberger at the University of California, Santa Barbara; David Stern at the University of California, Berkeley; Brad Carl at the Value-Added Research Center at the Wisconsin Center for Education Research; Nicholas Morgan at the Strategic Data Project at the Harvard Center for Education Policy Research; and Elisa Villanueva Beard at Teach for America. We thank them for their input.

Last, we want to thank the school staff who took the time to provide us with survey data.

Matthew Finster
Research Associate
Westat
1600 Research Blvd.
Rockville, Maryland 20850

Jackson Miller
Research Associate
Westat
1600 Research Blvd.
Rockville, Maryland 20850


Contents

Introduction
Review of Relevant Literature and Previous Methods
Methodology for Identification of Top Schools for Newsweek’s Top Public High Schools Rankings
    Data Sources
    Analysis Details
        Threshold Analysis: Based on Academic Achievement Indicators
            Absolute Ranking: Identification of high schools that performed the highest on accountability assessments within state
            Relative Ranking: Identification of high schools that exceeded expectations on accountability assessments based on levels of student socioeconomic status, within state
        Ranking Analysis: Based on College Readiness Indicators
            Absolute Ranking: Identification of high schools that had the highest college readiness index scores
            Relative Ranking: Identification of high schools that performed higher than the state average after controlling for students’ socioeconomic status
        Equity Analysis: Identification of high schools that have economically disadvantaged students that are performing higher than the state average
    Limitations
Conclusion and Discussion
Appendix A: Sensitivity Tests
References

Figures
Figure 1: Scatterplot of Schools’ Achievement Index Scores by Percentage of Economically Disadvantaged Students in Connecticut


Introduction

The purpose of this brief is to describe the methodology (hereafter termed the 2014 methodology) used to develop Newsweek’s 2014 High School Rankings. The methodology produced rankings based on two aspects of school performance: the academic achievement of all students on state assessments in reading and mathematics, and the extent to which high schools prepare their students for college. We also incorporated an equity measure to highlight schools with high levels of achievement for all students, defined as schools in which economically disadvantaged students achieve at or above the statewide average for all students.

One of the biggest challenges in ranking school performance is addressing whether, and if so how, to incorporate the substantial influence of family background on students’ success in school. Since the Coleman Report (Coleman et al., 1966) found that family background has a considerable influence on student achievement, numerous studies have confirmed this finding and concluded that students from more affluent families, whether measured by household income or parents’ occupation or level of education, experience substantial advantages over students from less affluent families (Sirin, 2005; Stinebrickner & Stinebrickner, 2003; Sutton & Soderstrom, 1999). Thus, it is generally agreed that schooling outcomes are influenced by a combination of factors inside and outside of school. However, most high school lists ignore the influence of family characteristics when ranking schools.1 When family background characteristics are not included, all student performance is erroneously attributed to the school. Removing the influence of students’ socioeconomic status on school-level outcomes can drastically change the high school rankings (e.g., Toutkoushian & Curtis, 2005).

To address the common criticism that school rankings are driven by student background characteristics, we developed two lists: one referred to as the “absolute” list and another referred to as the “relative” list. The absolute ranking is based on the highest scores on the college readiness index and ranks schools without accounting for student background characteristics. The relative ranking, on the other hand, accounts for students’ socioeconomic status; it controls for student socioeconomic status to clarify the association between influences occurring outside of school and student performance. These two lists are different and answer different types of questions. If someone is interested in knowing which schools perform the best in readying students for college, regardless of whether the performance is attributable to the school or to family background, then one should refer to the absolute list. If, on the other hand, an individual is interested in knowing which schools perform the best on the college readiness index relative to their levels of student socioeconomic status, then one should refer to the relative list.

Furthermore, we developed an equity measure for the schools on both lists to assess economically disadvantaged students’ performance in schools. There are long-standing conceptions of equity and adequacy in education finance that have evolved over time (e.g., Baker & Green, 2009; Berne & Stiefel, 1984). Broadly, the focus of these debates has shifted from the equity of input measures, such as school resources, to adequacy of educational outcomes—for all students. Concurrently, national education policy has been squarely focused on the goal of educating all students to high standards (e.g., No Child Left Behind Act of 2001, Race to the Top Fund). Since high achievement and college readiness for all students is a critical component of a successful high school, for both the absolute and relative rankings we note schools whose economically disadvantaged students are achieving at or above the state average in both reading and mathematics. This approach provides a rough estimate of equity and indicates whether economically disadvantaged students in a school have average performance levels that are at least as high as state averages on standardized reading and mathematics assessments. This measure provides another way to distinguish between schools, based on the size of performance gaps between economically disadvantaged students in a school and all students in the state.

1 An exception is the U.S. News high school rankings produced by the American Institutes for Research (AIR); for details, see Duhon et al. (2013).

The 2014 methodology builds on Newsweek’s 2013 ranking methodology in several ways. First, the 2013 methodology relied solely upon self-reported data provided by schools to create a college readiness index. For 2014, we used secondary data collected by the National Center for Education Statistics (NCES) to supplement self-reported information about college readiness provided by schools. Student achievement data obtained from NCES were used to select a sample of schools (n=4,400 for the absolute list and n=4,100 for the relative list) from which to collect college readiness data. Additionally, using data from the Common Core of Data (CCD), the 2014 methodology includes a variable to control for average student poverty levels, which are known to be associated with student achievement and college readiness. We controlled for student poverty levels to address critiques of past rankings claiming that school performance, as measured by average student test scores, is as much or more a function of student background characteristics as of factors within a school’s control, and thus that most rankings erroneously attribute variation in student performance to schools alone. To illustrate the difference between a list that controls for student poverty and one that does not, we constructed two different rankings, an absolute ranking and a relative ranking. To assess how “sensitive” the rankings were to the judgmental weights we applied, we conducted several sensitivity tests.

The 2014 methodology consists of a similar multi-step analysis for both the absolute and relative rankings. However, the relative ranking accounts for student socioeconomic status, and the absolute list does not. The first step was to assess schools’ performance within their respective state. For this analysis, we constructed an academic achievement index (AI) based on school average proficiency rates on state standardized tests. We conducted this “threshold” analysis to identify schools that performed above a defined cut point. For the absolute list, we selected schools that were at or above the 80th percentile of their state’s performance distribution (i.e., the top 20 percent). For the relative list, we selected schools that performed 0.5 SD or more above the average school in the state with similar proportions of students eligible for free or reduced-price lunch.

Schools that performed above the defined threshold proceeded to the next step—the “ranking” analysis based on the college readiness data. For this step, Westat surveyed schools on both lists to collect data about college readiness indicators (e.g., ratio of students to counselor full-time equivalents (FTE), percentage of students who took Advanced Placement (AP) tests, and average SAT scores). Using these indicators, we created a college readiness index to rank the schools. For the absolute ranking, we ranked schools by their college readiness index score. For the relative ranking, we ranked schools by their college readiness index score, controlling for socioeconomic status. We conducted a separate analysis to assess the sensitivity of the rankings to the judgmental weighting scheme used for the college readiness index. (This analysis is mentioned below and described in detail in appendix A.)

The final step was to analyze whether economically disadvantaged students in each school performed better than the state average in reading and mathematics. Schools in which economically disadvantaged students performed at least as high or higher than the state average in both reading and mathematics were denoted as “equitable” schools.


The next section discusses relevant research, including common critiques, pertaining to school ranking methodologies.

Review of Relevant Literature and Previous Methods

The purpose of the Newsweek high school rankings is to identify the top 500 public high schools in the country. Given the diversity among public high schools across the nation, developing one system to rank schools presents a challenge. The problem is developing one comprehensive ranking system that adequately accounts for heterogeneity among schools. Gladwell (2011) describes this dilemma in his article in The New Yorker, in which he discusses the opposing forces of comprehensiveness vs. heterogeneity. According to Gladwell, a ranking system can be comprehensive so long as it is not trying to rank units (schools, in this case) that are heterogeneous. Similarly, a ranking system can account for heterogeneity so long as it does not also strive for comprehensiveness. Ranking the nation’s top public high schools is, by nature, comprehensive. Yet there is great diversity among public high schools, much of which cannot be accounted for. Schools vary with regard to their settings—urban, suburban, and rural—and the variety of curricula and programs they offer. Even with a narrow focus on college preparation, the extent to which schools focus on AP, International Baccalaureate (IB), or dual-enrollment coursework varies considerably, and the extent to which high school students take the SAT or ACT varies considerably by state. Schools also enroll students with significantly different background and demographic characteristics, and studies have repeatedly demonstrated that these differences influence achievement in a variety of ways (e.g., Sirin, 2005; Stinebrickner & Stinebrickner, 2003; Sutton & Soderstrom, 1999).

To account for at least some of this heterogeneity, we developed two ranking systems. The first ranking scheme identified and ranked schools by their performance on the academic achievement and college readiness measures. In essence, this ranking scheme identifies the schools with the highest absolute performance on these measures, regardless of influencing factors. A problem with this approach is that it ignores the influence on achievement attributable to student background characteristics. So while this ranking scheme identifies schools with the highest scores on these metrics, estimates of the association between school factors and performance are confounded with contributions made by students’ background and family characteristics. That is, it is not clear to what extent student performance is associated with school versus non-school factors. In an effort to address this confound, we devised the relative ranking method to account for student poverty levels.2 Using the relative method, rankings are based on a school’s performance level relative to the socioeconomic status of its student body. Because the relationship between poverty levels and student performance is partially accounted for, schools that are ranked highly using this method are not necessarily the same schools that have the highest absolute levels of performance.

2 This study is not causal and does not make any causal claims about the influence of schools on students. Truly separating the influence of the school from that of student background characteristics would require a causal study. Here, we are parsing out the association of school performance with student backgrounds.

Another important decision when constructing ranking schemes is which variables to include and how much each variable will contribute to the final rankings (Bobko, Roth, & Buster, 2007; Gladwell, 2011; Webster, 1999). To determine which variables to include, we referred to previously used methodologies (e.g., Duhon, Kuriki, Chen, & Noel, 2013; Streib, 2013) and related research (e.g., Grubb, 2009). There is a vast amount of research on the economics of education and education finance that examines the production function of schools (e.g., Grubb, 2009; Harris, 2010) and provides defensible rationales for factors that should be included or accounted for in a ranking scheme (e.g., financial resources, teacher experience, staff development, college pressure, school problems, family background, and student connectedness). Unfortunately, limited data are available for all schools, and the available data exclude many of the recommended factors. (This issue is addressed again in the limitations section.) Hence, we relied on common measures that have been used in previous rankings (e.g., Duhon et al., 2013; Streib, 2013). These variables include student performance on standardized assessments, high school graduation rates, SAT and ACT scores, performance metrics on AP/IB exams, and college acceptance and enrollment rates. (Specific items are discussed in greater detail below.) In addition to commonly used items, we included two variables that we derived from research about school performance metrics. Students’ engagement with counselors has been shown to positively influence college attendance rates (e.g., Grubb, 2009) and thus could be considered one suitable indicator of college preparation. However, due to data limitations, we used a simple school resource measure, the pupil-to-counselor full-time equivalent (FTE) ratio, to reflect this construct. Additionally, we added another variable, referred to as “holding power” (e.g., Balfanz & Legters, 2004; Rumberger & Palardy, 2005), that reflects the dropout and transfer rate of 9th-grade students and can be considered another appropriate indicator of school quality (e.g., Rumberger & Palardy, 2005).

Typically, data for high school rankings are obtained from national or state-level data sets or from self-reported survey data provided by school or school district personnel. Both sources of information have strengths and limitations. Data from national and state sources typically lag by two or three school years. For this reason, rankings that rely on information from these sources use data that are two years old (e.g., Duhon et al., 2013). For example, 2013 rankings would rely on data collected for SY 2011-12. A primary benefit of using these types of datasets is that uniform reporting procedures and data checks are in place, resulting in data that are more reliable with regard to consistency and accuracy, although these data are still self-reported by educational personnel. An alternative is to survey schools directly and ask them to self-report the requested data. While this option allows more current data to be collected than can be obtained through state or national data files, it is more susceptible to participation bias and lacks standardized reporting procedures. To address issues stemming from self-reported data, we established procedures to assess the credibility and plausibility of some data. (These procedures are discussed in the data section.)

An important note is that, for the rankings, the unit of analysis is the school. We used means of student performance indices to indicate school performance. Using average student performance as an indicator of school performance is a common practice, and in fact, most states tie school funding to these metrics (Cobb, 2002; Hall, 2001). This practice has been criticized by some for not distinguishing between student- and school-level performance (e.g., Di Carlo, 2013). However, student means are commonly used as an indicator of school performance, for example, in state funding and accountability systems. The only way to properly address this issue is to use student-level data and conduct a multilevel (e.g., hierarchical) study; however, this kind of student-level data is not currently available for all public high schools.

In addition to determining which variables to include, we also had to determine how to combine them. This process entailed determining how items could be used for comparative purposes and how much to weight these items to generate a composite score.


Methods can involve a single-stage process (e.g., Streib, 2013) or a multi-stage process (e.g., Duhon et al., 2013). A benefit of a multi-stage process is that it establishes a minimum performance standard and conducts additional analysis on just the subsample of schools that meet the initial criteria. It also allows for the inclusion of student achievement data that need to be assessed within state. Due to the lack of comparability across state assessments with regard to their difficulty and content focus, we used relative performance within states to determine the minimum threshold, as opposed to including it as part of the ranking criteria. This approach accounts for the nested structure of standardized assessments within states and identifies schools that met the baseline criteria. (See the methods section for additional details.)

Determining an appropriate weighting scheme is a thorny issue. Ways of combining data elements into composite scores have been debated by statisticians and researchers for decades (Bobko et al., 2007). Three common procedures for combining items into a composite score are expert judgmental weights, equal (or unit) weighting, and regression weights. Judgmental weighting relies on theory and professional judgment to establish a weighting scheme. Equal weighting assigns the same importance to each item in the composite. Regression weighting uses a formula based on the relationship between the measures and a criterion to establish weights, although this approach is only possible when a criterion measure exists, limiting its viability in this case. For public high school rankings, previous methods have relied on judgmental weights (e.g., Duhon et al., 2013; Streib, 2013); however, equal weighting is a viable option. Research that examines judgmental versus equal weighting (e.g., Bobko et al., 2007) points out that in many cases each approach produces similar results. Yet, in some instances, it makes more sense to weight items based on the theorized substantive (or empirical) importance of the variables. Based on this information, we used a judgmental weighting scheme to generate the composite scores and, subsequently, rank the schools. We also assessed the sensitivity of the rankings to the weighting scheme by generating rankings using equal weights and comparing those results to the rankings produced by the judgmental weighting scheme. Furthermore, we conducted a principal component analysis to generate rankings using weights derived from the empirical relationships among the variables. (For a discussion of the equal and PCA weighting analyses, please refer to appendix A.)

One problem to acknowledge, and potentially control for, when using judgmental weights is the intercorrelation between the items. In some cases, items can be so similar that they are measuring the same thing. Previous ranking studies (e.g., Duhon et al., 2013; Streib, 2013) do not indicate whether or how multicollinearity among the variables was accounted for. This is problematic because if the items are moderately or highly correlated, the explicit weights may vary substantially from the actual contributions of the items to the rankings (Webster, 1999). Furthermore, items with very high correlations (e.g., r = +.9) should not all be included in a model; if two items are correlated at r = .9, they are essentially measures of the same underlying construct. Given this, it is important to examine the intercorrelations of indicators used in the ranking formula. If items are highly correlated, the same results could be produced more efficiently by a simpler model that includes fewer indicators. Some methods explicitly control for multicollinearity. Principal component analysis (PCA) is a technique that generates components and weights based on empirical data, and it is one way to assess whether the explicit weights and the actual contributions of the data to the ranking vary substantially. The results of the PCA we conducted are discussed in appendix A. The following section explains and elaborates on the 2014 methodology.


Methodology for Identification of Top Schools for Newsweek’s Top Public High Schools Rankings

For the 2014 methodology, we conducted a multi-step process consisting of a threshold analysis, a ranking analysis, and an equity analysis. The threshold analysis assessed schools’ performance as measured by students’ achievement levels on standardized assessments and identified (1) the schools that have the highest levels of academic achievement (for the absolute ranking) and (2) the schools that have the highest levels of academic achievement given the socioeconomic status of their students (for the relative ranking). Schools identified in the threshold analysis were surveyed to obtain data about college readiness indicators and, pending completion of the survey, proceeded to the second analysis. The second analysis ranked schools by their responses to several college readiness indicators and identified (1) the schools with the highest levels of college readiness and (2) the schools with the highest levels of college readiness after accounting for the socioeconomic status of their students. (Additional analyses were conducted to assess the sensitivity of the rankings to the judgmental weights and to the intercorrelations among the variables; see appendix A for details.) For the last analysis, we compared the performance of each school’s economically disadvantaged student population to the average performance in reading and mathematics for all students within each state.

The methods used to develop the rankings were designed to:

Identify high schools within each state that have the highest performance as measured by academic achievement on state assessments in reading and mathematics

Assess the extent to which the top-performing schools have prepared their students for college and to rank them accordingly

Recognize schools that have high levels of achievement among economically disadvantaged students.

The procedures are similar for both the relative and absolute rankings; however, for the relative ranking, we accounted for the socioeconomic status of the schools’ student body. Details related to each step in the process follow:

Threshold Analysis: Create a high school achievement index based on performance indicators (i.e., proficiency rates on state standardized assessments). For the absolute list, the index was used to identify high schools that perform at or above the 80th percentile within each state. For the relative list, the index was used to identify high schools that perform 0.5 SD or more above their state’s average when accounting for students’ socioeconomic status.

Ranking Analysis: For the high schools on both lists identified in the threshold analysis, we created a college readiness index based on the following six indicators: the ratio of students to counselor FTE, changes in student enrollment from 9th to 12th grade (referred to as holding power), high school graduation rates, a weighted SAT/ACT composite score, a weighted AP/IB composite score, and the percentage of students enrolling in college. The weighting scheme for the index is below:

o Holding Power—10 percent
o Ratio of Counselor FTE to student enrollment—10 percent
o Weighted SAT/ACT composite—17.5 percent
o Weighted AP/IB composite—17.5 percent
o Graduation Rate—20 percent
o Enrollment Rate—25 percent

For the absolute rankings, we rank ordered the schools by their college readiness index scores. For the relative list, we ranked the schools based on how well the schools performed relative to schools with similar proportions of students eligible for free or reduced-price lunch.
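To make the mechanics of the weighting concrete, the sketch below shows one way such a composite could be computed with pandas. It is an illustration only, not the code used for the rankings: the column names and sample values are hypothetical, and the indicators are assumed to be oriented so that higher values are better and to be standardized (z-scored) before the weights are applied.

```python
import pandas as pd

# Judgmental weights for the 2014 college readiness index (they sum to 1.0).
WEIGHTS = {
    "holding_power": 0.10,
    "counselor_fte_per_1000": 0.10,   # counselor FTE per 1,000 students (hypothetical orientation)
    "sat_act_composite": 0.175,
    "ap_ib_composite": 0.175,
    "graduation_rate": 0.20,
    "college_enrollment_rate": 0.25,
}

def college_readiness_index(df: pd.DataFrame) -> pd.Series:
    """Weighted sum of z-scored indicators (all oriented so higher is better)."""
    cols = list(WEIGHTS)
    z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    return sum(w * z[c] for c, w in WEIGHTS.items())

# Hypothetical survey records for three schools.
schools = pd.DataFrame(
    {
        "school": ["A", "B", "C"],
        "holding_power": [0.95, 0.88, 1.02],
        "counselor_fte_per_1000": [4.0, 2.4, 5.6],
        "sat_act_composite": [0.8, -0.2, 1.1],
        "ap_ib_composite": [0.5, 0.1, 0.9],
        "graduation_rate": [0.97, 0.91, 0.99],
        "college_enrollment_rate": [0.88, 0.74, 0.93],
    }
)

schools["cri"] = college_readiness_index(schools)
# Absolute list: rank-order schools by the index (1 = highest score).
schools["absolute_rank"] = schools["cri"].rank(ascending=False).astype(int)
print(schools[["school", "cri", "absolute_rank"]])
```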

Equity Analysis: Of the top high schools identified in the ranking analysis, we then identified schools in which economically disadvantaged students performed better than the state average for all students in reading and mathematics. This part of the analysis did not affect the rankings. Instead, we incorporated this step to recognize schools that have equitable academic performance for economically disadvantaged students as indicated by their performance levels relative to the state average for all students on both the reading and mathematics assessment. Table 1 provides an overview of the steps, analysis, outcomes, and data used for the 2014 methodology.

Table 1: Summary of Proposed Methodology for 2014

Step 1. Threshold Analysis: Student Achievement
Purpose: Identification of high-achieving schools. For the relative list: accounting for socioeconomic status of students.
Analysis: Creation of academic achievement index (AI) scores for each school by state.* For the relative list: within each state, scatterplot of schools’ AI scores by the socioeconomic status indicator (percent free or reduced-price lunch).
Outcome: For the absolute list, schools within each state were rank ordered by their AI score; schools in the top 20 percent within their state proceed to step 2. For the relative list, schools with an AI score 0.5 SD above the line of best fit were selected to proceed to step 2.
Data: Achievement data. Source: NCES EDFacts data. Indicators: academic achievement in mathematics, reading/language arts, and science.

Step 2. Ranking Analysis: College Readiness
Purpose: Identification of schools achieving high marks on college readiness. For the relative list: accounting for socioeconomic status of students.
Analysis: Of the schools that proceed from the threshold analysis, creation of college readiness index scores for each school.* For the relative list: scatterplot of schools’ college readiness scores by the socioeconomic status indicator (percent free or reduced-price lunch).
Outcome: For the absolute list, schools were rank ordered by their scores on the college readiness index. For the relative list, schools were ranked based on their (standardized) distance from the line of best fit.
Data: College readiness data. Source: email/web-based school surveys, NCES data. Indicators: ratio of student enrollment to counselor FTE, change in student cohort enrollment rates from 9th to 12th grade (i.e., holding power), percentage of students taking the SAT/ACT, average SAT/ACT score, graduation rates, college enrollment rates, percentage of students taking at least one AP/International Baccalaureate (IB)/Advanced International Certificate of Education (AICE) course, ratio of AP/IB/AICE tests per student, and average AP/IB/AICE score.

Step 3. Equity Distinction
Purpose: Identification of schools with economically disadvantaged students performing better than the state average on both the math and reading assessments.
Analysis: Of the top schools in tier 1, comparison of schools’ economically disadvantaged students with state averages.
Outcome: Schools with higher achievement than the state average on both reading and math among economically disadvantaged students are recognized as “equitable.”
Data: Subgroup achievement data. Same data as the threshold analysis, by student subgroups.


Data Sources

The threshold and equity analyses are based on state standardized assessment data obtained from NCES. Assessment data available on states’ websites vary with regard to their completeness, limiting comparability and, in some cases, excluding states from the analysis. Given this, we used publicly available NCES data (found at www.data.gov) on performance for all schools during the 2011-12 school year to determine which schools met the initial threshold criteria, using systematically collected data for a more complete set of schools. NCES suppresses publicly available data to protect the identity of students in schools with small student subgroup populations; the suppressed data placed schools into student proficiency ranges that varied from 3 to 50 percentage points. Of the schools contained in the database, we selected regular public high schools, excluding vocational, special education, and alternative schools, as defined by the NCES data manual. For the threshold analysis, we had records and data for 14,454 public high schools. We used this sample in the threshold analysis to construct both the absolute and relative lists. Of the 14,454 schools, the threshold analysis identified approximately 4,400 schools for the absolute list and 4,100 for the relative list. A total of 2,533 schools were on both lists.

We collected survey data from schools that appeared on either list to conduct the ranking analysis for schools that met or exceeded the threshold criteria. A link to the web-based survey was sent to the schools via regular mail and by email to collect basic demographic information, graduation rates, college enrollment rates, the number of counselor FTEs, the number of students taking the SAT/ACT, average SAT/ACT scores, the percentage of students taking at least one AP/IB/AICE course, and average AP/IB/AICE scores. We collected data from the 2011-12 school year to align with the most recent year of publicly available achievement data. For the schools that received a survey, the response rate was 38 percent (n = 1,551) for schools on the relative list and 35 percent (n = 1,529) for schools on the absolute list.

We used the survey data to create the college readiness index score for each of the schools in both of our samples. We also added the holding power variable to the index, which was derived from NCES data to determine the ratio of 12th-grade students in the 2011-12 school year to 9th-grade students in the 2008-09 school year. However, given the limitations of the data in the CCD, we were not able to account for students who transferred to a different school during this period.
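A minimal sketch of that derivation is below, assuming hypothetical CCD-style enrollment counts by grade and year; as noted, transfers cannot be separated out with these counts alone.

```python
import pandas as pd

# Hypothetical grade-level enrollment counts drawn from the CCD.
enrollment = pd.DataFrame(
    {
        "school_id": [1, 2],
        "grade9_2008_09": [320, 150],
        "grade12_2011_12": [281, 149],
    }
)

# Holding power: 12th graders in 2011-12 per 9th grader in 2008-09.
# Students who transferred in or out over the period cannot be separated with these counts.
enrollment["holding_power"] = enrollment["grade12_2011_12"] / enrollment["grade9_2008_09"]
print(enrollment)
```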

After collecting the data, we ran data checks to identify schools with data that were likely incorrect. Schools that reported having more students taking the SAT or ACT—or more students taking AP or IB classes—than the total number of students enrolled were removed from the data. In addition, we removed schools that had a counselor-to-enrollment ratio that was excessively large or small. We defined excessive as any school with a standardized ratio (the z-score of the ratio) above 3 or below -3. Finally, we capped the graduation rate at 100 percent for schools that reported having more high school graduates than total 12th-grade students and capped the college enrollment rate at 100 percent for schools that reported having more college enrollees than high school graduates. In some cases, it is possible for schools to graduate more students, or have more students enroll in college, than the number of 12th-grade students. However, capping these two variables at 100 percent eliminated any advantage a school could gain by graduating students outside of the typical four-year window. (For the sensitivity tests that we conducted, we further screened and cleaned the data to prepare them for multivariate analysis; these procedures are discussed in appendix A.)
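The sketch below illustrates plausibility screens of this kind in pandas. It is not the production cleaning code: the column names and sample records are hypothetical, and only the rules described above (implausible test-taker counts, counselor ratios with |z| > 3, and capping the two rates at 100 percent) are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical survey records (rates as proportions, counts as whole students).
survey = pd.DataFrame(
    {
        "school_id": [1, 2, 3, 4],
        "enrollment": [1200, 640, 900, 400],
        "sat_act_takers": [420, 700, 310, 150],   # school 2 reports more takers than students
        "ap_ib_takers": [300, 120, 260, 80],
        "counselor_fte": [4.0, 2.0, 0.1, 1.5],
        "graduation_rate": [0.97, 0.92, 1.04, 0.95],
        "college_enrollment_rate": [0.88, 0.81, 0.90, 1.10],
    }
)

# 1. Drop schools reporting more test takers than enrolled students.
plausible = (survey["sat_act_takers"] <= survey["enrollment"]) & (
    survey["ap_ib_takers"] <= survey["enrollment"]
)
survey = survey[plausible].copy()

# 2. Drop schools whose enrollment-to-counselor-FTE ratio is an extreme outlier (|z| > 3).
ratio = survey["enrollment"] / survey["counselor_fte"]
z = (ratio - ratio.mean()) / ratio.std(ddof=0)
survey = survey[z.abs() <= 3].copy()

# 3. Cap graduation and college enrollment rates at 100 percent.
survey["graduation_rate"] = np.minimum(survey["graduation_rate"], 1.0)
survey["college_enrollment_rate"] = np.minimum(survey["college_enrollment_rate"], 1.0)
print(survey)
```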


The next section provides additional analytic details.

Analysis Details

This section provides more detailed information about the procedures we used to conduct each analysis and to construct both the absolute and relative rankings.

Threshold Analysis: Based on Academic Achievement Indicators

Absolute Ranking: Identification of high schools that performed the highest on accountability assessments within state

Relative Ranking: Identification of high schools that exceeded expectations on accountability assessments based on levels of student socioeconomic status, within state

Substep 1: Calculate the academic achievement index

As part of the threshold analysis, we created an AI based on proficiency rates on states’ reading/language arts and mathematics assessments. We calculated the index scores by taking the weighted average of the proficiency rates on the two assessments. The equation is:

Weighted avg. = ((# taking reading/language arts assessment * pct. proficient in reading/language arts) + (# taking math assessment * pct. proficient in math)) / (# taking reading/language arts assessment + # taking math assessment)

For the relative ranking, we also calculated the percentage of economically disadvantaged students for each high school:

The percentage of students living in poverty for each school was calculated using data from the CCD, which includes the number of students eligible for free or reduced-price lunch and the total number of students in each school. We calculated the percentage eligible for free or reduced-price lunch by dividing the number eligible by the total number of students in the school.
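A compact sketch of this substep is shown below, using hypothetical column names for the EDFacts proficiency counts and CCD lunch-eligibility counts; it simply applies the weighted-average and percentage formulas described above.

```python
import pandas as pd

# Hypothetical school-level EDFacts/CCD records.
schools = pd.DataFrame(
    {
        "school_id": [1, 2],
        "n_reading": [410, 255],        # students taking the reading/language arts assessment
        "pct_prof_reading": [88.0, 71.0],
        "n_math": [405, 260],           # students taking the mathematics assessment
        "pct_prof_math": [82.0, 64.0],
        "frl_eligible": [120, 190],     # students eligible for free or reduced-price lunch
        "enrollment": [1200, 640],
    }
)

# Academic achievement index: proficiency rates weighted by the number of test takers.
schools["ai"] = (
    schools["n_reading"] * schools["pct_prof_reading"]
    + schools["n_math"] * schools["pct_prof_math"]
) / (schools["n_reading"] + schools["n_math"])

# Percentage of economically disadvantaged students (free or reduced-price lunch).
schools["frl_pct"] = 100 * schools["frl_eligible"] / schools["enrollment"]
print(schools[["school_id", "ai", "frl_pct"]])
```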

Substep 2: Identify top schools

Absolute Ranking: Identify schools that achieve within the top 20 percent on state English language arts and mathematics assessments within their respective state

For this step, we ranked the schools by their AI composite score within state and identified the top 20 percent of schools. Schools in the top 20 percent proceeded to the ranking analysis.
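The within-state cutoff can be illustrated as follows; the data are hypothetical, and the 0.80 quantile corresponds to the 80th-percentile threshold described above.

```python
import pandas as pd

# Hypothetical AI scores for schools in two states.
ai = pd.DataFrame(
    {
        "state": ["CT", "CT", "CT", "CT", "CT", "MD", "MD", "MD", "MD", "MD"],
        "school_id": range(10),
        "ai": [92, 85, 78, 88, 70, 95, 66, 81, 74, 90],
    }
)

# 80th-percentile cutoff within each state; schools at or above it proceed to the ranking analysis.
cutoff = ai.groupby("state")["ai"].transform(lambda s: s.quantile(0.80))
top_20 = ai[ai["ai"] >= cutoff]
print(top_20)
```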


Relative Ranking: Identify schools that have higher than expected achievement based on the state-specific relationship between student achievement and percent eligible for free or reduced-price lunch

We regressed the schools’ AI scores on their percentage of economically disadvantaged students to generate a line of best fit that represents the state-specific relationship between the AI and the percentage of economically disadvantaged students. For example, Figure 1 below shows the relationship between the index and average student poverty for Connecticut. Figure 1 depicts the type of analysis we performed for each state.

Figure 1: Scatterplot of Schools’ Achievement Index Scores by Percentage of Economically Disadvantaged Students in Connecticut

The extent of the difference between a high school’s expected and observed achievement is captured by the residual from the line of best fit. The residual is the vertical difference between the observed value and the expected value indicated by the line of best fit. To identify schools that performed above average, based on the line of best fit generated from the AI scores and the percentage of economically disadvantaged students, we established the cut-off point at +0.5 SD. That is, schools with residuals that met or exceeded +0.5 SD proceeded to the ranking analysis.

[Figure 1: each school’s weighted achievement index (wt_avg) plotted against the percentage of students eligible for free or reduced-price lunch (frlpct), with the fitted regression line.]
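The sketch below illustrates this residual-based screen on simulated data: regress the achievement index on the poverty indicator, compute the residuals, and keep schools at least 0.5 SD above the line of best fit. Variable names and the simulated relationship are hypothetical; the report’s actual models were fit state by state.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical schools within a single state.
state = pd.DataFrame({"frl_pct": rng.uniform(0, 90, 200)})
state["ai"] = 90 - 0.4 * state["frl_pct"] + rng.normal(0, 6, 200)

# Line of best fit: regress AI on percent free or reduced-price lunch.
slope, intercept = np.polyfit(state["frl_pct"], state["ai"], deg=1)
state["expected_ai"] = intercept + slope * state["frl_pct"]

# Residual = observed minus expected; keep schools at least 0.5 SD above the line.
state["residual"] = state["ai"] - state["expected_ai"]
cutoff = 0.5 * state["residual"].std(ddof=0)
exceeds_expectations = state[state["residual"] >= cutoff]
print(len(exceeds_expectations), "schools proceed to the ranking analysis")
```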


Ranking Analysis: Based on College Readiness Indicators

Absolute Ranking: Identification of high schools that had the highest college readiness index scores

Relative Ranking: Identification of high schools that performed higher than the state average after controlling for students’ socioeconomic status

For the ranking analysis, we repeated the procedures used in the threshold analysis, but we used the college readiness indicators to produce a college readiness index score for each school. As in the threshold analysis for the absolute rankings, we ranked schools on their college readiness index score and, for the relative rankings, we ranked schools based on their residuals from the line of best fit using the scatterplots of college readiness index scores by student socioeconomic levels.

Substep 1: Calculation of college readiness index scores for each high school that progressed past the threshold analysis

For the high schools that proceeded beyond the threshold analysis, we created a college readiness index score based on the indicators in the survey. Due to patterns of missing data and high intercorrelations among some indicators, the college readiness index score is based on six indicators: holding power, ratio of counselor FTE to student enrollment, graduation rates, a weighted composite SAT/ACT score, a weighted composite AP/IB score, and college enrollment rates. Regarding high intercorrelations, college acceptance rates and college enrollment rates were highly correlated (r = .94). Due to this high correlation, only one indicator, college enrollment, was selected for inclusion in the index.
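As an illustration of the kind of intercorrelation check that motivated dropping one of the two indicators, the snippet below computes a correlation matrix for a few hypothetical survey columns; the values are invented and only stand in for the pattern described above.

```python
import pandas as pd

# Hypothetical indicator values for five schools (rates as proportions).
survey = pd.DataFrame(
    {
        "college_acceptance_rate": [0.92, 0.81, 0.77, 0.95, 0.68],
        "college_enrollment_rate": [0.90, 0.78, 0.75, 0.93, 0.65],
        "graduation_rate": [0.98, 0.94, 0.90, 0.99, 0.88],
    }
)

# Pairwise Pearson correlations among candidate indicators.
print(survey.corr().round(2))
# When two indicators are nearly collinear (the report found r = .94 for acceptance
# vs. enrollment), only one of the pair is retained in the index.
```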

To account for instances in which students take both the SAT and ACT assessments, and to account for instances where one or the other test is not typically taken, we created an average weighted SAT/ACT composite. The formula for the weighted SAT/ACT is:

Weighted SAT/ACT = ((zSAT * # of students taking the SAT) + (zACT * # of students taking the ACT)) / (# of students taking the SAT + # of students taking the ACT)

Similarly, to account for schools that offer both AP and IB programs, and to adjust accordingly when one program or the other is offered at a school, we created an average weighted AP/IB composite. The formula for the weighted AP/IB composite is:

Weighted AP/IB = ((zAP * # of students taking the AP) + (zIB * # of students taking the IB)) / (# of students taking the AP + # of students taking the IB)
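Both composites follow the same pattern, so a single helper can illustrate them; the z-scores and counts below are hypothetical, and the second call shows how a school with no IB takers simply reduces to its AP z-score.

```python
def weighted_test_composite(z_a: float, n_a: int, z_b: float, n_b: int) -> float:
    """Average two standardized test scores, weighted by how many students took each test."""
    return (z_a * n_a + z_b * n_b) / (n_a + n_b)

# Hypothetical school where 180 students took the SAT and 60 took the ACT.
print(weighted_test_composite(z_a=1.2, n_a=180, z_b=0.9, n_b=60))   # -> 1.125

# A school with AP takers but no IB takers: the IB term gets zero weight.
print(weighted_test_composite(z_a=1.0, n_a=140, z_b=0.0, n_b=0))    # -> 1.0
```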

Substep 2: Determination of weighting scheme

To determine how to weight the six indicators in the college readiness index, we reviewed the literature for previous weighting schemes and the rationales that were provided for using them (e.g., Duhon et al., 2013; Streib, 2013). We also assessed the empirical relationships among the variables using a principal component analysis. Based on these theoretical and empirical sources, we created a judgmental weighting scheme that we used to rank the schools.

As previously discussed, how to combine measures into composite scores has been debated by statisticians and researchers for decades (Bobko et al., 2007). A common criticism is that weighting schemes are somewhat arbitrarily chosen or that the weights are designed to reward certain values (e.g., Gladwell, 2011). A more nuanced criticism is that, because of multicollinearity among the variables, the explicit weights that are used can vary substantially from the actual contributions of each variable when constructing the rankings (Webster, 1999). In other words, because the items are interrelated, the weights often do not indicate how much an item actually contributes to the ranking. To partially address these criticisms, we tested several different weighting schemes to assess the extent to which the rankings changed based on which weighting option we used. As previously mentioned, viable options for combining multiple pieces of information into a composite score include judgmental weighting, equal weighting, and use of regression weights (Bobko et al., 2007). To compare the effects of using different weighting schemes, we used equal weighting, which is a “viable and useful option for forming composite scores” (Bobko et al., 2007, p. 691), and we derived empirical components and respective weights using a principal component analysis (PCA) to control for multicollinearity among the indicators.

The rankings produced by these alternative weighting schemes were highly correlated (r > .90) with the rankings generated by the judgmental weighting scheme. In this case, the three ranking schemes produced similar results; thus, while we cannot escape the fact that our weights are judgmental and reward certain items more highly than others, our results indicate that the rankings produced using judgmental weights are similar to those produced by equal weighting and by a weighting scheme based on the empirical relationships in the data. These findings suggest our rankings are not particularly sensitive to these various weighting schemes. (For more details about the sensitivity tests, see appendix A.)
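A schematic version of such a comparison is sketched below on simulated indicator data: build composites under the judgmental, equal, and PCA-derived weights, then compare the resulting orderings with Spearman rank correlations. Using first-component loadings as PCA weights is just one simple choice for illustration, not necessarily the exact procedure reported in appendix A.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical standardized indicators for 300 schools, in the report's order.
cols = ["holding_power", "counselor_ratio", "sat_act", "ap_ib", "grad_rate", "enroll_rate"]
X = pd.DataFrame(rng.normal(size=(300, 6)), columns=cols)

judgmental = np.array([0.10, 0.10, 0.175, 0.175, 0.20, 0.25])
equal = np.full(6, 1 / 6)

# One simple way to get empirical weights: loadings on the first principal component.
pca = PCA(n_components=1).fit(X)
pca_weights = np.abs(pca.components_[0])
pca_weights = pca_weights / pca_weights.sum()

scores = {
    "judgmental": X.values @ judgmental,
    "equal": X.values @ equal,
    "pca": X.values @ pca_weights,
}

# How similar are the orderings the three schemes produce?
for name in ("equal", "pca"):
    rho, _ = spearmanr(scores["judgmental"], scores[name])
    print(f"judgmental vs {name}: Spearman rho = {rho:.2f}")
```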

Substep 3: Ranking of schools by their college readiness index scores

Similar to the process used in the threshold analysis, we used the college readiness scores to rank the schools. For the absolute ranking, schools were rank ordered by their college readiness score. For the relative ranking, we ranked schools by their residuals, expressed as the distance from the line of best fit. To do this, we created a scatterplot of each school’s college readiness index score by the percentage of economically disadvantaged students. We used these data to generate a line of best fit and to determine each school’s distance from the line of best fit (i.e., the residual), which we then standardized and used to rank the schools.

Equity Analysis: Identification of high schools that have economically disadvantaged students that are performing higher than the state average

For the equity analysis, we assessed the performance of each school’s economically disadvantaged students and compared it with the state’s average performance. In addition to the fact that each state uses its own standardized assessments, there is evidence suggesting that student achievement gaps are nested by state (Reardon, 2014). As a result, each school’s achievement, disaggregated by student subgroup, was compared to the respective state average. We then noted any school in which the economically disadvantaged student population performed better than the state average on both the reading and mathematics assessments. However, due to
data suppression issues, many schools did not have data that were needed to conduct this analysis. Of the top 500 schools on the absolute list, 125 schools were not included. Of schools on the relative list, 93 schools were not included in this analysis. To avoid confusion interpreting these results, we indicated whether a school met our standard, did not meet our standard, or whether these data were not available.
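A small sketch of how the three-way equity flag could be assigned is shown below; the school records, subgroup proficiency values, and state averages are hypothetical, and NaN stands in for suppressed subgroup data.

```python
import numpy as np
import pandas as pd

# Hypothetical subgroup proficiency rates; NaN marks data suppressed by NCES.
equity = pd.DataFrame(
    {
        "school_id": [1, 2, 3],
        "frl_reading": [84.0, 71.0, np.nan],
        "frl_math": [80.0, 66.0, np.nan],
        "state_avg_reading": [78.0, 78.0, 78.0],
        "state_avg_math": [72.0, 72.0, 72.0],
    }
)

meets = (equity["frl_reading"] >= equity["state_avg_reading"]) & (
    equity["frl_math"] >= equity["state_avg_math"]
)
suppressed = equity[["frl_reading", "frl_math"]].isna().any(axis=1)

# Three outcomes, mirroring the reporting above: met, not met, or data not available.
equity["equity_flag"] = np.where(
    suppressed, "data not available", np.where(meets, "met", "not met")
)
print(equity[["school_id", "equity_flag"]])
```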

Limitations

There are several limitations associated with conducting this analysis, many of which stem from the availability of data and their suitability for comparing schools in different states. For example, a wide variety of family and school factors are associated with student academic achievement and college readiness (Sirin, 2005; Stinebrickner & Stinebrickner, 2003; Sutton & Soderstrom, 1999). However, other than student poverty, we have access to data for only a few variables and constructs that have been shown to influence student performance. Additionally, we have no information about a range of school factors that may influence school performance, such as fiscal resources, teacher quality and effectiveness, school leadership, and school climate. These school factors could potentially contribute to student achievement and college readiness, but data are unavailable for a variety of reasons. For example, federal datasets do not collect fiscal data (i.e., school-level fiscal resources) at the school level. In addition, indicators of teacher quality, such as teaching experience and/or credentials, are not reported in a comprehensive manner for all schools, and, even if they were, they are only very loose proxies for teaching quality (Goldhaber, 2002). States, districts, and schools often collect data about school leadership and climate, but variability in how these data are collected and differences in the psychometric properties of surveys make it difficult to compare results across states. The lack of quality data about all public high schools in the United States makes it difficult for ranking methodologies to include the full variety of factors, both within and outside of school, that are known to influence student achievement and college readiness.

Although we were limited by the availability of national-level data to account for school and non-school influences on student achievement and college readiness, our survey is also limited in the extent to which it asks about all types of college preparation programs used across the country. Our survey collected data about six factors that research indicates contribute to students’ college readiness. However, there are other types of college preparation programs that are not accounted for in this study. For example, in lieu of AP or IB courses, some schools offer dual enrollment programs that allow students to earn college credit while still in high school. Similarly, there are many early college high school programs that partner with local colleges and universities to provide advanced coursework. Students in these schools can elect to take college courses rather than AP or IB courses. Including some of these factors in future analyses may enhance the methodology for identifying and ranking “top” performing schools, although some may argue that dual enrollment programs depend on college resources and are not necessarily a high school factor.

Another limitation of the study is the reliance on self-reported data to create the college readiness index. There are a few potential issues with using self-reported data that may have introduced bias into our analysis. First, the final rankings reflect only those schools that participated in the survey, not the entire list of schools identified in the threshold analysis. The rankings are dependent on the schools that responded to the survey and indicate schools’ standings within the sample of
respondents. In this sense, both the absolute and relative lists represent the relative ranking of the schools that responded to the survey, not of the entire sample identified in the threshold analysis.3

Second, using self-reported data also introduces the potential for schools to game the rankings, or simply to inadvertently submit incorrect data, and thereby improve their ranking. If a school submitted erroneous data (intentionally or unintentionally) that appeared plausible, our data checks would not have identified that school, and we would have under- or overestimated its standing relative to the other schools on the list.

We did not control for any factor other than student socioeconomic status. Some schools have selection processes that allow them to select high-achieving students. Many of the top schools on the relative list are magnet schools and may have an application process that allows them to select high-achieving students. In our ranking methodology, these schools would have an advantage over schools that do not have an application process.

In the analysis for the relative list, we used school-level student proficiency rates to select schools that performed better than a pre-determined threshold. For example, the relative high school rankings, which controlled for student poverty levels, could be biased against schools with high levels of achievement and college readiness but low levels of student poverty. This bias is, at least partially, the result of ceiling effects on state assessments across schools within a state. In cases where average scores are near the top of the assessment scale, it is difficult for schools with low levels of poverty to exceed the state average (i.e., line of best fit) by a relatively large distance (e.g., .5 SD), making it difficult for these schools to meet the threshold criteria.

Another limitation is that the school-level proficiency rates in EDFacts are a cross-sectional measure and only represent a school's performance at a single point in time. As a result, these data cannot be used to infer causality (i.e., we cannot say for certain that some action or actions taken by a school resulted in a higher ranking than a school that did not take that same action or actions) or to make comparisons across states. Instead, we can simply compare school performance in the 2011-12 school year with that of other schools within the state (i.e., we could say that school A in state X had more students performing proficiently in math than school B in state X). Our methodology would be improved by using panel or longitudinal data to track student performance over time. Another option for assessing school performance is the use of statistical approaches like value-added models (VAMs) or student growth percentile models, which use longitudinal achievement data as well as other variables (both school and non-school variables) to assess the school's effect on changes in student performance.

3 If the schools responding to the survey are systematically different from those that did not respond, this exposes the rankings to a potential source of bias. We assessed differences between respondents and non-respondents on several observable characteristics, such as student achievement and student poverty levels, and found slight, though significant, differences in means between the two groups for both the absolute and relative groups. On average, schools that responded to the survey had higher student proficiency rates and lower percentages of students eligible for free or reduced-price lunch than non-respondents. For schools on the absolute list, there is a slight, significant difference in means between respondents (m = 85.56, SD = 12.73) and non-respondents (m = 84.06, SD = 13.48) on reading language arts assessments (t(4305) = 3.62, p<.001). Also, non-respondents had a lower mean student proficiency in mathematics assessments than respondents (m = 77.83, SD = 16.77 and m = 80.10, SD = 15.80, respectively) at a statistically significant level (t(4305) = 4.38, p<.001). Non-respondents also had a higher average percentage of students eligible for free or reduced-price lunch than respondents (m = 29.95, SD = 19.46 and m = 27.86, SD = 20.66, respectively) at a statistically significant level (t(4305) = -3.34, p<.001). For schools on the relative list, we found a similar pattern; that is, non-respondents, on average, had lower student performance on reading language arts and mathematics assessments and a higher percentage of students eligible for free or reduced-price lunch than respondents. Non-respondents had a lower mean student proficiency in reading language arts than respondents (m = 79.62, SD = 15.97 and m = 82.39, SD = 14.58, respectively) at a significant level (t(4098) = 5.58, p<.001). Non-respondents had a lower mean student proficiency in mathematics assessments than respondents (m = 73.10, SD = 18.51 and m = 76.49, SD = 17.27, respectively) at a statistically significant level (t(4098) = 5.82, p<.001). And non-respondents had a higher percentage of students eligible for free or reduced-price lunch than respondents (m = 51.33, SD = 23.73 and m = 40.72, SD = 24.72, respectively) at a statistically significant level (t(4098) = -13.66, p<.001).


However, using VAMs or growth models to generate a school ranking would produce a list of "top" schools in which students demonstrate the most growth on achievement measures, not the schools with the highest absolute performance. As we have noted, there are many limitations associated with this analysis. Future school rankings would be improved by including more school and non-school factors that have been shown to be associated with student achievement and college readiness to assess school performance and rank schools accordingly.

Conclusion and Discussion

We used two approaches to rank high schools. The absolute list provides a ranking of schools with the highest average performance on indicators for which we had data, using a judgmental weighting scheme, but without regard for other factors. The relative list provides a ranking of schools that perform the highest, using the same indicators and weights but after controlling for average student poverty levels. We did this in an effort to identify the student performance attributable to the school. While there are many ways to account for student background characteristics, accounting for student poverty in this manner significantly influences which schools are identified as "top" schools. Each method provides a different perspective regarding which high schools are achieving the highest levels of performance based on indicators of academic achievement and college readiness.

The weighting scheme we used is less influential in ranking schools than controlling for the level of student poverty. That is, adjusting the weights makes relatively minor changes to the rankings. Based on the results of the rankings and the sensitivity analysis, we found that including student background characteristics has a larger influence than using various weighting schemes to determine a school's rank. However, these findings are limited to these data and analyses. Researchers developing future school ranking methodologies are encouraged to examine the indicators used for multicollinearity, assess the sensitivity of their rankings to the respective weights assigned to variables specified in their analytic model, and make an intentional decision regarding whether to exclude or include confounding factors.


Appendix A: Sensitivity Tests

Introduction

The purpose of the "sensitivity tests" was to gauge the extent to which different item weights influenced the final high school rankings. To bolster the analysis, we also assessed the variation in the rankings produced using equal (unit) weighting and weights derived from a "purer" measurement model, namely, a principal component analysis. A common critique of ranking systems is that they are sensitive to the item weights, which often are (or at least appear to be) somewhat arbitrarily chosen (Gladwell, 2011; Webster, 1999). Gladwell (2011) highlights this problem in his article in The New Yorker on college rankings by describing in detail how the item weights in the U.S. News and World Report methodology influence the final rankings. So, in some cases, while ranking schemes may identify clear winners, the ranks may be highly dependent on the weights chosen for the variables.

The conundrum of determining the appropriate weights for items to form composite scores has been debated by statisticians and researchers for decades (Bobko et al., 2007). Options to determine the weights for composite scores include using experts to determine judgmental weights, using equal (unit) weights, and/or using empirical information to generate differential weights (Bobko et al., 2007). As previously discussed, high school rankings typically rely on judgmental weights. In their review of the literature on the utility of equal (unit) weighting in creating composite scores, Bobko and colleagues (2007) affirmed the usefulness of equal weighting and recommended equal (unit) weighting as a "useful and viable option when forming composite scores" (p. 691). Hence, for our first sensitivity test, we assessed the extent to which our public high school rankings changed based on the equal weighting vs. our judgmental weights.

Another, more complicated option to generate composite scores is to use the empirical data to generate differential weights using regression methods. A benefit of regression methods, and, more specifically, a principal component regression analysis, is that it can control for inter-correlations between the indicators and identify the most significant components (i.e., the most significant ranking criteria). Using the ranking criteria from the U.S. News & World Report tier rankings of national universities, Webster (1999) conducted a PCA and found that, due to extreme multicollinearity among the indicators, the actual contributions of the ranking criteria were substantially different from the explicit weighting schemes.

Because multicollinearity among the indicators may mask the actual contributions of the items when using the explicit judgmental and equal weighting schemes (e.g., Webster, 1999), we conducted a PCA, among other reasons, to control for multicollinearity and assess the most significant ranking criterion. We used a principal component analysis to (1) reduce the items to components that explain a majority of the variance in the data, (2) use these components to produce component scores (using the regression method) for each respective school, and (3) weight the components by their substantive importance as determined by their respective eigenvalues. In a sense, a PCA is a "purer" measurement model because it accounts for the inter-correlations among the indicators and determines the substantive importance of the respective components based on the variance in the empirical data. To test the sensitivity of the rankings to multicollinearity between the items and to the ranking criteria (i.e., the weights across the items), we created rankings based on the results of the PCA and compared those rankings to the rankings produced by the judgmental and equal weights.

Specifically, our questions were:

1) To what extent will the high school rankings produced from the judgmental and equal weighting schemes correlate?

2) To what extent will the high school rankings produced using weights developed by principal component analysis correlate with the rankings produced from the judgmental and equal weighting schemes?

Methods

In this section, we discuss the methods we used for the equal weighting and, primarily, the PCA. For both the equal weighting analysis and the PCA, we used the data obtained from the school survey discussed previously. For the PCA, component scores could only be produced from complete cases; this reduced the n to 1,080 for the absolute list and 974 for the relative list. Nonetheless, according to Comrey and Lee (1992), a sample size of 1,000 is excellent for factor analysis.

We used the six variables previously described—college enrollment rates, holding power, ratio of counselor FTE to student enrollment, the weighted SAT and ACT composite, the weighted AP and IB composite, and high school graduation rates—to create the rankings. We examined the data for normality, linearity, and outliers; however, since the principal component analysis is being used for descriptive purposes, assumptions regarding the distribution of variables in the sample are not in force (Tabachnick & Fidell, 2007). To address outliers in the data, we capped the graduation and enrollment rates at 100 percent and deleted records that exceeded the standardized z-score value of ±3 for all of the items.
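A rough sketch of this cleaning step, with illustrative column names rather than the survey's actual field names, is below.

```python
# Cap rate variables at 100 percent, then drop records beyond +/-3 SD on any of the six items.
import pandas as pd

def clean_outliers(df: pd.DataFrame, rate_cols: list, all_items: list) -> pd.DataFrame:
    df = df.copy()
    df[rate_cols] = df[rate_cols].clip(upper=100)                      # cap rates at 100 percent
    z = (df[all_items] - df[all_items].mean()) / df[all_items].std()   # standardized (z) values
    return df[(z.abs() <= 3).all(axis=1)]                              # keep rows within +/-3 SD on every item

# cleaned = clean_outliers(survey, rate_cols=["grad_rate", "college_enroll_rate"],
#                          all_items=["grad_rate", "college_enroll_rate", "sat_act_composite",
#                                     "ap_ib_composite", "holding_power", "counselor_fte_ratio"])
```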

Based on Bobko and colleagues’ (2007) recommendations, we used equal (unit) weighting across the

items to generate the school rankings. For the equal weighting analysis, we assigned an equal weight

(i.e., 16.6666) to all six items in the formula and ranked the schools accordingly. To produce the

rankings for the absolute list, the schools’ final scores were rank ordered. To produce the rankings

for the relative list, the schools’ scores were put onto a scatterplot by student poverty levels, and the

schools were subsequently ranked by their residuals from the line of best fit.
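The sketch below shows one way to implement both weighting schemes and both list types, assuming a DataFrame holding the six (already standardized or composited) indicators plus a poverty measure; the column names are placeholders, not the study's variable names.

```python
# Composite scores under judgmental vs. equal weights, and absolute vs. relative rankings.
import numpy as np
import pandas as pd

JUDGMENTAL = {"college_enroll": .25, "grad_rate": .20, "sat_act": .175,
              "ap_ib": .175, "holding_power": .10, "counselor_fte": .10}
EQUAL = {k: 1 / 6 for k in JUDGMENTAL}

def composite(df: pd.DataFrame, weights: dict) -> pd.Series:
    return sum(w * df[col] for col, w in weights.items())

def absolute_rank(df: pd.DataFrame, weights: dict) -> pd.Series:
    return composite(df, weights).rank(ascending=False, method="min")

def relative_rank(df: pd.DataFrame, weights: dict, poverty_col: str = "pct_frl") -> pd.Series:
    score = composite(df, weights)
    slope, intercept = np.polyfit(df[poverty_col], score, deg=1)
    residuals = score - (intercept + slope * df[poverty_col])   # distance above the line of best fit
    return residuals.rank(ascending=False, method="min")
```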

For the second part of our analysis, we conducted a PCA. PCA is a statistical technique applied to a set of variables to reduce the variables into subsets that are totally independent (i.e., orthogonal) of one another. We used PCA to examine whether the six variables could be reduced to a smaller number and to provide weights that are based on components that are not correlated with each other, so that any overweighting of the variables due to their intercorrelations is eliminated.


In PCA, linear combinations (the components) of the original variables are produced, and a small number of these combinations typically accounts for the majority of variability within the set of intercorrelations among the original variables. Once the components have been identified, it is possible to estimate an individual's (in this case, a school's) score on a component based on its scores for the constituent variables. There are several sophisticated techniques for calculating component scores that use component score coefficients as weights in an equation. We used a technique referred to as the regression method, in which the component loadings are adjusted to take account of the initial correlations between variables, resulting in a purer measure of the unique relationship between variables and components. The coefficients were used to produce component (also referred to as factor) scores that were then weighted by their respective eigenvalues, which indicate the substantive importance of the associated component. This process produced an overall score by which the schools were subsequently ranked. For the absolute list, schools were rank ordered by their scores, and for the relative list, schools were ranked by their residuals from the line of best fit in a scatterplot of the school score by student poverty.
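A simplified sketch of that scoring pipeline is below. It uses scikit-learn's PCA on standardized indicators, weights each school's component scores by the corresponding eigenvalues, and ranks by the total; it approximates, rather than reproduces, SPSS's regression-method scores computed from rotated loadings.

```python
# PCA-based overall scores: component scores weighted by eigenvalues, then rank-ordered.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_rank(df: pd.DataFrame, items: list, n_components: int = 4) -> pd.Series:
    X = StandardScaler().fit_transform(df[items])
    pca = PCA(n_components=n_components).fit(X)
    scores = pca.transform(X)                 # one column of component scores per component
    eigenvalues = pca.explained_variance_     # substantive importance of each component
    overall = scores @ eigenvalues            # weight scores by eigenvalues and sum
    return pd.Series(overall, index=df.index).rank(ascending=False, method="min")
```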

There are several important characteristics of PCA. In PCA, all variance is analyzed, and components are simply aggregates of correlated variables. In this sense, the variables cause or produce the component. There is no underlying theory about which variables should be associated with which factors; they are simply empirically associated. Thus, any labels applied to derived components are merely convenient descriptions of the combination of variables associated with them and do not necessarily reflect some underlying process.

Results

Equal weighting vs. judgmental weights

For our first sensitivity test, we determined the correlation between the rankings produced by the equal weighting scheme and the rankings produced by the judgmental weighting scheme for both the absolute and relative lists. Under the equal weighting scheme, we weighted all six variables at .16666. For the judgmental weighting scheme, we weighted each variable based on our research into the factors that affect students' college readiness. Our judgmental weights were .25 for college enrollment rate, .20 for graduation rate, .175 for the ACT/SAT composite, .175 for the AP/IB composite, .10 for holding power, and .10 for the ratio of counselor FTE to 12th-grade students. For the absolute list, the correlation between the lists derived from the judgmental weights and the equal weights is r=0.993. This indicates that there is a high correlation between the two methods and that the two methods result in a very similar ranking of schools. For the relative list, the correlation between the rankings derived from the judgmental and equal weighting is r=0.990. Again, this indicates a high level of correlation between the rankings produced by the judgmental and equal weights.
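The comparison itself is a one-liner once the two rankings exist. The text above does not specify which correlation coefficient was computed; with rank-ordered data, Spearman's rho is a natural choice, so that is what this sketch uses.

```python
# Agreement between two rankings of the same schools (Spearman rank correlation).
from scipy.stats import spearmanr

def ranking_agreement(rank_a, rank_b) -> float:
    rho, _p_value = spearmanr(rank_a, rank_b)
    return rho

# e.g., ranking_agreement(absolute_rank(df, JUDGMENTAL), absolute_rank(df, EQUAL))
```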

In this case, the high correlations are not surprising, because sensitivity to the weighting scheme is influenced by the size of the ratio between the largest and smallest weights applied, and in the case of the judgmental weighting scheme, the ratio is relatively small, with the largest indicator accounting for only 25 percent and the smallest accounting for 10 percent.

Principal Component Analysis

Principal component analysis was performed on the relative and absolute school survey data using the same six indicators. The results from the PCAs, in which we extracted four components using both the absolute and relative school data, are discussed below. In addition to these results, we also ran several variations of the PCAs using different extraction rules and rotation options, each of which produced similar results. In several instances throughout the discussion we refer to those variations.

Tables 1 and 2 show the univariate descriptive statistics for the absolute and relative school data indicators:

Table 1: Descriptive Statistics of Absolute School Indicators

Indicators             Mean       Std. Deviation   Analysis N
z_abs_pctenroll         .114777    .8591348        1080
z_abs_fteenroll        -.084126    .2311849        1080
wt_abs_SATACT          -.031373    .8124109        1080
wt_abs_apib             .020204    .9932075        1080
z_abs_gradrate          .043956    .6338433        1080
z_abs_holdingpower     -.046034    .0474650        1080

Table 2: Descriptive Statistics of Relative School Indicators

Indicators             Mean     Std. Deviation   Analysis N
z_rel_pctenroll         .082     .919            974
z_rel_fteenroll        -.044     .168            974
wt_rel_SATACT          -.009     .820            974
wt_rel_apib             .019     .975            974
z_rel_gradrate          .061     .623            974
z_rel_holdingpower     -.049     .042            974

The correlation matrices for the absolute and relative school indicators are presented below in Tables 3 and 4. Notably, most of the variables are slightly to moderately correlated. In the absolute school data, 10 of the 15 correlations are significant. The percentage enrollment in college item is slightly correlated with the ratio of counselor FTE to student enrollment, the weighted SAT and ACT composite score, the weighted AP and IB composite score, and high school graduation rates. The ratio of counselor FTE to student enrollment is slightly correlated with the percent enrollment in college and high school graduation indicators. The weighted composite score for SAT and ACT is slightly correlated with percent enrollment in college, high school graduation rates, and holding power. The SAT/ACT composite score is also moderately correlated with the weighted AP/IB composite score. The weighted AP/IB composite is slightly correlated with the percent enrollment in college and very weakly (though significantly) with high school graduation rates and holding power.

Table 3: Correlation Matrix of Absolute School Indicators

Indicators              (1)       (2)       (3)       (4)       (5)      (6)
(1) z_abs_pctenroll     1         -         -         -         -        -
(2) z_abs_fteenroll     .141***   1         -         -         -        -
(3) wt_abs_SATACT       .368***   .039      1         -         -        -
(4) wt_abs_apib         .247***  -.034      .502***   1         -        -
(5) z_abs_gradrate      .356***   .127***   .139***   .091***   1        -
(6) z_abs_holdingpower  .039      .015      .096***   .056*     .007     1

Note. * p<.05, ** p<.01, *** p<.001.

In the relative school data, 11 out of 15 correlations are significant. And, again, most of the variables are slightly correlated. However, there are some different patterns among the correlations.

Table 4: Correlation Matrix of Indicators Using Relative School Data

Indicators              (1)       (2)       (3)       (4)       (5)      (6)
(1) z_rel_pctenroll     1         -         -         -         -        -
(2) z_rel_fteenroll     .022      1         -         -         -        -
(3) wt_rel_SATACT       .381***  -.054*     1         -         -        -
(4) wt_rel_apib         .255***  -.039      .488***   1         -        -
(5) z_rel_gradrate      .347***   .053*     .220***   .107***   1        -
(6) z_rel_holdingpower  .055*    -.008      .179***   .080**    .024     1

Note. * p<.05, ** p<.01, *** p<.001.

In the PCA with four components extracted using the absolute school data, the Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy is .612. This KMO statistic meets Kaiser's (1974) recommendation of .5 as a bare minimum and is in the mediocre range based on Hutcheson and Sofroniou's (1999) guidelines. For the relative school data, the KMO measure of sampling adequacy is .635. This statistic also meets Kaiser's (1974) bare minimum criterion and is adequate based on Hutcheson and Sofroniou's (1999) guidelines.

In the PCA using the absolute school data, the diagonal elements of the anti-image correlation matrix are all above .5. Similarly, in the PCA using the relative school data, the diagonal elements of the anti-image correlation matrix are also all above .5. For these values, Field (2005) recommends a bare minimum of .5 for all variables and recommends that variables with lower values be considered for exclusion. However, in this case, no values on the diagonal of the anti-image correlation matrix are below .5, so none were considered for exclusion based on Field's (2005) criteria.

For both the absolute and relative school data, Bartlett's Test of Sphericity is significant (p<.001), indicating factor analysis is appropriate. Bartlett's measure tests the null hypothesis that the original correlation matrix is an identity matrix. A significant value indicates that the R-matrix is not an identity matrix and that there is some relationship between the variables that we capture in our analysis (Field, 2005).
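For reference, both adequacy checks can be computed directly from the correlation matrix of the six indicators; the sketch below writes out the standard formulas (packaged implementations also exist, e.g., in the factor_analyzer library) rather than reproducing SPSS output.

```python
# KMO measure of sampling adequacy (overall and per variable, i.e., the anti-image diagonal)
# and Bartlett's test of sphericity, computed from a p x p correlation matrix `corr` built
# from n observations.
import numpy as np
from scipy.stats import chi2

def kmo(corr):
    inv = np.linalg.inv(corr)
    partial = -inv / np.sqrt(np.outer(np.diag(inv), np.diag(inv)))   # anti-image (partial) correlations
    np.fill_diagonal(partial, 0.0)
    r2 = corr ** 2
    np.fill_diagonal(r2, 0.0)
    p2 = partial ** 2
    msa_per_variable = r2.sum(axis=0) / (r2.sum(axis=0) + p2.sum(axis=0))
    kmo_overall = r2.sum() / (r2.sum() + p2.sum())
    return msa_per_variable, kmo_overall

def bartlett_sphericity(corr, n):
    p = corr.shape[0]
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(corr))
    dof = p * (p - 1) / 2
    return statistic, chi2.sf(statistic, dof)

# corr = np.corrcoef(data, rowvar=False)
# msa, kmo_total = kmo(corr); chi_sq, p_value = bartlett_sphericity(corr, n=len(data))
```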

Another indication of the appropriateness of the model fit is the difference between the correlations in the reproduced matrix and those in the R-matrix. For a good model, the differences (referred to as residuals) will be less than .05; if 50 percent or more are greater, there are grounds for concern (Field, 2005). For the PCA using the absolute school data, there are six (40.0 percent) nonredundant residuals with absolute values greater than 0.05. And, for the PCA using the relative school data, there are also six (40.0 percent) nonredundant residuals with absolute values greater than 0.05. Based on Field's (2005) guidelines, these values do not signal a concern.

It is important to note that since we are using PCA for the sole purpose of data reduction, there is no assumption that there are a certain number of underlying factors that influence the six college readiness measures. These tests are primarily related to whether the sample size and data are sufficient to find such factors. Though that is not our purpose, we included these indices as references to the adequacy of the PCA model.

The first part of the analysis is the extraction process, which determines the linear components within the data set (referred to as eigenvectors). Technically, there can be as many eigenvectors as indicators, yet most will be unimportant. To determine the importance of a component, the eigenvalues can be assessed. The eigenvalues represent the variance explained by a particular linear component, and this value can be presented as the percentage of variance explained. For both the absolute and relative school data, four components were extracted based on Jolliffe's (1972, 1986) recommendation of retaining components with eigenvalues above .7. The scree plots indicate that either two or four components could be justified. Two components would be extracted based on Kaiser's rule; however, Kaiser's rule can be overly strict (Jolliffe, 1972, 1986) and, in these cases, retains only roughly half of the total variance. In accordance with Jolliffe's (1972, 1986) criterion and the scree plot criterion, and to retain more of the total variance, we chose .7 as the extraction cut-off point. (We also ran the model using Kaiser's criterion and found similar results in the output of the PCA.)
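The two extraction rules differ only in the eigenvalue cut-off, as the sketch below makes explicit.

```python
# Number of components retained under a given eigenvalue cut-off (Kaiser: 1.0; Jolliffe: 0.7).
import numpy as np

def n_components_to_extract(corr: np.ndarray, cutoff: float) -> int:
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]    # eigenvalues of the correlation matrix, descending
    return int((eigenvalues > cutoff).sum())

# n_components_to_extract(corr, 1.0)   # Kaiser's rule: 2 components in these data
# n_components_to_extract(corr, 0.7)   # Jolliffe's criterion: 4 components in these data
```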


Figure 1: Scree Plot of PCA Using the Absolute Data

Figure 2: Scree Plot of PCA Using the Relative Data

The extracted four components explain 82.14 percent and 81.85 percent of the cumulative variance in the absolute and relative data, respectively. The components, eigenvalues, and percentage of variance explained for the absolute and relative school data are presented below:


Table 5: Total Variance Explained by the Components Using the Absolute School Data

             Initial Eigenvalues           Extraction Sums of Squared Loadings   Rotation Sums of Squared Loadings
Component    Total   % of Var.  Cum. %     Total   % of Var.  Cum. %             Total   % of Var.  Cum. %
1            1.909   31.81      31.81      1.91    31.81      31.81              1.59    26.57      26.57
2            1.166   19.43      51.23      1.17    19.43      51.23              1.32    22.06      48.63
3             .990   16.50      67.73       .99    16.50      67.73              1.01    16.82      65.45
4             .864   14.41      82.14       .86    14.41      82.14              1.00    16.69      82.14
5             .602   10.03      92.17
6             .470    7.83     100.00

Extraction Method: Principal Component Analysis.

Table 6: Total Variance Explained by the Six Indicators in the Relative School Data

             Initial Eigenvalues           Extraction Sums of Squared Loadings   Rotation Sums of Squared Loadings
Component    Total   % of Var.  Cum. %     Total   % of Var.  Cum. %             Total   % of Var.  Cum. %
1            1.954   32.571     32.571     1.954   32.571     32.571             1.544   25.735     25.735
2            1.093   18.210     50.782     1.093   18.210     50.782             1.352   22.536     48.271
3             .977   16.291     67.072      .977   16.291     67.072             1.014   16.895     65.166
4             .886   14.774     81.846      .886   14.774     81.846             1.001   16.680     81.846
5             .615   10.257     92.103
6             .474    7.897    100.000

Extraction Method: Principal Component Analysis.

The next several tables depict the communalities after extraction. Communality for a variable is the variance accounted for by the components (Tabachnick & Fidell, 2007), which indicates the proportion of common variance present in the variable. Once the factors are extracted, the communalities indicate how much variance is, in fact, common. The communalities after extraction indicate the amount of variance in each variable that can be explained by the retained components (Field, 2005). As such, a communality value of 1 would indicate the variable has no specific variance, and a communality of 0 would indicate that the variable shares none of its variance with any of the other variables (Field, 2005). For the PCA using the absolute and relative school data, the values are all above .6, and the average communality after extraction is .821 and .818, respectively. Of note, the communalities for the variables counselor FTE to student enrollment and holding power are very close to 1 (>.98), which indicates those variables have little specific variance. The communalities are listed in Tables 7 and 8. The values listed in the column labelled Initial are the communalities prior to extraction and, hence, are all 1, since PCA operates on the assumption that all variance is common. The values in the Extraction column are the communalities after extraction and depict the amount of variance in each variable that is explained by the retained components.
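The extraction communalities in Tables 7 and 8 follow directly from the retained loadings, as sketched below.

```python
# Extraction communality for each variable: the sum of its squared loadings on the k retained components.
import numpy as np

def communalities(loadings: np.ndarray) -> np.ndarray:
    return (loadings ** 2).sum(axis=1)

# h2 = communalities(loadings)
# h2.mean()      # average communality after extraction (.821 absolute, .818 relative)
# h2 > 0.98      # flags counselor FTE ratio and holding power as having little specific variance
```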

Table 7: Communalities of PCA Using the Absolute School Data

Indicators             Initial   Extraction
z_abs_pctenroll        1.000     .640
z_abs_fteenroll        1.000     .990
wt_abs_SATACT          1.000     .739
wt_abs_apib            1.000     .739
z_abs_gradrate         1.000     .823
z_abs_holdingpower     1.000     .999

Extraction Method: Principal Component Analysis.

Table 8: Communalities of PCA Using Relative School Data

Indicators             Initial   Extraction
z_rel_pctenroll        1.000     .623
z_rel_fteenroll        1.000     .999
wt_rel_SATACT          1.000     .716
wt_rel_apib            1.000     .791
z_rel_gradrate         1.000     .791
z_rel_holdingpower     1.000     .989

Extraction Method: Principal Component Analysis.

Tables 9 and 10 are the component matrixes. The component matrix shows the loadings of each indicator onto each component. Since we used factor rotation to improve the interpretation, we discuss the results from the rotated component matrix further below.

Table 9: Component Matrix Using Absolute School Data (a)

Indicators             Component loadings (1-4)
z_abs_pctenroll        .731   .272  -.163
z_abs_fteenroll        .194   .654   .273   .671
wt_abs_SATACT          .773  -.322   .188
wt_abs_apib            .679  -.470  -.128   .201
z_abs_gradrate         .501   .548  -.514
z_abs_holdingpower     .162  -.197   .941  -.218

Extraction Method: Principal Component Analysis.
a. 4 components extracted.


Table 10: Component Matrix Using the Relative School Data (a)

Indicators             Component loadings (1-4)
z_rel_pctenroll        .708   .291  -.123  -.149
z_rel_fteenroll        .657   .593   .464
wt_rel_SATACT          .803  -.206   .170
wt_rel_apib            .682  -.290   .488
z_rel_gradrate         .524   .538  -.116  -.462
z_rel_holdingpower     .262  -.400   .770  -.411

Extraction Method: Principal Component Analysis.
a. 4 components extracted.

As demonstrated in the component matrixes, once factors are extracted, it is possible to determine which items load heavily onto which components. However, it is common for most variables to load heavily onto the most important component and to load slightly on less important components, which makes interpretation difficult. A technique used to address this characteristic is referred to as factor rotation; this technique is used to discriminate between the components. Factor rotation effectively makes adjustments to the calculations so that variables load maximally onto only one factor. To aid in interpretation of the loadings and components, we orthogonally rotated the factors using the varimax method. Orthogonal rotation rotates the factors while keeping them independent. (Before items are rotated, they are also independent; this option merely ensures that factors remain uncorrelated.) The varimax method attempts to maximize the diffusion of loadings within factors, effectively loading a smaller number of variables onto each factor (Field, 2005). Varimax is typically recommended when the underlying factors are not theoretically related, that is, when the factors are independent of each other (Field, 2005). In this case, we selected varimax to generate components that are independent of each other, so that the subsequent scores and rankings eliminate correlations between the components, each component makes a unique contribution to the final score, and inter-correlations between variables do not influence the weight of the variables in determining the final score. We compared the results from the PCA analysis with the judgmental and equal rankings, both of which do not account for the relationships between the items, to assess the differences in the rankings.

The component loadings are a gauge of the substantive importance of a given variable to a component and therefore can be used to assign variables to components.4 Stevens (1992) recommends interpreting factor loadings with an absolute value greater than .4. In effect, values of .4 explain roughly 16 percent of the variance in the variable. Based on these recommendations, for the absolute school data, the weighted SAT/ACT and AP/IB items substantively load onto component one; the college enrollment and graduation rates substantively load onto component two; the counselor FTE ratio substantively loads onto component three; and holding power substantively loads onto component four. For the relative school data, there is a similar pattern; however, the items for components three and four are switched (i.e., holding power substantively loads onto component three, and counselor FTE ratio substantively loads onto component four). Tables 11 and 12 below are the rotated component matrixes.

4 It is possible to assess the statistical significance of the factor loadings; however, we did not directly test for significance for two reasons: (1) for our intention of examining the variation in computing weights, statistical significance is irrelevant, since we have a fixed sub-population and are not making inferences about what factors exist in the population, and (2) SPSS does not produce significance tests of factor loadings. Furthermore, statistical significance gives little indication of the substantive importance of a variable to a component. Nonetheless, for those interested, Stevens (1992) produced a table of critical values that can be used to assess the significance of factor loadings. Based on a sample size of 600, the values should be greater than .21, and for 1,000 they should be greater than .162. Based on these criteria, almost all of the loadings are likely statistically significant.
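For reference, a bare-bones varimax rotation can be written in a few lines; this sketch omits the Kaiser normalization that the SPSS run above applies, so its rotated loadings would differ slightly from those in Tables 11 and 12.

```python
# Minimal varimax rotation of a p x k unrotated loading matrix (no Kaiser normalization).
import numpy as np

def varimax(loadings: np.ndarray, gamma: float = 1.0, max_iter: int = 100, tol: float = 1e-6) -> np.ndarray:
    p, k = loadings.shape
    rotation = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated ** 3 - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0))
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        current = s.sum()
        if previous != 0.0 and current / previous < 1 + tol:
            break
        previous = current
    return loadings @ rotation

# rotated_loadings = varimax(unrotated_loadings)   # compare with Tables 11 and 12
```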

Table 11: Rotated Component Matrix Using Absolute Data (a)

Indicators             Component loadings (1-4)
z_abs_pctenroll        .391   .684   .139
z_abs_fteenroll        .991
wt_abs_SATACT          .839   .165
wt_abs_apib            .857
z_abs_gradrate         .906
z_abs_holdingpower     .998

Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.

Table 12: Rotated Component Matrix Using Relative Data (a)

Indicators             Component loadings (1-4)
z_rel_pctenroll        .372   .696
z_rel_fteenroll        .998
wt_rel_SATACT          .780   .276   .172
wt_rel_apib            .889
z_rel_gradrate         .889
z_rel_holdingpower     .992

Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.

In effect, these loadings represent the extent to which the items contribute to the components. The component scores are derived from the factor loadings. The component scores were then multiplied by their respective substantive importance, determined by the rotated eigenvalues. To calculate the component scores for schools, we used the regression method option in SPSS 20. We then multiplied the component scores by the respective rotated eigenvalues to rank order the schools in both the relative and absolute lists.

For both the absolute and relative list rankings, the PCA based on four components extracted is correlated with the judgmental weighting scheme at r=0.93 and with the equal weighting scheme at r=.94. We also ran a PCA model in which we extracted six components to retain all possible variance in the components. The correlations from these models with the judgmental and equal weighting schemes are very similar, resulting in no change or only a one-hundredth change in the correlation values.

Discussion and implications

These "sensitivity tests" address two major critiques of rank-ordered lists such as these. First, the rankings can be sensitive to their respective weights (e.g., Bobko et al., 2007; Gladwell, 2011). Second, without accounting for the multicollinearity among the variables, the explicit weighting scheme may vary vastly from the actual contribution of the variables (e.g., Webster, 1999).

The ranking results produced from the equal weighting schemes for both the absolute and relative lists demonstrate that the judgmental and equal weighting schemes result in a very similar rank order of schools (r=.99). In this case, using the judgmental weights did not result in a substantially different ranking vs. the equal weighting scheme. However, the sensitivity of the judgmental weighting is dependent on the relative size of the weights (regarding proportionality). In other words, substantially changing the weights on one or more items would likely result in fewer similarities between the rankings produced by the judgmental and equal weighting schemes. However, again, under the conditions in which we compared these two composite weighting schemes, the subsequent high school rankings are very similar. These results are unique to our data, including the items and weights that we used for the analysis. But these results are consistent with other research demonstrating that in many cases there are high correlations between judgmental and equal weighting schemes (Webster, 1999).

We now turn to the second critique regarding multicollinearity. In essence, this critique claims that if variables are highly similar (determined by their variance explained and their correlations), then the stated weights do not in actuality represent how the variables contribute to the composite scores. Critics claim that misrepresentation of the weighting schemes can have serious implications for educational policies (e.g., Webster, 1999). We conducted a PCA to (1) assess the empirical loadings of the items on components and (2) produce a ranking of schools based on the component scores and the respective substantive importance of the components. We compared the ranking list derived from the PCA to those from the judgmental and equal weighting schemes to assess the consistency across the three different weighting options. (As an aside, we also used the findings to inform our judgmental weights.)

The results indicate that the three different weighting options resulted in a very similar ranking of public high schools for both the absolute and relative schools. There are high levels of correlation across all three weighting schemes for both the absolute and relative schools, especially between the judgmental and equal weighting schemes. Using a PCA to account for the multicollinearity between the items, with empirically derived item loadings, components, and component weights, also resulted in a public high school ranking list similar to those produced by the judgmental and equal weighting schemes (r=.93 and r=.94). Yet, despite these high levels of correlation, there is still some variation among the rankings of schools based on these different methods. For example, there are some changes in the top 10 schools using the different methods, and, in fact, the top two schools change. However, these analyses do demonstrate that the public high school rankings, in this case, are not particularly sensitive to the differences between our judgmental, equal, and PCA weighting schemes.

It should be noted that these results are empirically derived from our data and unique to it. Future researchers who construct composite scores to produce high school (or, for that matter, college) ranking lists should examine the sensitivity of their lists to their weighting schemes and should examine issues arising from multicollinearity.


References

Baker, B.D., & Green, P.C. (2009) Conceptions, measurement, and application of educational adequacy and equal educational opportunity. In G. Skyes, B. Schneider, & D. Plank (Eds.), Handbook of education policy research. New York, NY: Routledge.

Balfanz, R., & Legters, N. (2004). How many central city high schools have a severe dropout problem, where are they located and who attends them? Initial estimates using the Common Core of Data. In G. Orfield (Ed.), Dropouts in America: Confronting the crisis. Cambridge, MA: Harvard Educational Publishing.

Berne, R., & Stiefel, L. (1984). The measurement of equity in school finance. Baltimore, MD: Johns Hopkins University Press.

Bobko, P., Roth, P., & Buster, M. (2007). The usefulness of unit weights in creating composite scores: A literature review, application to content validity, and meta-analysis. Organizational Research Methods, 10, 689-709.

Cobb, C. (2002). Performance-based accountability systems for public education. Unpublished manuscript, The New Hampshire Center for Public Policy Studies.

Coleman, J.S., Campbell, E.Q., Hobson, C.J., McPartland, J.M., Mood, A.M., Weinfield, F.D., et al. (1966). Equality of educational opportunity. Washington, DC: U.S. Government Printing Office.

Comrey, A.L., & Lee, H.B. (1992). A first course in factor analysis (2nd edition). Hillsdale, NJ: Erlbaum.

Di Carlo, M. (May, 2013). A quick look at "best high school" rankings. Shanker Blog. http://shankerblog.org/?p=8328

Duhon, C., Kuriki, A., Chen, J., & Noel, A. (2013). Identifying top-performing high schools for the "best high school" rankings. Washington, DC: American Institutes for Research.

Field, A. (2005). Discovering statistics using SPSS (2nd edition). Thousand Oaks, CA: Sage Publications.

Gladwell, M. (February, 2011). The order of things: What college rankings really tell us. The New Yorker. http://www.newyorker.com/magazine/2011/02/14/the-order-of-things

Goldhaber, D. (2002). The mystery of good teaching. Education Next, (Spring, 2002), 1-7.

Grubb, W.N. (2009). The money myth: School resources, outcomes and equity. New York: Russell Sage Foundation.

Hall, D. (2001). How states pursue public school accountability and measure performance of public schools, school year 1998-1999. Unpublished manuscript, The New Hampshire Center for Public Policy Studies.


Harris, D.N. (2010). Education production functions: Concepts. In D. Brewer & P. McEwan (Eds.), Economics of education. San Diego, CA: Elsevier Ltd.

Hutcheson, G., & Sofroniou, N. (1999). The multivariate social scientist. London: Sage.

Jolliffe, I.T. (1972). Discarding variables in a principal component analysis, I: artificial data. Applied Statistics, 21, 160-173.

Jolliffe, I.T. (1986). Principal component analysis. New York: Springer.

Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika, 39, 31-36.

Reardon, S. (2014). Education. In State of the Union: The poverty and inequality report. Pathways: a magazine on poverty, inequality, and social policy. Stanford. Stanford Center on Poverty and Inequality. Special issue 2014.

Rumberger, R., & Palardy, G. (2005). Test scores, dropout rates, and transfer rates as alternative indicators of high school performance. American Education Research Journal, 42(1), 3-42.

Sirin, S.R. (2005). Socioeconomic status and academic achievement: A meta-analytic review of research. Review of Educational Research, 75(3), 417-453. doi:10.3102/00346543075003417.

Stinebrickner, R., & Stinebrickner, T. (2003). Understanding educational outcomes of students from low-income families. Journal of Human Resources, 38(3), 591-617.

Streib, L. (2013, May 6). America's Best High Schools 2013: Behind the Rankings. Retrieved from http://www.thedailybeast.com/articles/2013/05/06/america-s-best-high-schools-2013-behind-the-rankings.html

Sutton, A., & Soderstrom, I. (1999). Predicting elementary and secondary school achievement with school-related and demographic factors. The Journal of Educational Research, 92, 330-338.

Tabachnick, B., & Fidell, L. (2007). Using multivariate statistics (5th edition). Boston: Allyn & Bacon.

Toutkoushian, R.K., & Curtis, T.C. (2005). Effects of socioeconomic factors on public high school outcomes and rankings. The Journal of Educational Research, 98(5), 259-271.

Webster, T.J. (1999). A principal component analysis of the U.S. News & World Report tier rankings of colleges and universities. Economics of Education Review, 20, 235-244.