
Engage | measure | empower | transform the learner first.™

January 2017

Standardized Tests and Authentic Mixed Method Assessment

Why the results are different

© 2017 Learner First LLC

Written by Joanne McEachen and Jennifer Mersman

Published by Learner First LLC, Seattle, WA


As organizations go through the process of whole-system change, resistance naturally occurs and counters the forces for change. According to many change models, this is an expected part of the cycle of change (Lewin, 1947; Kotter, 1996). In our experience helping school systems transform from using a single indicator of student performance to using Authentic Mixed Method Assessment (AMMA), we have observed certain patterns of resistance:

• Beliefs that people thought they had changed reveal themselves to still be pervasive,

• Reverting to the old way of doing things because it is comfortable, and

• A lack of trust in the results.

As we get into the deeper levels of the Cultural Iceberg (see Figure 1), we see resistance to the underlying beliefs changing. The test score has been the gold standard for so long that the system has demanded that all stakeholders be beholden to it. In fact, these narrowly defined measures of success can be barriers to whole-system change (Fullan & Langworthy, 2014).

Part of going deeper into the Cultural Iceberg is challenging the belief that the test score is the only “true” performance measure against which all other success indicators should be gauged. This evaluative criterion simply does not hold in the logic of authentic assessment (Wiggins, 1990), nor does it fit with the Every Student Succeeds Act (Darling-Hammond & Hill, 2015; Darling-Hammond, Bae, Cook-Harvey, Lam, Mercer, Podolsky, & Stosich, 2016). When looking at multiple points of evidence for student success, one cannot let a single indicator reign supreme. Test scores are an important part of a fuller picture of student success, but they are not the most important indicator.

Figure 1: The Cultural Iceberg.


The singular reliance on standardized test scores as a measure of student achievement is fundamentally flawed, both conceptually and practically (Harris, Harris, & Smith, 2012). We understand how we got here with the No Child Left Behind Act, and school leaders agree that we need changed mindsets about this. But as we undergo whole-system reform, we still see reversion to a test-centric way of thinking; leaders want to be free of it, but they still find themselves rooted in it. This is why the deeper levels of the Cultural Iceberg are essential in sustained system change. You can’t shift thinking and practice without shifting which outcome measures you pay the most attention to; they must shift in tandem, and as they do, recalibration naturally must occur.

Whole-system change involves the process of calibration: the system is learning a new language and developing a shared understanding. Persistence in this will drive consistency. At a system level, this takes time as sensing, diffusion, and calibration occur (Cummings & Worley, 2014), and it happens at different rates across the system. Teachers using a new process like AMMA will experience a learning curve as they calibrate their thinking and judgment against others’. Examples of diffusing this calibration process include schools using Professional Learning Community time for change team members to train and transfer learning on the AMMA process, discuss how to implement rubrics, and collaborate on ratings. If calibration is not given time, and strategies for new ways of knowing and thinking are prematurely abandoned, this can create a “culture of churn” where subsequent change initiatives are unwittingly set up for eventual failure (Blad, 2016).

AMMA is about defining what is important to measure, harder to measure, and essential to measure if we really want to discuss student success beyond a single metric of performance (Davidson & McEachen, 2015).

Indicators of Student Success and Construct Validity

Reliance on test scores as the gold standard for evaluating student success makes perfect sense when we analyze the historical and political factors that fostered it (NEA, 2011; Popham, 1999). From a more technical, measurement standpoint, the security in the test score derives from perceptions of increased psychometric quality. But when we pull the lens back to the underpinning of all validity, construct validity, using a single indicator to measure a construct (such as student success) is much less robust than using multiple indicators. Construct validity is reduced when we fail to include important facets of that construct in its measurement: “although any piece of evidence may be fallible and some may be spurious, the continuing construct validation process attempts to appraise, and take into account, the nature and extent of such distortion in the evolving validity judgment” (Messick, 1990, p. 19). To measure student success with solely one indicator is what Messick (1989) calls underrepresentation, a threat to construct validity.

Construct validity hinges on fully defining the concept of interest and adequately measuring as many aspects of that concept as possible. We understand this in the real world of work in organizations, where taxonomies of work performance include, but are not limited to, countable outcomes (absenteeism, units produced). Aspects like organizational citizenship behavior, communication skills, and leadership skills flesh out the performance domain to give a broader, more comprehensive view of performance (Guion, 2011). With student performance (as with all performance), test scores take a narrow, specific slice into that construct space. Other bits of evidence flesh that out and provide a fuller, richer (and therefore more accurate) understanding of the concept as a whole. Compare how much more of the construct space (red circle) is being accounted for and fleshed out in Figure 3 than in Figure 2.

Figure 2: Construct space examined through testing alone.

Figure 3: Construct space examined through multiple kinds of indicators.
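To make the intuition behind Figures 2 and 3 concrete, the following is a minimal simulation sketch, not drawn from the paper’s data. It assumes a hypothetical latent “student success” construct and four equally noisy indicators, and shows that a composite of several indicators tracks the construct more closely than any single indicator (here, the test score) does on its own.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000

    # Latent "student success" construct (not directly observable).
    success = rng.normal(size=n)

    # One narrow indicator: a standardized test score (part construct, part noise).
    test_score = 0.6 * success + rng.normal(scale=0.8, size=n)

    # Additional hypothetical indicators tapping other facets: teacher judgment,
    # student self-report, work samples (each also imperfect).
    others = [0.6 * success + rng.normal(scale=0.8, size=n) for _ in range(3)]
    composite = np.mean([test_score, *others], axis=0)

    print("single indicator vs. construct:", np.corrcoef(test_score, success)[0, 1])
    print("multi-indicator composite vs. construct:", np.corrcoef(composite, success)[0, 1])

Under these assumptions, the composite correlates noticeably more strongly with the latent construct than the test score alone, which is the point Figures 2 and 3 convey visually.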


When diagnosing an illness, doctors look at more than just blood pressure; they take into account history, how the patient feels, blood tests, etc. Or with a new car purchase, several factors must be synthesized to arrive at the best choice according to need: cost, reliability, gas mileage, cargo/passenger capacity, etc. A car could have a consistent, valid measure of gas mileage of 22 mpg on the highway (a single indicator) but not be able to transport a soccer team, or it may stall often, leading to frequent tardiness. We understand the single-indicator issue in real-life examples, yet continue to make educational decisions and policies based on mono-method measures of achievement (Brookhart, 2009).

Authentic Mixed Method Assessment and Test Scores Are Not Equivalent

During the 2015-2016 school year, The Learner First partnered with a medium-sized urban school district to implement AMMA. By the end of that year, student performance measured against AMMA had notably improved. That increase was not perfectly reflected by the test scores, but we would not expect it to be.

Evaluating AMMA ratings against test scores comes out of the test-centric mindset; they are related but not interchangeable. Indeed, we see a correlation of .4 between Spring 2016 AMMA ratings for the district math target and 2016 state standardized test scores. This approaches the association between teacher-assigned grades and test scores, which is on the order of .5 to .6 (Bowers, 2011), and one would be hard pressed to argue the equivalency of grades and standardized test scores.
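As a hedged illustration of the kind of correlation check described above, the sketch below computes the association between spring AMMA ratings and state test scale scores. The DataFrame, column names, score values, and the ordinal coding of AMMA ratings are hypothetical placeholders, not the district’s data.

    import pandas as pd

    # Hypothetical matched records: one row per student.
    df = pd.DataFrame({
        # AMMA rating coded ordinally, e.g. 1 = substantially off track ... 4 = well on track
        "amma_spring_2016": [1, 2, 2, 3, 4, 3, 2, 4],
        # State standardized test scale score
        "state_test_2016": [410, 432, 451, 470, 505, 468, 455, 512],
    })

    # Pearson correlation; Spearman is a reasonable alternative for ordinal ratings.
    print(df["amma_spring_2016"].corr(df["state_test_2016"]))
    print(df["amma_spring_2016"].corr(df["state_test_2016"], method="spearman"))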

If we had significant proportions of students scoring below the benchmark for the state standardized test receiving AMMA ratings of well on track or higher, this would be cause for concern about the fidelity of the AMMA process. But, in fact, the graph below (Figure 4) shows an expected pattern of results when 2016 state standardized test scores are cross-tabbed with Spring 2016 AMMA ratings. We have the highest proportions of well on track or better scoring above benchmark, and the highest proportions of substantially off track or not yet on track scoring below benchmark. The fact that these AMMA ratings and state standardized test scores did not map perfectly onto one another is reasonable, given the process of calibration discussed above, as well as the more comprehensive nature of AMMA. For example, the 17% of students rated as well on track or better who scored unsatisfactory on the state standardized test made beyond-expected growth in the competence and application of grade-level math concepts, yet did not subsequently score above benchmark in the spring. This shows performance in math in areas that standardized test scores cannot (and did not) tap.
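For readers who want to reproduce this kind of view with their own data, a minimal pandas sketch of the cross-tabulation behind Figure 4 follows. The column names, rating labels, and rows are invented for illustration; only the shape of the analysis comes from the discussion above.

    import pandas as pd

    df = pd.DataFrame({
        "amma_rating": ["well on track", "on track", "not yet on track",
                        "substantially off track", "well on track", "on track"],
        "test_band": ["above benchmark", "above benchmark", "below benchmark",
                      "below benchmark", "below benchmark", "above benchmark"],
    })

    # Row-normalized proportions: for each AMMA rating, the share of students
    # scoring above vs. below the state benchmark.
    print(pd.crosstab(df["amma_rating"], df["test_band"], normalize="index"))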


Figure 4: AMMA ratings as compared with the state standardized test results for 2016.

Based on teacher observations of students grasping grade-level math concepts and applying them in their lives, student reports of how relevant math is to their lives, and conversations with parents and families, teachers saw acceleration beyond what we’d expect in a year. The fact that these accelerations did not map precisely onto the state test results is not evidence to devalue the ratings; these accelerations were based on matched data. The rubric for AMMA is based on a broader definition of success than the benchmarks on the state test. Overall (cross-sectionally), test scores declined from last school year to this school year. But when we look at all the available evidence going into AMMA ratings longitudinally over the course of a year, we see a net gain of 19% of the math target demographic demonstrating beyond-expected growth in competence and application of math. This kind of analysis breaks down what is happening within that cross-sectional data by using matched data, and it underscores the importance of looking at all the available evidence.
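The matched (longitudinal) analysis described above can be sketched as follows. This is an assumed outline, not the district’s actual pipeline: the student IDs, ratings, and the rule for “beyond expected growth” (coded here as gaining two or more rating levels) are placeholders.

    import pandas as pd

    # Hypothetical fall and spring AMMA ratings for the same students.
    fall = pd.DataFrame({"student_id": [1, 2, 3, 4, 5],
                         "amma_fall": [1, 2, 2, 3, 1]})
    spring = pd.DataFrame({"student_id": [1, 2, 3, 4, 5],
                           "amma_spring": [3, 2, 4, 4, 2]})

    # Match students across time points rather than comparing two different cohorts.
    matched = fall.merge(spring, on="student_id", how="inner")

    # "Beyond expected growth" is assumed here to mean gaining two or more rating
    # levels within the year; the actual AMMA rubric defines growth more broadly.
    beyond_expected = (matched["amma_spring"] - matched["amma_fall"]) >= 2
    print(f"{beyond_expected.mean():.0%} of matched students showed beyond-expected growth")

Matching on student ID is what distinguishes this view from a cross-sectional comparison of two different cohorts’ test scores.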

By focusing on more meaningful outcomes like understanding and applying concepts in the real world, and by making math more relevant to students’ lives, teachers created the learning experiences that helped kids grasp concepts. This is evidenced by 22% of students in the math target demographic gaining an interest in math (TLF Math Relevance survey, matched data), teachers’ evaluations of in-class work and assignments, teacher observations, and conversations with students and families about their everyday application of math (all of which are part of the rubric, but are not measured quantitatively).

Figure 5: Students above and below the AMMA benchmark from Fall 2015 to Spring 2016.

Don’t be bossed around by one metric. Take a balanced approach to measuring performance, as we do in real-world contexts. Put it in perspective. Trust your teachers. Trust yourselves as you begin to embrace a more nuanced and comprehensive way of defining success and knowing whether you’ve attained it. The result is a broader, fuller picture and understanding of student performance, as opposed to a sharply focused detail of one narrow aspect.


References

Blad, E. (2016). New law widens lens to gauge school success. Education Week, 35(15), 15.

Bowers, A. J. (2011). What’s in a Grade? The Multidimensional Nature of What Teacher-Assigned Grades Assess in High School. Educational Research & Evaluation, 17(3), 141-159.

Brookhart, S. M. (2009). The many meanings of “multiple measures.” Educational Leadership, 67(3), 6-12.

Cummings, T. G., & Worley, C. G. (2014). Organization Development and Change. Stamford, CT: Cengage Learning.

Darling-Hammond, L., Bae, S., Cook-Harvey, C. M., Lam, L., Mercer, C., Podolsky, A., & Stosich, E. L. (2016). Pathways to New Accountability Through Every Student Succeeds Act. Retrieved from Learning Policy Institute website: https://learningpolicyinstitute.org

Darling-Hammond, L., & Hill, P. T. (2015). Accountability and the federal role: A third way on ESSA. Retrieved from Stanford Center for Opportunity Policy in Education website: https://edpolicy.stanford.edu/publications

Davidson, E. J., & McEachen, J. (2015). Making the Important Measurable: Not the Measurable Important. Seattle, WA: The Learner First.

Guion, R. M. (2011). Assessment, measurement, and prediction for personnel decisions. New York, NY: Routledge.

Harris, P., Harris, J., & Smith, B. M. (2012). Standardized Tests Do Not Effectively Measure Student Achievement. In D. Bryfonski (Ed.), At Issue: Standardized Testing. Detroit, MI: Greenhaven Press. (Reprinted from Chapter 3: The Tests Don’t Measure Achievement Adequately, The Myths of Standardized Tests: Why They Don’t Tell You What You Think They Do, pp. 33-45, 2011). Retrieved from http://libraries.state.ma.us/login?gwurl=http://ic.galegroup.com/ic/ovic/ViewpointsDetailsPage/ViewpointsDetailsWindow?query=&prodId=OVIC&contentModules=&dviSelectedPage=&displayGroupName=Viewpoints&limiter=&disableHighlighting=&displayGroups=&sortBy=&search_within_results=&zid=&p=OVIC&action=2&catId=&activityType=&documentId=GALE%7CEJ3010478218&source=Bookmark&u=mlin_m_highrock&jsid=49ea537c93a441192572947cf3e64898

Kotter, J. P. (1996). Leading Change. Boston, MA: Harvard Business School Press.

Lewin, K. (1947). Frontiers in group dynamics: Concept, method and reality in social science; social equilibria and social change. Human Relations, 1, 5–41.

Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11.

Messick, S. (1990). Validity of test interpretation and use (Report No. ETS-RR-90-11). Princeton, NJ: Educational Testing Service.

Moses, M. S., & Nanna, M. J. (2007). The testing culture and the persistence of high-stakes testing reforms. Education and Culture, 23(1), 55-72.

National Education Association. (2011). Beyond Two Test Scores: Multiple Measures of Student Learning and School Accountability. Retrieved from http://www.nea.org/assets/docs/PB38beyondtwotestscores2011.pdf

Popham, W. J. (1999). Why standardized tests don’t measure educational quality. Educational Leadership, 56(6), 8-15.

Wiggins, G. (1990). The case for authentic assessment. Practical Assessment, Research & Evaluation, 2(2). Retrieved from http://pareonline.net/getvn.asp?v=2&n=2