The Pennsylvania State University

The Graduate School

College of Education

EXAMINING THE RELATIONSHIPS BETWEEN STUDENT ACHIEVEMENT

AND TEACHER MONITORING AND EVALUATION IN LOWER SECONDARY

AND SECONDARY SCHOOLS: A MULTINATIONAL STUDY

A Dissertation in

Educational Theory and Policy

by

Gulab Khan

© 2013 Gulab Khan

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

December 2013

The dissertation of Gulab Khan was reviewed and approved* by the following:

Mindy L. Kornhaber

Associate Professor of Education

Dissertation Co-Adviser

Co-Chair of Committee

Liang Zhang

Associate Professor of Education and Labor Studies

Dissertation Co-Adviser

Co-Chair of Committee

Hoi Suen

Distinguished Professor of Educational Psychology

Soo-yong Byun

Assistant Professor of Education

Gerald LeTendre

Professor of Education

Department Head, Education Policy Studies

*Signatures are on file in the Graduate School

ABSTRACT

Teacher quality is a significant determinant of student achievement in schools. One

way through which schools endeavor to improve the quality of their teachers, and hence

student achievement, is by evaluating them, identifying their professional needs, and making

them accountable for the quality of their practice. While there is a general agreement that

teachers should be monitored and evaluated, there is variation in the approaches and purposes

of the process across schools and educational contexts. This dissertation responds to the

research question, “How do teacher monitoring and evaluation practices and purposes

associate with student achievement in mathematics, science, and reading in lower secondary

and secondary schools?” The study employs Ordinary Least Squares as its analytical approach

and uses data and information on 21 countries from the Program for International Student

Assessment (PISA) and the Teaching and Learning International Survey (TALIS).

Findings show that the developmental approaches to teacher evaluation in the form of

an evaluative focus in principals’ pedagogical role, which includes classroom observations,

suggesting improvements to teachers, and informing teachers about possibilities for updating

their knowledge and skills, do not associate significantly with student achievement in any of

the three subjects. Schools’ use of student data for instructional improvement also does not

associate significantly with student achievement in any of the three subjects. Monitoring of

teachers using student achievement and principal and staff observations relates positively to

student achievement in reading. The study finds mixed results for high-stakes approaches to

teacher evaluation. Public accountability shows a positive relationship with student

achievement in all three subjects. However, the use of student assessments for teacher

evaluation and for judging teacher effectiveness does not relate significantly to student

achievement in mathematics. In reading and science, such uses of student assessments associate negatively

with student achievement. The tracking of student assessments by an administrative authority

shows a negative but insignificant relationship with student achievement in mathematics

and reading, and an insignificant positive relationship in science.

The evidence in this study only confirms the complexity of teacher monitoring and

evaluation practices and purposes while exploring their potential in raising student

achievement in schools. The study suggests that the use of student assessments as

evidence of teacher performance should be avoided, especially in high-stakes approaches to

teacher evaluation. The study further suggests that the right mix of developmental and high-

stakes approaches and purposes to monitoring and evaluating teachers should be driven by

evidence obtained through rigorous research in indigenous settings.

Table of Contents

LIST OF TABLES

ACKNOWLEDGEMENTS

Chapter 1. INTRODUCTION

Statement of Purpose

Significance of the Study

Research Questions

Teacher Evaluation: Unpacking the Constructs

Evaluation and assessment.

Evaluation and monitoring.

Evaluation and supervision.

Evaluation and accountability.

Teacher Evaluation: Purposes, Approaches, and Outcomes

Instruments and Evaluators

Student achievement.

Teacher peer reviews.

Classroom observations.

Evaluators.

Chapter 2. LITERATURE REVIEW, CONCEPTUAL FRAMEWORK, AND RESEARCH HYPOTHESES

Teacher Monitoring and Evaluation in Cross-National Perspectives

Teacher evaluation as covered in the OECD project 2002-04.

Teacher evaluation: Findings from the PISA 2009.

Teacher evaluation: Findings from the TALIS 2008.

Teacher Evaluation: Empirical Evidence

Developmental teacher evaluation and student achievement.

High-stakes teacher evaluation and student achievement.

Interactions and student achievement.

Conceptual Framework and Research Hypotheses

Chapter 3. DATA AND METHODS

Datasets

Sample and Sampling Strategy

Variables and Missing Data Management

Developmental.

High-stakes.

Interactions

Control variables at student, school, and country levels

Missing data management.

Descriptive statistics.

Data Reduction

Methods

Chapter 4. RESULTS AND ANALYSES

Determinants of Student Achievement in Mathematics

Developmental and high-stakes approaches to teacher evaluation

Control variables in models 2 and 3 in mathematics

Determinants of Student Achievement in Science

Developmental and high-stakes approaches to teacher evaluation

Control variables in models 2 and 3 in science

Determinants of Student Achievement in Reading

Developmental and high-stakes approaches to teacher evaluation

Control variables in models 2 and 3 in reading

Chapter 5. DISCUSSION, IMPLICATIONS, AND CONCLUSIONS

Developmental Approaches to Teacher Evaluation

Monitoring in test language.

Principals’ pedagogical role.

Use of student assessment for instructional improvement.

High-Stakes Approaches to Teacher Evaluation

Public accountability.

Use of student assessments to evaluate and judge teachers, and administrative tracking.

Interactions

Teacher Evaluation: Country Variables

Policy Implications and Recommendations

Limitations of the Study

Recommendations for Further Research

Conclusions

References

Appendix A: Teacher Evaluations in Public Schools (2002)

Appendix B: How School Systems use Student Assessments

Appendix C: Criteria for Teacher Appraisal and Feedback (2007-08)

Appendix D: Impact of Teacher Appraisal and Feedback upon Teaching (2007-08)

Appendix E: Outcomes of Teacher Appraisal and Feedback (2007-08)

Appendix F: Variable Definitions and Measurements

Appendix G: Principal Component Analysis of Criteria for Teacher Appraisal and Feedback

Appendix H: Principal Component Analysis of Outcomes and Impacts of Teacher Appraisal and Feedback

LIST OF TABLES

Table 3.1: Countries and Cases

Table 3.2: Descriptive Statistics for Main and Control Variables

Table 3.3: Frequencies and Percentages of Main Categorical Variables

Table 3.4: Correlations among Main Predictors

Table 4.1: Determinants of Student Achievement in Mathematics

Table 4.2: Determinants of Student Achievement in Science

Table 4.3: Determinants of Student Achievement in Reading

Table G1: Principal Component Analysis of Criteria for Teacher Appraisal and Feedback

Table G2: Promax Rotated Component Loadings of Criteria for Teacher Appraisal and Feedback

Table G3: Scoring Coefficients for Components on Criteria for Teacher Appraisal and Feedback

Table H1: Principal Component Analysis of Outcomes and Impacts of Teacher Appraisal and Feedback

Table H2: Promax Rotated Component Loadings of Outcomes and Impacts of Teacher Appraisal and Feedback

Table H3: Scoring Coefficients for Component on Outcomes and Impacts of Teacher Appraisal and Feedback

ACKNOWLEDGEMENTS

Here comes the successful touchdown on an important milestone in my life—Doctor

of Philosophy! I owe this milestone to the support and inspiration from a variety of sources. I

will venture here to account with all my humility and gratitude the contributions from the

many individuals and a host of institutions that enabled me to make this successful

touchdown.

It is to the late Senator James William Fulbright, the founder of the prestigious

Fulbright scholarship program, that I owe my success in the first place. Senator

Fulbright, I am truly grateful to you, for you made it possible for a person of modest

background to complete an advanced degree in a world-class academic setting. I must revere

the opportunity of a lifetime provided by the people of the United States through the

Fulbright program and associated agencies in the United States and Pakistan. I acknowledge

the opportunity provided to me by the Bureau of Educational and Cultural Affairs, United

States Department of State through its flagship Fulbright Program for Foreign Students to

study for the degree of Doctor of Philosophy at the Pennsylvania State University. I am

grateful to the Institute of International Education (IIE), and the United States Educational

Foundation in Pakistan (USEFP) who administer the Fulbright scholarship program in

Pakistan. With regard to the successful completion of my dissertation, it is vital that I

appreciate the help from the OECD. I was able to write this dissertation by using two

important sources of information, the Program for International Student Assessment (PISA),

and the Teaching and Learning International Survey (TALIS), from the OECD. Thank you

indeed OECD!

My candid thanks and appreciation go to the faculty and staff at the College of

Education, Pennsylvania State University for their compassionate mentoring, and academic

and administrative support throughout. My reverence and utmost gratitude go to my research

advisors and supervisors Professors Liang Zhang, and Mindy L. Kornhaber. Your caring and

professional support provided the necessary spur and guidance, thereby enabling me to

navigate in a smooth fashion all along, leading to the completion of all the requirements for

the degree of Doctor of Philosophy. Any of my future graduate student colleagues who will

have the opportunity to work with you will testify to the fact that you are some awesome

advisors at the College of Education! My gratitude and thanks to you Professors Hoi K. Suen

and Soo-yong Byun, my doctoral dissertation committee members, for your reassurances and

support in my research pursuits as related to my dissertation; your critical and rigorous

feedback raised my research to higher levels of intellectual rigor.

My gratitude is also owed to Dr. Jan-e-Alam Khaki (my Master of Education research

supervisor at the Aga Khan University, Institute for Educational Development [AKU-IED])

who sanguinely kept pushing me to pursue further studies in a foreign setting. Thank you Dr.

Khaki for your optimism in my capacity to take on such a rigorous task in life. Thank you

also for providing the reference letters as and when needed all along in my pursuits for a

doctoral program. You were always available with your candid assessment of my abilities

through these reference letters. Speaking of reference letters, I am also grateful to Dr. John

Retallick (my ex-teacher at the AKU-IED), Ms. Khadija Khan, General Manager, AKES, P

in Gilgit-Baltistan, and Mr. Jan Madad, Ex-GM AKES, P in Gilgit-Baltistan for your time

and effort in writing and submitting reference letters in support of my applications for the

Fulbright scholarship and elsewhere. I am also indebted to the senior management at the Aga

Khan Education Service, Pakistan for their approval of a study leave. I owe so much of my

professional career to the AKES, P!

My appreciation and thanks go to all my colleagues and friends here at Penn State for

your company, support, and guidance. Thank you Mehnaz Jehan for your generous support in

taking on some of the first shocks of settling during my initial days and months in State

College. Thank you Jessica Irene Ouédraogo and Cyrille Ouédraogo for your support

whenever I requested. Armend Tahirsylaj, thank you for allowing me to use some of your

computing resources. Haram Jeon, Kristina Brezicha, Pablo Fraser, Saki Ikoma, Sunny

Madahar, and Will Smith, thank you all for your critical feedback on my presentation for the

dissertation defense. Your feedback significantly contributed to making my defense meeting

a success. Thank you Adrienne Henck, Saki Ikoma, Steve Kotok, and Tian Fu for your

suggestions whenever I requested, especially in Facebook chats!

Mr. Ibrahim Shah (Ex-Mukhi Central Jamatkhana, Gilgit, Pakistan), thank you for

providing your unconditional financial guarantee, an important requirement from IIE for my

wife’s travel to and stay in the United States. In the same vein, I am truly grateful to my dear

friend Ghulam Muhammad Shah who provided financial guarantees for my wife’s travel to

and stay in the US. Thank you all my dear friends, especially Ghulam Muhammad Shah, Piar

Karim, Iqbal Barcha, Shams-ul-Haq and many others who provided moral support in times

when I felt most burdened by the challenges of my studies here at Penn State.

My dear brother Shukrullah Baig, and sisters Yasmin Bano, Nasreen Akhtar, Tahira

Parveen, Bibi Salimah, Murad Begum, Waqar-un-Nissa, Meher-un-Nissa, Razia Sultana,

Rohila Aman, and all my loving and lovable nieces and nephews and other members in the

family (all the in-laws and cousins), I love you all and thank you for your sustained support,

prayers, and good wishes for my success. You all are my strength and my finest hope! I

thank you grandfather, Mustajab Shah (late), for your far-sighted vision for your family.

Although I never saw you in person, I believe your decision to migrate to Gilgit from Hunza

has been a significant contribution to what I am today. Your vision has placed me in a

position where I can play my due role in this world efficiently and effectively. I pray for your

salvation and peace. My dear mother and father, Khair-Un-Nissa and Abdullah Khan, I love

you both very much and respect and revere your prayers, the sacrifices, and the pains that

you endured all along to nurture and give comfort to your children. You are one of those

special parents who, though not literate themselves, aspire and struggle to make their

children literate and responsible citizens of this world. I will always be in need of your

prayers and good wishes to be able to live a meaningful life all along.

Even with such tremendous support from the many individuals, my extended family,

and many sources, it would have been clearly beyond my reach to achieve this important

milestone in my life had it not been for one critical source of love and care—my Bibi! I thank

you for your support, encouragement, prods, the uplifting smiles, and for being a great source

of energy and hope to keep me afloat amid the most exacting phases in my life!

TO,

all those teachers who, under the most difficult of circumstances, lift all children in

their classes to new levels of learning, hope, and success and who do so without regard

to the incentives and penalties

Chapter 1. INTRODUCTION

The primary goal of schools is to improve student achievement for all students.

Schools endeavor to achieve this goal by identifying and improving factors that are

significant in relation to student achievement. Evidence shows that teacher quality plays a

critical role in improving student achievement in schools (Barber & Mourshed, 2007;

Borman & Kimball, 2005; Hanushek, 1992, 2003; Hanushek, Kain, O’Brien, & Rivkin, 2005;

Organization for Economic Cooperation and Development [OECD], 2009a; Rivkin,

Hanushek, & Kain, 2005; Rockoff, 2004; Wright, Horn, & Sanders, 1997). Therefore,

teacher quality has become a driving theme worldwide in educational policy development

and analysis. One way schools can improve the quality of their teachers is by evaluating

them so as to identify their strengths and weaknesses, develop them professionally, and make

them accountable for their practice (Isoré, 2009; McGreal, 1988; Nolan & Hoover, 2008).

Scholars and policymakers (e.g., Ribas, 2005; Taylor & Tyler, 2011; Toch, 2008)

believe that teacher evaluation is one of the significant approaches to enhance the quality of

education for all students. This belief in the efficacy of evaluating teachers for their

improvement coupled with a push for teacher accountability from various stakeholders has

thrown teacher evaluation into the spotlight of policy-making and practice in recent decades

(Donaldson, 2009; Isoré, 2009; OECD, 2010a; Wößmann, Lüdemann, Schütz, & West,

2007). It is in this context that this dissertation probes the relationships between student

achievement and teacher monitoring and evaluation in 21 countries.

Statement of Purpose

As stated above, the quality of teachers is of paramount significance with regard to

improving student achievement (Barber & Mourshed, 2007; Borman & Kimball, 2005).

This means that if teachers are one of the most significant determinants in schooling,

enhancing their impact on student achievement becomes a relevant educational and

scholarly pursuit. In the same vein, teacher evaluation as a strategy to enhance teacher

impact on student achievement lends itself to scholarly scrutiny. In other words, how

teacher evaluation practices and purposes correlate with student achievement becomes a

legitimate concern and area of interest for the larger public, parents, legislators, researchers,

and policymakers. It is in this regard that this study explores teacher evaluation practices in

select Organization for Economic Cooperation and Development (OECD) and non-OECD

countries with the intent to identify nuances of practices and purposes of assessing teachers

and how these practices and purposes relate to student achievement as reflected in student

test scores in mathematics, science, and reading.1

Teacher evaluation, which is synonymous with teacher appraisal (the two terms will

be used interchangeably throughout this study), can be construed as performance reviews

conducted in schools by personnel such as a principal, administrator, supervisor, senior staff

member, or a person authorized as an evaluator by an external agency such as a ministry of education.

“The results of appraisals may be used formatively to identify specific needs for

professional development, or summatively for decisions related to promotion, rewards or

sanctions” (Looney, 2011, p. 442).

1 In the context of this study, student achievement as a dependent variable should be

construed as student test scores in the Program for International Student Assessment (PISA)

in the three subjects of mathematics, science, and reading.

Within schools, principals and peers play a significant role in teacher evaluation.

They evaluate teachers using instruments such as classroom observations and student

achievement, including student test scores, then give feedback and arrange reflective

sessions to deliberate on the successes or failures of observed lessons and lesson plans.

An improvement strategy is prepared accordingly. An external evaluator may also

conduct teacher evaluation using a variety of tools and means such as student test scores and

classroom observations. This type of evaluation has mostly an “accountability” focus

(Looney, 2011).

Thus, teacher monitoring and evaluation has two broad purposes: 1) developmental

purposes to develop teachers professionally, and 2) high-stakes purposes to make teachers

accountable for the quality of their practice (Danielson & McGreal, 2000). This study

explores the relationships between these two distinct but overlapping approaches of teacher

evaluation and student achievement as reflected in the Program for International Student

Assessment (PISA) test scores in mathematics, science, and reading in lower secondary and

secondary schools in 21 countries. These countries are Australia, Austria, Belgium (Fl.),

Brazil, Bulgaria, Denmark, Estonia, Hungary, Iceland, Ireland, Italy, Korea, Lithuania,

Mexico, Norway, Poland, Portugal, Slovak Republic, Slovenia, Spain, and Turkey. As will

be described in the section on sample in chapter 3, these countries make up the bulk of the

sample in the Teaching and Learning International Survey (TALIS) with three non-OECD

and 18 OECD countries. The study takes stock of principals’ classroom observation

practices as part of their pedagogical roles and responsibilities as well as formal and

informal approaches of monitoring and evaluating teachers in schools. It focuses on other

means of monitoring and evaluating teachers such as through peer reviews, public

accountability and recognition, and by using student assessments and achievement. The

study also explores the consequences of teacher evaluation practices

for teachers and how these consequences relate to student achievement.

Significance of the Study

This study is significant for three reasons. First, it adds to the evolving

understanding of the factors that are critical in affecting student achievement in important

ways. The enormity of the task to establish all causal factors notwithstanding, significant

efforts have been made in identifying key factors associated with student achievement in

schools. Various studies have explored student achievement using predictors related to

individual students, their home and family backgrounds, and schools. Be it quality of

educational resources (Demir, Ünal, & Kılıç, 2010), student, family, and school

characteristics (Beese & Liang, 2010; Fuchs & Wößmann, 2007; Wößmann, 2003), or

immigration status of students (Zhang & Lee, 2011), researchers have uncovered important

dynamics that undergird student achievement in key subject areas of science, mathematics,

and reading. Among the plethora of factors, teacher monitoring and evaluation with

different purposes and approaches has been found to relate to and/or affect student

achievement in significant ways. Specific studies (e.g., Holtzapple, 2003; Milanowski,

2004; Taylor & Tyler, 2011; Schütz, West, & Wößmann, 2007; Wößmann et al., 2007) have

found correlations and causal connections among different aspects of teacher evaluation and

student achievement. Previous studies (e.g., Schütz, West, & Wößmann, 2007; Wößmann et

al., 2007) have used older PISA datasets and have explored teacher monitoring and

evaluation from an accountability perspective. This dissertation adds to the body of research

on teacher monitoring and evaluation practices using the latest PISA dataset available in the

public domain. It focuses on both the developmental and high-stakes approaches to teacher

monitoring and evaluation by operationalizing key constructs of the process in the light of

relevant theoretical and empirical literature.

Second, the study is unique in one key aspect. It uses PISA in combination with

information from the TALIS 2008 published by the OECD. The study uses secondary

findings as country variables from the TALIS 2008 as reported by the OECD in

combination with student and school level variables from the PISA 2009. The combination

of the two surveys generates a rigorous dataset that takes into account perspectives from

both the principals and teachers on teacher monitoring and evaluation practices in the

sample countries. More on this combination of the two surveys is discussed in the section on

datasets in the methods chapter.
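
As a rough illustration of how country-level survey findings can be attached to student-level records before fitting a least-squares model, the sketch below may help. It is not the procedure used in this study: the country codes, the `appraisal_index` variable, the scores, and the helper names `attach_country_vars` and `ols_simple` are all invented for illustration, and the one-predictor fit is far simpler than a model with student, school, and country controls.

```python
# Toy illustration only: every name and number here is hypothetical and
# does not come from PISA 2009 or TALIS 2008.

# Country-level variable (TALIS-style): one value per country.
country_vars = {
    "A": {"appraisal_index": 0.2},
    "B": {"appraisal_index": 0.5},
    "C": {"appraisal_index": 0.8},
}

# Student-level records (PISA-style): one row per student.
students = [
    {"country": "A", "math_score": 480.0},
    {"country": "A", "math_score": 500.0},
    {"country": "B", "math_score": 510.0},
    {"country": "B", "math_score": 530.0},
    {"country": "C", "math_score": 540.0},
    {"country": "C", "math_score": 560.0},
]


def attach_country_vars(rows, by_country):
    """Copy each country's variables onto every student row from that country."""
    return [{**row, **by_country[row["country"]]} for row in rows]


def ols_simple(x, y):
    """Return (intercept, slope) of a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


merged = attach_country_vars(students, country_vars)
intercept, slope = ols_simple(
    [r["appraisal_index"] for r in merged],
    [r["math_score"] for r in merged],
)
print(f"intercept={intercept:.1f}, slope={slope:.1f}")
```

In this toy setup every student in a country receives the same country-level value, so the country variable varies only between countries, not within them.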

Third, the study is significant because teacher evaluation and accountability are

gaining momentum in schools around the world as means to promote educational

excellence. With this push for evaluation and accountability, differences have surfaced

where key stakeholders such as teachers, policymakers, and administrators, though agreeing

that teachers should be evaluated in schools, are at odds with each other over

how best to do so in ways that can garner optimal student achievement.2 Thus, in the

atmosphere of current debates on and endeavors to improve teacher monitoring and

evaluation systems, identifying best practices to effectively monitor and evaluate teachers

has been a key concern for countries around the world (Isoré, 2009; OECD, 2010a). The

evidence gathered through this study provides additional insights to inform such debates

where key stakeholders are engaged in designing the best alternatives for monitoring and

evaluating teachers in their respective contexts.

2 Teachers protesting over bargaining issues in contracts in Chicago in 2012

(http://www.chicagotribune.com/news/local/breaking/chi-strike-updates-pickets-up-as-more-talks-scheduled-20120910,0,3326359,full.story), and educators boycotting

standardized testing in Seattle (http://www.fairtest.org/seattle-teachers-boycott-tests) in

early 2013 are an illustration of that difference.

Research Questions

This study explores the relationships between student achievement and teacher

monitoring and evaluation practices and purposes in 21 countries. The study specifically

attempts to answer the research question: How do teacher monitoring and evaluation

practices and purposes associate with student achievement in mathematics, science, and

reading in lower secondary and secondary schools? In particular, the study focuses on the

following three sub-questions:

RQ1: What is the relationship between the developmental approaches to teacher

monitoring and evaluation and student achievement?

RQ2: What is the relationship between the high-stakes approaches to teacher

evaluation and student achievement?

RQ3: How do teacher evaluation approaches interact with the other aspects of

schooling in relation to student achievement?

Teacher Evaluation: Unpacking the Constructs

Teacher evaluation is an eclectic term entailing a number of constructs, concepts and

approaches. The term “evaluation,” like many other value-laden constructs, is subject to various misperceptions that stem from a host of seemingly synonymous but often different concepts and processes. For example, processes that may be confused with evaluation include “assessment,” “supervision,” “accountability,” and “monitoring.” However, these terms differ in scope and focus. Inconsistencies in their use and understanding arise because the underlying concepts and processes share many similarities but do not necessarily lead to similar outcomes. It is therefore useful to explain here the distinctions and similarities among evaluation, assessment, monitoring, supervision, and accountability.

Evaluation and assessment. Evaluation and assessment are related but different

processes with different purposes. Both processes involve elements of measurement.

However, “Assessment involves merely the measurement of an input, process, or outcome”

(Carlson & Park, 1976, p. 6). Evaluation also involves measurement of an input, a process, or an outcome, but it goes beyond measurement to a value judgment of how well and to what extent the input, process, or outcome has achieved its anticipated objective. In other

words, evaluation leads to an action that is intended “…to maintain, change, increase, or

decrease a behavior…” (Carlson & Park, 1976, p. 6). Evaluation leads to a change in the

elements of inputs, processes, outcomes, or a combination of these so as to create optimal

conditions where the desired behavior or output is maximized.

Evaluation and monitoring. Evaluation and monitoring, like evaluation and assessment, are two related and overlapping processes in an organization. Monitoring, almost invariably, accompanies evaluation and has a largely developmental purpose. Monitoring is the ongoing analysis of a process in relation to set goals and objectives. The United Nations Development Program (UNDP) defines monitoring “…as the ongoing process by which stakeholders obtain regular feedback on the progress being made towards achieving their goals and objectives” (UNDP, 2009, p. 8). Thus, monitoring involves tracking progress and developing strategies to create the optimum momentum to achieve the best results around set goals and objectives (UNDP, 2009). In other words, monitoring is an ongoing process whereby data are systematically collected and analyzed so that necessary adjustments can be made along the way (Development Assistance Committee [DAC], n.d.). Evaluations, on the other hand, are periodic reviews (mostly mid-term or end of term) and analyses of how effectively a process or program has achieved its intended objectives. Evaluations are followed by significant adjustments based on their outcomes. It should be noted that evaluations make significant use of the data and findings from monitoring activities. In sum, monitoring, like evaluation, involves decision-making, albeit in an ongoing and developmental fashion. In this sense, monitoring is a “developmental” activity, and it has been treated accordingly in this study.

Evaluation and supervision. These two processes can be considered in terms of the management of personnel by a principal in a formal organization such as a school. Supervision broadly entails administration of a unit of organizational activity where the purpose is to ensure that supervisees’ behavior conforms to organizational goals, standards, and procedures. At the same time, more than just overseeing a unit of organizational activity, supervision entails “…cheer-leading, facilitating, and problem solving” (Saphier, 1993, p. 9). Evaluation is an added responsibility of the “cheer-leader,” who not only oversees and monitors but also makes decisions on the efficacy of the behavior and sometimes remediates and dismisses if need be (Saphier, 1993). Observing classes, giving feedback to teachers, helping teachers grow professionally, and making decisions on staffing and other administrative matters are some of the approaches through which principals carry out their role as internal evaluators in schools. It is with these theoretical underpinnings that this study explores principals’ evaluative focus in their pedagogical roles as a category under the “developmental” approaches to monitoring and evaluating teachers.

Evaluation and accountability. As defined above, evaluation is a judgment or

valuation of an input, a process, or an outcome. Accountability involves the additional step

of informing relevant stakeholders on the efficacy of an intended outcome. Accountability

aims at holding answerable those who are responsible for the outcome. Bovens (2005)

counts accountability as an obligation in a social setting wherein one actor is responsible for

his/her conduct in relation to another through a binding contract. In this sense,

accountability connotes answerability of one stakeholder (a group of actors and/or the whole

organization) to another with direct consequences in the process (Levitt, Janta, & Wegrich,

2008). Thus, accountability results in action that carries positive or negative consequences depending on the behavior of the actor(s) involved (Levin, 1974; Levitt et al., 2008). These

consequences are often high-stakes in nature where one’s services, remuneration, and

professional image are on the line.

Teacher Evaluation: Purposes, Approaches, and Outcomes

Teacher evaluation, in plain terms, means measuring and judging teacher effectiveness and taking steps to maximize the positive effects of teachers and teaching on student learning. Broadly speaking, teacher evaluation has two main purposes: formative or developmental purposes and high-stakes or accountability purposes (Danielson & McGreal, 2000; Haefele, 1993). The high-stakes purposes of evaluation have the intended

objective of holding teachers answerable for the quality of their professional practice

(Haefele, 1993; Isoré, 2009). This focus of evaluation is also concerned with critical

decisions on a person’s employability, career advancement or, in extreme cases, relieving

someone of his/her services for a lack of needed competencies (Scriven, 1981).

In contrast, the developmental purposes of teacher evaluation, including monitoring

as explained above, aim to identify professional training needs of the evaluated teachers so

as to improve their practice (Haefele, 1993; Latham & Wexley, 1982). Such professional

development aspects may include:

…regular feedback by the principal and experienced…to identify priorities for both

teacher and school improvement. Results from this kind of teachers' assessment can

be used to identify teaching needs and contribute to the definition of the school plan

in order to improve the teaching process within the school. (Faubert, 2009, p. 29)

It needs to be noted that while teacher evaluation with a developmental focus has its

ultimate purpose as improving instructional practice of teachers, schools may use insights

gained through such evaluations for high-stakes decisions as well (Isoré, 2009). Also,

schools may institute a developmental evaluation system to ensure proper implementation of

a school’s policies as regards instructional objectives, such as attaining the best results in standardized tests by making teachers teach aspects of the curriculum that can promote higher scores for students. Thus, the two purposes of evaluation may not always be cut-and-dried, working in isolation. Both may interact in complex ways with each other and with the

other aspects of schooling depending upon the goals of a particular school and the overall

policy-environment at the local, regional, national, and even international levels.

Instruments and Evaluators

Schools evaluate teachers using a variety of instruments, evaluators, and approaches.

Instruments may consist of classroom observations with simple to complex checklists and

rubrics, teacher portfolios, peer reviews, teacher tests and interviews, student achievement,

and questionnaires and surveys (Isoré, 2009). A discussion of the whole range of evaluation instruments and approaches would be too extensive and is beyond the scope of this study, which limits itself to the evaluation instruments covered in the PISA 2009 survey. Therefore, I include here only a discussion of classroom

observations, peer reviews, and student achievement data as used by different evaluators as

measures of teacher monitoring and evaluation. It needs to be noted that this description is

not a critique of these instruments or evaluators. It is, rather, an attempt to explain what

these instruments and evaluators are and how they are used in schools for teacher evaluation

purposes.

Student achievement. As the name suggests, student performance in various types of assessments (internal and external, standardized or unstandardized) provides a convenient form of evidence for assessing the value that teachers add to student learning. Student achievement data can be described in a variety of ways, such as averages, percentages, subject means, class means, and overall school means (Peterson, 2000). One use of achievement data in evaluation is through Value Added Models (VAMs), which claim to tease out individual teacher contributions to a student’s learning by clearing out the noise in the data after controlling for the student’s previous background and various other teacher and school characteristics (Stronge &

Tucker, 2000). Other less sophisticated uses of student achievement may be in the form of

averages and percentages at the subject, classroom, school, regional, and national levels.
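
To make the idea of a value-added specification concrete, one generic textbook form is sketched below. It is offered only as an illustration under stated assumptions and is not the model used in this study or in any source cited above. The notation is hypothetical: A_{ijt} is the achievement of student i taught by teacher j in year t, A_{ij,t-1} is that student's prior achievement, x_{it} and s_{jt} are vectors of student background and school or classroom characteristics, tau_j is the teacher's estimated contribution, and epsilon_{ijt} is the residual.

```latex
A_{ijt} = \beta_0 + \beta_1 A_{ij,t-1}
        + \mathbf{x}_{it}^{\top}\boldsymbol{\gamma}
        + \mathbf{s}_{jt}^{\top}\boldsymbol{\delta}
        + \tau_j + \varepsilon_{ijt}
```

In a specification of this kind, the quantity of interest for teacher evaluation is tau_j, the part of achievement attributed to the teacher once prior achievement and the other controls are accounted for.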

Teacher peer reviews. This category consists of the assessments by subject

colleagues who may or may not work in the same school, may observe classes, give

feedback, and have review and reflective sessions with teachers so as to offer suggestions

for improvement (Looney, 2011). This may also consist of review of materials “…in which

teachers…examine and report on instructional materials, classroom artifacts, and student

work assembled by a teacher” (Peterson, 2000, p. 94). Peer reviews can be used for

developmental purposes or as adjuncts to the formal evaluations for high-stakes purposes

(Looney, 2011).

Classroom observations. Peterson (2000) calls classroom observations

“systematic observations” where the purpose is to document the instructional processes in

classrooms which can then be turned into “…numerical summaries of distributions,

graphical displays, and prose descriptions” (p. 96). Highlighting the developmental utility of

classroom observations, Evertson and Holley (1981) note that classroom observations “help

in understanding and ultimately in improving instruction…” (p. 90) by providing the

opportunity to observe the interactions between teachers and students that are significant in

determining what goes into student learning. Classroom observations also allow evaluators to see whether “…the teacher adopts adequate practices in his more usual workplace: the classroom” (United Nations Educational, Scientific and Cultural Organization [UNESCO], 2007, cited in Isoré, 2009). In terms of its prevalence, Isoré (2009) shows in her review of literature that

classroom observation is the most used source of evidence in teacher evaluation across

OECD countries. Likewise, for its ubiquity in schools, Danielson and McGreal (2000) liken

classroom observation to teacher evaluation and count it as “…the best, and the only, setting

in which to witness essential aspects of teaching—for example, the interaction between

teacher and students and among students” (p. 47).

Evaluators. Like the evaluation instruments, there are numerous evaluators that

carry out the function of evaluations in schools. For the purposes of this study, two forms of

evaluations are significant: internal and external. Internal evaluations, which Isoré (2009) also calls internal reviews in OECD contexts, are mostly carried out by principals or other senior personnel (senior teachers or other administrators) within schools (Peterson, 2000). In most OECD countries, internal evaluations are carried out by the principal or senior staff (Isoré, 2009). External evaluations or external reviews, in contrast, are carried out

by personnel from outside the school who may come from other schools or an education

agency external to the school. These external evaluators may exclusively be “external” or

may also include school principals as part of the panel depending upon the country and its

policies (Isoré, 2009).

This dissertation is divided into five chapters. Chapter 2 gives a detailed review of

literature that essentially delineates teacher evaluation practices in different countries (with a

special focus on OECD countries) and synthesizes empirical evidence on the relationships

between teacher evaluation and student achievement. The chapter closes with a description

of the theoretical framework and hypotheses of the study. Chapter 3 describes the methods and the datasets used in the study, including data management, processing, and analysis. Chapter 4 presents the results and findings of the study. The last chapter consists of a discussion of the major findings of the study. The chapter ends

with a discussion of the limitations of the study, policy implications, and recommendations

for future research.

Chapter 2. LITERATURE REVIEW, CONCEPTUAL FRAMEWORK, AND

RESEARCH HYPOTHESES

This chapter is divided into three sections. Section 1 lays out an outline of teacher evaluation in OECD and non-OECD countries. Since a significant portion of the sample of this study consists of OECD countries, it is appropriate to describe the teacher evaluation scenario as captured in the various reports from the OECD. Section 2 describes and explains empirical evidence on the relationships between student achievement and teacher evaluation practices and purposes. Building on prior evidence, Section 3 presents the conceptual framework and research hypotheses of the study.

Teacher Monitoring and Evaluation in Cross-National Perspectives

Various studies and reports from OECD show that there is variation both within and

among countries as regards teacher monitoring and evaluation practices (OECD, 2005,

2009a, 2009b, 2010a). This variation can be seen in the purposes and practices of teacher

evaluations (OECD, 2010a). The variation is also marked by a shift in several countries

towards teacher evaluations that have a predominant focus on teacher development and in-service training (Faubert, 2009; OECD, 2005).

With regard to variations across countries, teachers are sometimes held accountable as teams (e.g., in Scotland and Sweden) and sometimes as individuals to incentivize them (e.g., in Hungary), with pupil achievement used as evidence of teacher performance in internal and/or external evaluations of teachers (Faubert, 2009). Finland, like Greece and Israel, does not have a state-mandated evaluation system, which gives a greater degree of autonomy to the principal, who is solely responsible for school affairs including teacher monitoring and evaluation (UNESCO, 2007, cited in Isoré, 2009). The United States, on the other hand, has a variety of internal and external teacher evaluation practices such as the

National Board for Professional Teaching Standards (NBPTS) certification, and Praxis III

examinations (OECD, 2009a). These practices include developmental approaches such as

classroom observations, teacher portfolio reviews, teacher interviews, and assessment of

content and pedagogical knowledge. These evaluations also have high-stakes aims such as

to judge teachers for their eligibility for tenure or certification (OECD, 2009a). Examples of

such evaluations can be found in states like North Carolina, Connecticut, and California

(Larsen, 2005). In Chicago, principals use observation checklists to rate teachers’ performance and to identify strengths as well as areas for improvement, with an end-of-year rating of teacher performance (Sartain et al., 2011).

We find similar approaches to evaluation in specific regions in Canada, England, and

Australia. In Ontario, classroom teachers who are experienced are evaluated using

descriptors of teaching skills, content knowledge, and requisite attitudes towards teaching.

These evaluations normally consist of classroom observations by principals and discussion

sessions before and after the classroom observations (Larsen, 2005). In addition to the

classroom observations, other sources of evidence on teacher performance such as lesson plans, student records, self-assessment reports, and parental and student surveys make up the full teacher evaluation package (Larsen, 2005).

Teacher evaluation as covered in the OECD project 2002-04. The OECD conducted a study in 2002-04 to give country backgrounds on various educational policies including teacher evaluation.3 Appendix A gives a summary of the evaluation scene in the 26 countries involved in the project. The findings show that, as of 2002-04, the OECD countries employed a range of teacher evaluation systems with varied criteria, tools, purposes, and consequences (OECD, 2005). According to the OECD report on the project, around half of the countries had “…periodic evaluation as part of their regular work” (OECD, 2005, p. 188). Six of the twenty-six countries had no developmental focus in teacher evaluations. Nine countries had active teacher evaluation systems linked to teachers’ professional development. The remaining countries had varied responses depending upon the type of evaluation and incentive system as well as the location within the country. For example, in Chile, three of the four evaluation systems had professional development as one of their purposes. The report also identifies Chile as a more progressive member in the list of countries in implementing a variety of teacher evaluation systems having both developmental and high-stakes purposes. In the United States, compulsory training was observed as a general trend in evaluation schemes. The countries that linked teacher evaluation to professional development had some consequences for ineffective teachers. These consequences included implementation of an improvement plan and deferral of promotion or loss of salary. In countries like Austria, Canada (Quebec), Denmark, Finland, Germany, Greece, Israel, Italy, and Spain, teacher evaluations were mostly conducted for non-tenured teachers. Ireland, Norway, and Sweden were characterized by school evaluation more than teacher evaluation. Hungary left most of the responsibility for teacher evaluation with the school principal, while Mexico had voluntary teacher evaluations (OECD, 2005).

3 Contents in this section are adapted from OECD (2005).

Teacher evaluation: Findings from the PISA 2009. The PISA 2009 survey

included a number of items related to evaluations in schools.4 While most of these items

sought information on the uses of student assessment and achievement data from an overall

school perspective, a few items specifically asked principals about using student assessment

and achievement data for instructional purposes, to monitor and evaluate teachers, and for

accountability purposes. Responses were recorded on a range of options such as informing

parents about their children’s progress, identifying areas for improvement in the curriculum

or teaching methods, and judging teacher effectiveness (in test language).5 The survey also asked principals about their management roles, including whether and how often they observed teachers in the classroom, whether they gave teachers suggestions for professional improvement, and whether they informed teachers of opportunities for updating their knowledge and skills. All these aspects

of principals’ role carry the elements of internal school evaluations with a developmental

purpose attached to the process.

4 Contents in this section are adapted from OECD (2010a).

5 According to PISA standards for language of testing, “The PISA test is administered to a student in a language of instruction provided by the sampled school to that sampled student in the major domain (Reading) of the test” (PISA, 2012, p. 370). Therefore, in the remainder of this dissertation, language of instruction in reading is referred to as “test language” to keep the term consistent with the PISA.

The report shows that countries varied greatly in the purposes for which student assessment and achievement data were used. Items related directly to teacher evaluation and accountability included use of achievement data for the purposes of monitoring teachers and judging their effectiveness. According to OECD (2010a), on average, 59% of students across OECD countries studied in schools that used student achievement to monitor teachers. Countries like Poland, Israel, the United Kingdom, Turkey, Mexico, Austria, and the United States reported having 80% of students attending such schools. A number of countries used student achievement data in combination with internal assessments by principals, peers, senior staff, and/or external evaluators. Finland had far fewer internal assessments and observations of teachers, and external evaluation was almost non-existent (only 2% of students studied in schools having external evaluations).

A second item on the developmental uses of student achievement data included

identifying aspects of instruction or curriculum for improvement purposes. Though this item

did not specifically ask principals if the use was for improving teachers’ instructional practice, it can be inferred that, “instruction” being the main job of teachers, the item covered aspects related to the professional development of teachers. The report showed that, on average across OECD countries, 77% of students were enrolled in schools using this practice. In New Zealand, the United States, the United Kingdom, Iceland, and many other countries, prevalence was much higher: more than 90% of students were enrolled in schools that used this

practice. Greece and Switzerland had less prevalence of this practice.
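
The percentages reported in this section are shares of students, not shares of schools, and the two can differ considerably when schools vary in size. The sketch below illustrates the distinction with hypothetical figures. It assumes, purely for simplicity, that enrollment counts stand in for the survey weights that PISA estimates actually rely on; the school list and all numbers are made up.

```python
# Hypothetical schools as (enrollment, uses_practice). All values are made up.
schools = [
    (100, False),
    (900, True),
    (500, True),
    (500, False),
]


def share_of_schools(rows):
    """Percentage of schools that report using the practice."""
    return 100.0 * sum(1 for _, uses in rows if uses) / len(rows)


def share_of_students(rows):
    """Percentage of students enrolled in schools that report using the practice."""
    total = sum(n for n, _ in rows)
    in_practice = sum(n for n, uses in rows if uses)
    return 100.0 * in_practice / total


print(share_of_schools(schools))   # 50.0
print(share_of_students(schools))  # 70.0
```

Here half of the schools use the practice, yet 70% of students attend such schools because the schools using it are larger on average.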

Some of the indirect measures of teacher evaluation and accountability related to the

overall (as teams or school) accountability and evaluation processes in schools. Such

indirect high-stakes purposes of teacher evaluation included public accountability,

informing parents, comparisons and benchmarking across schools and districts and at

national level, and administrative tracking by an external authority. On public accountability, OECD (2010a) reported that an average of 37% of students attended schools that had public accountability. Such public accountability included making student achievement data open to the public through media, organizational websites, and other channels. In Belgium, Finland, Switzerland, Japan, Austria, and Spain, this practice was far less common. The United States and the United Kingdom had over 80% of students in schools with public accountability practices.

A related but different aspect of teacher accountability was sharing of student

progress with parents. On average, 52% of students across OECD countries studied in

schools where parents were provided with information on their children’s academic

performance. Countries like Austria, Italy, and the Netherlands had 80% of students studying in such schools. Administrative tracking of student achievement was in place in OECD

countries with an average of 66% of students attending schools with this practice. The

United States, the United Kingdom, and New Zealand were exceptional in this case as more

than 90% of students came from schools having this practice.

Using achievement data for instructional resource allocation was found in schools enrolling 33% of the student population across OECD countries. This figure was 70% for Israel, Chile, and the United States and less than 10% in Iceland, Greece, Japan, the Czech Republic, and Finland.

In addition to providing description of evaluation and accountability in schools in

countries covered by the PISA 2009 survey, OECD (2010a) also classified countries into

four categories. It used principals’ responses on various aspects of their schools’ evaluation

and accountability practices and purposes and, through a “latent country profile analysis,”

classified countries on the basis of use of achievement data for “benchmarking and information purposes,” and whether the data were used for various types of “decision-making in schools.” Appendix B gives the details of these categorizations. In this profile analysis,

countries that heavily monitored teachers’ practice (such as Australia, Canada, and Chile)

also had arrangements for sizeable public accountability, administrative tracking, and

monitoring yearly progress. Sixty-five percent of the use of student performance and

assessment was for monitoring teachers in these countries. In contrast, countries with least

emphasis on teacher monitoring also had less emphasis on public accountability,

administrative tracking, and informing parents about their children’s progress. These

countries included Austria, Belgium, Finland, Germany, and Greece. Countries with lesser

monitoring of teachers included Hungary, Norway, Turkey, Montenegro, Tunisia, and

Slovenia. These countries, however, had a higher focus on high-stakes consequences such as

public accountability and other external accountability measures in schools. Like Australia,

Canada, and Chile, countries such as Denmark, Italy, Japan, Spain, Argentina, Macao-China, Chinese Taipei, and Uruguay frequently used achievement data to monitor teachers’ practice. However, unlike Australia, Canada, and Chile, these countries had less of an external accountability focus.

Teacher evaluation: Findings from the TALIS 2008. The cross-national review of teacher evaluation systems provided so far is based on two key OECD reports published in 2005 and 2010. These two sources exclude the perspectives of teachers, who are the key targets of monitoring and evaluation in the context of this study. As a result, the two sources give an incomplete picture of teacher evaluation in schools. A more comprehensive picture of teacher monitoring and evaluation emerges from the TALIS, which was conducted by the OECD in 2008. This survey was administered to both teachers and principals in lower secondary schools in the participating countries. The TALIS is comprehensive in its coverage of key aspects of teacher evaluation practices that are significant in terms of the developmental and high-stakes purposes of the process.6 For example, the report gives detailed analyses of how performance appraisal and feedback are built into the evaluation systems, how much emphasis is placed on professional development in teacher appraisal and feedback, and how important a given evaluation criterion, for example student test scores, is in teacher appraisal in each TALIS country. The report also shows how internal and external evaluations are conducted in schools in these countries.

A key finding of the survey concerns the nature of internal and external evaluation and feedback in the TALIS countries. The TALIS shows that the sources of appraisal and feedback are usually found within schools, since more than 50% of teachers reported not having experienced external evaluation and feedback in the last five years. This indicates that teacher evaluation is situated predominantly within schools across the OECD countries, thereby making internal evaluation practices (such as by principals and peers) an important element to probe.

6 Contents in this section are adapted from OECD (2009b).

According to OECD (2009b), the majority of the countries covered in the TALIS 2008 used student test scores as criteria for teacher appraisal and feedback. Across the TALIS countries, more than 50% of the criteria for teacher appraisal and feedback consisted of student test scores (see Appendix C). Few countries had this criterion at less than 50%, with Denmark having about 29% of the teacher appraisal and feedback criteria consisting of student test scores. With slight variations, countries having a lesser focus on student test scores also had a lesser emphasis on innovative methods in teaching. On average, teachers accorded the highest importance (73%) to within-classroom processes as criteria in their evaluations.

One of the significant aspects of any teacher evaluation mechanism is its end result. The end result can be seen in how much a teacher evaluation process affects classroom teaching and other aspects of teachers’ professional lives. The TALIS

2008 survey captured information on these important elements of appraisal and feedback

(see Appendices D & E). In terms of the impact of teacher appraisal on teaching in TALIS

countries, teachers reported on the extent to which their appraisal changed various aspects of

their lives in schools (see Appendix D). Teacher responses showed that the greatest

emphasis (an average of 41%) was placed on raising student achievement in the form of

student test scores. Australia, Brazil, Bulgaria, Ireland, and Italy were some of the countries

with a heavy emphasis on student test scores in teacher appraisal systems. In addition,

classroom management, instructional practices, and developing professional development

plans were the next areas that teachers showed as receiving the highest impact in their

appraisals. In countries like Australia, Belgium (Fl.), Bulgaria, Hungary, Ireland, Mexico,

Norway, Slovenia, and Spain, teachers reported classroom management as one of the most

affected areas of their work (OECD, 2009b).

The TALIS 2008 survey gives insights into outcomes of teacher appraisal and feedback (see

Appendix E). In addition to the impact on teaching practices and skills, teachers also

reported on how their appraisal changed their service and salary structures. Some of the

outcomes that the OECD (2009b) report mentions are a change in financial incentives,

opportunities for professional development, and change in responsibilities. Analysis of such

outcomes shows that few teachers reported any direct monetary outcomes or long-term

career advancement as a result of their appraisals. On professional development as an

outcome of teacher appraisal, Bulgaria, Lithuania, Poland, and Slovenia showed a greater

focus on the developmental purposes of teacher evaluation. Mexico, Bulgaria, Brazil,

Poland, and Lithuania were some of the countries where teachers reported that teacher

appraisal and feedback had a higher impact on their teaching practice. At the same time,

these countries also had teachers in higher percentages who emphasized improving student

test scores in their teaching. Countries like Denmark, Austria, and Belgium (Fl.) had much

less emphasis (around or less than 20%) on improving student test scores in their teaching

and a development plan for improving practice. Among the least affected areas of teachers’

service were salary and the award of financial bonuses. For

example, only 0.4% of the teachers in Flemish Belgium reported that their appraisal led to a

moderate or large change in their salary. In contrast, 33% of teachers in Malaysia reported a

moderate to large change in their salary as a result of their appraisal. There was a high

correlation between how teacher appraisal affected teachers’ salary and financial rewards or

bonuses. The highest impact of teacher appraisal on any aspect of teachers’ lives as reported

by teachers was observed in Malaysia. Countries like Australia, Austria, Belgium (Fl.), and

Malta, on average, showed less change in any aspects of teachers’ lives as a result of

appraisal and feedback.

According to OECD (2009b), 62% of the principals in TALIS countries shared

results of appraisals with teachers. In Australia, Austria, Belgium, Bulgaria, Estonia,

Hungary, Poland, and the Slovak Republic, over 75% of the teachers worked in schools

where principals reported that they communicated results of the appraisals to teachers most

of the time. This percentage was 32 in Korea and 25 in Turkey. Furthermore, on average in

TALIS countries, most of the teacher appraisal happened within schools with very limited

reporting of underperformance to an outside authority. Only in Austria, Mexico, and Brazil

was such reporting more common, at 21%, 47%, and 27%, respectively. Principals who

reported that they never established an improvement plan in case of identification of a

weakness ranged from 11% in Poland and Estonia, to 23% in Austria.

While this section has set a background to teacher evaluation practices and purposes

in the OECD countries, the next two sections provide empirical evidence on how teacher

evaluation is linked to student achievement in schools.

Teacher Evaluation: Empirical Evidence

As Isoré (2009) mentions, teacher evaluation purposes—developmental or high-

stakes—do not always work in isolation. A teacher evaluation system may simultaneously

carry both the “developmental” and “high-stakes” purposes. Also, schools use insights

gained through the “high-stakes” approaches for “developmental” purposes and vice versa.

This crisscrossing of teacher evaluation purposes and practices poses an immense challenge

when categorizing literature into distinct themes of purposes and practices. However, as an

arbitrary arrangement and for the sake of simplicity, I have categorized empirical evidence

on teacher evaluation into two streams based on how evidence on teacher performance is

gathered in schools. If, in a given piece of empirical literature, the predominant mode of

gathering evidence on teacher performance was through instruments such as classroom

observations focusing on within-classroom “processes,” and if teachers received feedback as

part of their evaluations, I have included that piece of literature under the discussion on the

“developmental” approaches. Conversely, if the predominant approach to gathering

evidence on teacher performance was through student achievement chiefly in the form of

test scores with the purpose of making teachers accountable for their practice, I have

grouped such literature under the “high-stakes” approaches to teacher evaluation.

Thus, this literature review presents empirical evidence on teacher evaluation in two

broad streams. The first stream (e.g., Goe, Bell, & Little, 2008; Sartain et al., 2011;

Wenglinsky, 2002) consists of studies that explore standards-based approaches such as

classroom observation instruments and rubrics as well as subjective modes of teacher

evaluation. This stream explores standards-based and subjective teacher evaluation practices

with or without student test scores as measures of teacher performance. The second stream

(e.g., Goldhaber & Hansen, 2010; Sanders & Horn, 1994; Stronge & Tucker, 2000) consists

of literature on teacher evaluation approaches that use student test scores as a primary

measure of teacher performance. These teacher evaluation approaches may not necessarily

carry “developmental” aspects and almost always carry high-stakes purposes to make

schools and teachers accountable for their performance. Additionally, a limited amount of

empirical evidence also discusses effects on student achievement of interactions between the

different teacher evaluation approaches and other schooling aspects.

Developmental teacher evaluation and student achievement. Teaching is a

complex social process and accordingly it requires complex approaches to assessing its

quality. In this regard, a substantial amount of empirical evidence explores standards-based

(and subjective), developmental approaches to evaluating teachers.

Emphasizing the importance of teacher evaluations for teacher development and

mainly responding to the issue of smaller school effects in quantitative studies on student

achievement compared to student background effects, Wenglinsky (2002) posits that

quantitative research has often lagged behind in tapping the explanatory

power of the processes going on in classrooms. Given this gap in quantitative

educational research on what happens in classrooms,

Wenglinsky’s (2002) study is a significant step forward in driving quantitative research to

explore complex processes of assessing teachers’ practice and identifying their professional

development needs. His study was made feasible, as he mentions, by the availability of a

large database—the National Assessment of Educational Progress (NAEP)—that consists of

information covering aspects of classroom practices along with student, teacher, and school

level characteristics. His primary objective was to test the generalizability of insights that

the qualitative research provided on such subtle aspects of teaching and learning as

understanding and thinking skills of students. He refers to only two sources of literature

(National Center for Education Statistics [NCES], 1996; Cohen & Hill, 2000) that discussed

within-classroom aspects of teaching and learning using quantitative analysis of a large

dataset, NELS:88. Building on these earlier studies and using a multilevel structural equation

modeling (MSEM) approach, he finds that the effects of teaching quality, as reflected in a

teacher’s classroom practices such as a focus on higher-order thinking skills and raising the

bar for students, were as strong as, if not stronger than, other school-level factors. Thus, his study

appears to be a significant push for quantitative studies that focus on teacher evaluation

approaches that are developmental in nature and that are deeply connected to classrooms

through such instruments as classroom observations. Following Wenglinsky (2002), we see

many studies (Kimball, White, Milanowski, & Borman, 2004; Holtzapple, 2003;

Milanowski, 2004; Sartain et al., 2011; White, 2004) that explore teacher evaluation

practices that focus on within-classroom processes with the purposes of assessing and

developing teachers’ practice so as to improve student achievement.

Kimball, White, Milanowski, and Borman (2004) studied the relationship between

standards-based teacher evaluation scores awarded on the basis of Danielson’s

Framework for Teaching and student achievement. Teacher evaluation based on Danielson’s

framework can be considered as one of the many approaches that can be used to formatively

evaluate teachers in order to improve their professional practice. This framework consists of

four domains: planning and preparation, classroom environment, instruction, and

professional responsibilities (Danielson, 1996). Together, the domains comprise 22 components

that describe teaching competencies required of a teacher. The framework rates teacher

performance at four levels: unsatisfactory, basic, proficient, and distinguished. Kimball et al.

(2004) found in their multilevel statistical modeling that though the

relationships between teacher evaluation ratings and student achievement were positive in all

subjects and grades that they tested, coefficients were not statistically significant in all

cases. Only for reading in fourth grade and for each test in fifth grade did they find

positive significant coefficients. They conjecture that this situation may have resulted from a

mismatch between what is taught and what is examined in schools, in addition to the very

limited number of variables (only 7 out of 23) that they used as teacher evaluation scores in

their study. As the authors hint, this may have omitted important information on

teacher performance captured in the full set of teacher evaluation measures.

In contrast, Milanowski (2004) found small to moderate positive correlations in

each of the tested subjects. His study was similar to Kimball et al. (2004) in the use of the

Danielson Teaching Framework. Though the relationships were at best moderately positive,

he still considered them significant given that measuring teacher effectiveness using

standards-based evaluation rubrics may be noisy and influenced by a number of other

confounding factors. These relationships represented the significance of teachers’ practice in

relation to student learning and hence, teacher evaluation using standards-based evaluation

frameworks was a viable alternative for evaluating teachers. Furthermore, a combined

analysis of studies conducted at three sites by Milanowski, Kimball, and White (2004)

showed that the standards-based teacher evaluations have “…substantial positive

relationship with the achievement of the evaluated teachers’ students” (p. 19). All this meant

that the developmental purposes of teacher evaluations were significant in improving quality

of learning for students.

On standards-based approaches to teacher evaluation, Holtzapple (2003) carried out

his own analyses of the links between student achievement and teacher evaluation scores in Cincinnati’s Teacher

Evaluation System (TES) and found results similar to those of Milanowski (2004). TES is an

adapted version of Danielson’s framework (see Danielson, 1996) consisting of only 16

standards in the four domains of teaching (Holtzapple, 2003). However, Holtzapple’s

analysis showed that though the evaluation system successfully predicted performance at the

extremes of the ratings (unsatisfactory and distinguished), it did not effectively predict

student achievement at the middle (proficient and basic) level of teacher evaluation ratings.

Holtzapple (2003) used teachers’ evaluation score in the “Teaching for Learning” domain or

a composite of scores in all the four domains of the teaching standards. His analyses of

student gains and teacher evaluation scores showed that if teachers received “unsatisfactory”

and “basic” ratings on the “Teaching for Learning” domain, it negatively reflected on student

achievement as shown by a lower score relative to predicted score on the basis of prior

year’s achievement. Students taught by the “distinguished” teachers performed at the

expected level. He further mentioned that the TES was important in teachers’ professional

development as district providers aligned their training activities with the TES

standards and requirements. Also, teachers showed a change in their professional behavior

as they started reflecting on their practice in preparation for the TES. These are the aspects

of teacher evaluation that incorporate developmental purposes of teacher evaluation

(Danielson & McGreal, 2000).

Continuing the line of research on standards-based teacher evaluations, Sartain et al.

(2011) reported results for Chicago’s Excellence in Teaching Pilot, a program launched in

2008 to rebuild an effective teacher evaluation system. The program aimed at improving the

instructional quality by evaluating teachers’ performance and giving them constructive

feedback that targeted teachers’ professional development. Like the earlier studies on the

developmental approaches to teacher evaluations, school principals and external evaluators

in the pilot program observed teachers’ practice by using the Danielson Framework for

Teaching, and arranged for conferences to share with the teachers the outcomes of

evaluations. Their data consisted of extensive classroom observations by principals and

external evaluators (499 classroom observations for reliability check and 955 classroom

observations by principals alone for validity check), student achievement in value-added

frameworks, and interviews with teachers and principals on various aspects of teacher

evaluations including classroom observations. They found that the teachers who were

evaluated showed significant gains in the achievement of students whom they taught.

Teachers who participated in the qualitative part of the study also agreed that the evaluation

system had become more reflective, thereby leading to a significant improvement in their

practice.

Other studies (e.g., Taylor & Tyler, 2011; Tyler, Taylor, Kane, & Wooten, 2010)

focused on how classroom observations as instruments in developmental approaches to

teacher evaluation affected student achievement. These studies show that the classroom

observations (by observers such as principals, peers, and external evaluators) significantly

relate to student achievement. As Tyler, Taylor, Kane, and Wooten (2010) emphatically

stated:

…some of the strongest evidence to date that classroom observation measures

capture elements of teaching that are related to student achievement….moving from,

for example, an overall TES rating of “Basic” to “Proficient” or from “Proficient” to

“Distinguished” is associated with student achievement gains of about one-sixth to

one-fifth of a standard deviation. (p. 259)

Regarding the effects of classroom observations, Tyler et al. (2010) go deeper into

the dynamics of how various aspects of classroom observations predicted mathematics and

reading achievement. In their study, at a micro-level, a teacher who was stronger in managing the

classroom environment relative to teaching practices showed student performance that was

higher by 0.25 standard deviation (SD) in mathematics and 0.15 SD in

reading. Their study also showed that a teacher who focused more on inquiry-based teaching

compared to a teacher who focused on content produced larger gains in mathematics but no

effects in reading. Based on their findings, they posit that teachers may be making trade-offs

between various instructional objectives, as captured in various components of the

developmental evaluations of teachers. Thus, their study is significant in terms of the

elements of teaching that are important in raising student achievement in mathematics and

reading. Considering classroom observations mostly as part of the developmental teacher

evaluation practices and purposes, similar results appear in the analysis conducted by Schütz

et al. (2007) where they found that classroom observations by principal or senior staff

showed positive associations with student achievement.

Findings in Taylor and Tyler (2011) suggested that a student was predicted to score

higher (10% of an SD) in mathematics compared to a similar student who would have been

taught by the same teacher before the latter’s evaluation. One of the significant strengths of

their design was that they compared students of a given teacher

before and after the year of evaluation, lending higher internal validity to their

research design. They also controlled for various other factors important at student and

teacher levels such as teacher experience, student gender and ethnicity, and previous

achievement. One interesting result that is particularly important for my study is their

finding that the effects of evaluation through a standards-based evaluation approach were not

the same across all the evaluated teachers. Teachers who received a lower score before

evaluation showed higher student performance after evaluation suggesting a potential

“developmental teacher evaluation” relationship that may be associated with incentives and

consequences in the evaluation system itself or the critical feedback that teachers received

during their evaluations. Taylor and Tyler (2011) indicate that the exact dynamics

undergirding such relations between student achievement and teacher evaluation were not

clearly manifest in their study. However, they attributed the gain in student achievement to

the particular developmental features of evaluation system where teachers are provided

feedback on the skills that can have positive associations with student achievement. This

suggests potential benefits of the developmental teacher evaluation to enhance student

learning thereby rendering credibility and logic to the studies such as this one that aim to

explore the relationships between different evaluation purposes and approaches and student

achievement.

Wößmann et al. (2007) explored monitoring by principals, external observers, and

peers in mathematics. While I am analyzing these variables with a “developmental” lens

given the closer theoretical relevance of “monitoring” as a developmental activity (UNDP,

2009) in the larger teacher evaluation systems in schools, Wößmann et al. (2007) used more

of an accountability lens in analyzing findings around monitoring teachers in mathematics.

In their cross-country accountability analysis, they found positive but statistically insignificant

relationships of such monitoring with student achievement. However, such monitoring of

lessons turned significantly positive at school-level accountability at various significance

levels. In this regard, coefficients of monitoring by principals were more pronounced than

those of monitoring by an external authority, indicating a need to further probe principals’

observation of teachers and resultant associations with student achievement. Wößmann et al.

(2007) show that the principals observing teachers’ lessons had positive associations with

student achievement with significant effects coming into play at 10.5% significance level.

Classroom observation and monitoring by external evaluators showed positive relations with

student achievement in some instances after controlling for principals’ monitoring of

lessons.

Gallagher (2004) explored a teacher evaluation system that had elements of both

developmental and high-stakes approaches to assessing teacher effectiveness. A

predominant focus of the teacher evaluation system at his research site was assessing

within-classroom processes followed by feedback. Evaluators used a variety of approaches

to gathering evidence on teacher performance such as classroom observations, student work,

and lesson plans. The author mentions that while student work was sometimes used to

ensure documentation on teachers’ work, student achievement was not part of the formal

evaluation of teachers. In his study, Gallagher (2004) found strong and statistically

significant relationships between teacher evaluation scores and student achievement in

reading. The findings for mathematics were positive but statistically insignificant. Finally,

Rockoff and Speroni (2010) studied subjective and objective measures of

evaluating teachers. They studied teacher evaluations conducted by professional mentors

who worked with the new teachers and who made evaluations based on student achievement

as a result of first year of teaching of these new teachers. Findings in their study showed

a significant connection between student achievement and the evaluated teachers. Thus, in

sum, there is ample evidence that shows that the developmental approaches to teacher

evaluation can have significant positive associations with student achievement though some

studies also show statistical insignificance of such associations.

High-stakes teacher evaluation and student achievement. As stated previously,

one of the main purposes of high-stakes teacher evaluations is to judge teacher effectiveness

and make “consequential decisions” (Danielson & McGreal, 2000) relating to, for example,

personnel matters including hiring, firing, salary adjustment, and accountability. While there

is no literature that discusses high-stakes and developmental teacher evaluations in a

mutually exclusive fashion, the type of evidence used to assess teachers can be used as a

proxy to study such approaches in teacher evaluations. As described in the introductory

chapter, evidence of teacher performance can come in a variety of ways such as student test

scores, teacher peer reviews, and principal and staff observations. In the high-stakes

evaluations, a main source of evidence has been in the form of how well students perform in

various assessments. Student assessment and performance may come in a variety of forms

such as school-based tests and external standardized examinations.

Proponents (e.g., Sanders & Horn, 1994; Stronge & Tucker, 2000) contend that

student assessments as evidence of teacher effectiveness offer good tradeoffs for their

objectivity. These proponents suggest using value-added models (VAMs) that apply pretest-posttest designs to

statistically isolate teacher effects on student achievement from other confounding factors

that emanate at student, school, and family levels (Astin, 1982; Sanders & Horn, 1994).

According to Astin (1982), the VAM approach,

….unlike traditional measures such as the reputational view, the resources view, or

the outcomes view, promotes equity because it diverts attention away from mere

acquisition of resources and focuses instead on their effective utilization. Any school

is capable of attaining a significant degree of "excellence" through this method.

(Astin, 1982, Abstract)
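
The pretest-posttest logic behind value-added models can be illustrated with a minimal sketch. The Python example below is a deliberate simplification and is not the mixed-model method of Sanders and Horn (1994): it regresses posttest scores on pretest scores with ordinary least squares and treats the average residual of each teacher's students as that teacher's estimated value added. The function name `value_added`, the record layout, and all scores are hypothetical and serve only as an illustration; an actual VAM would also control for student, school, and family factors.

```python
from collections import defaultdict
from statistics import mean


def value_added(records):
    """Estimate a simple value-added score per teacher.

    records: list of (teacher_id, pretest, posttest) tuples (hypothetical layout).
    Returns a dict mapping teacher_id to the mean residual of that teacher's
    students from a pooled OLS regression of posttest on pretest.
    """
    pre = [r[1] for r in records]
    post = [r[2] for r in records]
    pre_bar, post_bar = mean(pre), mean(post)

    # Pooled OLS with one predictor: posttest = intercept + slope * pretest
    sxx = sum((x - pre_bar) ** 2 for x in pre)
    if sxx == 0:
        raise ValueError("pretest scores must vary to fit a regression line")
    sxy = sum((x - pre_bar) * (y - post_bar) for x, y in zip(pre, post))
    slope = sxy / sxx
    intercept = post_bar - slope * pre_bar

    # Residual = actual posttest minus the posttest predicted from the pretest
    residuals = defaultdict(list)
    for teacher, x, y in records:
        residuals[teacher].append(y - (intercept + slope * x))

    return {teacher: mean(values) for teacher, values in residuals.items()}


# Made-up scores for two hypothetical teachers, for illustration only
records = [
    ("A", 50, 62), ("A", 60, 71), ("A", 70, 82),
    ("B", 50, 52), ("B", 60, 61), ("B", 70, 72),
]
estimates = value_added(records)
```

With these made-up scores, teacher A's students score above their predicted posttest on average and teacher B's students below it, so A receives a positive estimate and B a negative one.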

To explore the efficacy of student test scores as measures of teacher effectiveness,

Bingham, Heywood, and White (1991) studied student performance in a large school system

with around 100,000 students. They explored student performance of fifth graders to see if it

could be used as a measure to evaluate teachers in high-stakes evaluations. They examined

over 500 independent variables that could potentially be related to student performance in

different ways. Through a residual and step-wise regression analysis they identified the

schools wherein teachers had added value to the students whom they taught. They conclude

that student performance can be used to evaluate teachers since their experiment showed

that teachers could be differentiated on the basis of how well they added value in student

learning. They state, however, that their approach could identify only the best and the worst

teachers. As they conducted their study in an experimental setting, they provided the caveat

that replicating findings in the real world would require robust data. They also

recommended that once the worst and the best teachers have been identified using their

method, efforts should be made to identify the best practices for replication in other

classrooms. Thus, these researchers propose that student performance lends itself as

viable evidence for high-stakes teacher evaluations.

Following Bingham et al. (1991), Wright, Horn, and Sanders (1997) explored

teacher effects on student performance. Given the arguments that non-random assignment of

students leads to a skewed assessment of teacher effectiveness in favor of those who receive

brighter students in their classes, they used a longitudinal dataset and gave special care in

their analyses to intra-class heterogeneity. They applied a mixed-model analysis of variance

to study the teacher effects on student achievement. In 20 of the 30 analyses that they

conducted, they found teacher effects to be larger than any other effects. Based on their

findings, they recommended using student achievement data to assess teachers. Wright et al.

(1997) stated that “Differences in teacher effectiveness were found to be the dominant

factor in student academic gain…. The use of student achievement data from an

appropriately drawn standardized testing program administered longitudinally and

appropriately analyzed can fulfill these requirements” (p. 66).

Wößmann et al. (2007), employing multi-level modeling techniques on the PISA

2003 dataset, reconfirmed findings from the earlier studies (e.g., from Bishop, 1997, 1999)

and asserted that the external exit exams had positive relationships with student achievement

as measured by test scores after controlling for student, family, school, and country level

factors. Their study revealed that the schools using external exit exams in accountability

measures had students performing significantly better than otherwise. Wößmann et al.

(2007) contend, however, that these relationships were indirect in the case of teachers and

schools, unlike students for whom there were direct incentives such as peer pressure for

learning.

OECD (2010a) found that schools that used standards-based external examinations,

which may be considered summative evidence of teacher performance in a high-stakes

evaluation, had higher (by 16 points) student achievement as measured by test

scores compared to schools that did not use such examinations. The same report states that

standardized tests conducted by schools had no discernible connection with student

performance, something which is true with regard to school performance in many countries.

Using student achievement data for accountability to the public such as through posting in

the media, informing parents about children’s progress, making decisions related to

allocation of resources, or tracking by administrative authorities had mixed relationships

with student performance. Another use of student achievement in the high-stakes

evaluations is in making comparisons across schools. School level accountability measures

such as comparing a school’s performance with district or national performance showed

positive relationships with student achievement (OECD, 2010a). OECD (2010a) also shows

that standardized examinations in combination with external exit exams as evidence of

teacher performance in high-stakes evaluations yielded positive associations with student

achievement. These aspects of accountability and use of achievement data were largely

summative in nature and were, on average, positively related to student achievement in

schools (OECD, 2010a). Based on its analyses, OECD (2010a) suggested that schools can

work with their accountability mechanisms to identify the best possible composition for

their accountability systems for optimal student learning outcomes.

Similarly, Goldhaber and Hansen (2010), using administrative data on teachers and

students (grades 4 or 5), showed that employing student test scores as evidence of teacher

performance in decisions relating to tenure (a high-stakes outcome) had significantly

positive effects on student achievement. When they restricted their analyses to those teachers whose

performance was observed before and after tenure, teachers who were not selected for

tenure had student achievement, on average, more than 11% of an SD lower than teachers

who were selected for tenure. They conclude that using student test scores to measure

teacher effectiveness is a rational method to predict student achievement, and therefore is a

better alternative to assess teacher quality than using observable characteristics such as

holding a bachelor’s or master’s degree. They caution, however, owing to a restricted

sample in many senses, against generalizing the findings to the entire teacher workforce and

designing policies around such high-stakes decisions as granting tenure and retention.

At the same time, mixed or counter evidence and arguments also exist where high-

stakes approaches such as public accountability either have a mixed effect (e.g., West &

Peterson, 2006) or are considered counterproductive (e.g., Wiggins & Tymms, 2002). West

and Peterson (2006) studied two accountability systems in Florida as parts of Florida’s A+

Plan and the No Child Left Behind (NCLB) Act. They found that Florida’s A+ Plan was more

effective compared to the NCLB at raising student achievement in schools labeled as “F”

and “D.” The authors attribute this effect to the targeted approach embedded in the A+ Plan,

where the lowest-performing 10% of the schools were labeled as “F” and “D” and the lowest

2% of schools faced the threat of the voucher. For the lowest 2%, the authors argue, the stigma

attached as a failing school worked over and above the threat of the voucher. In contrast, the

NCLB’s accountability approach with its dichotomous categorization of schools as making

or not making adequate yearly progress (AYP) had no significant impact on student

achievement. The authors argue that the NCLB with its less targeted approach where a

relatively large percentage of schools may be labeled as “needing improvement” did not

entail as great a threat of voucher or stigma of being labeled as low performing. Similarly,

Wiggins and Tymms (2002) studied accountability systems in English and Scottish primary

schools wherein the former published performance indicators in league tables while the

latter did not. They surveyed randomly selected schools in both the education systems to see

the perceptions of the key stakeholders on respective performance management systems.

They found that the English schools perceived their accountability systems as more

dysfunctional than the Scottish schools did. Also, in the case of the English system, schools pursued

narrow targets in the curriculum by focusing more on those students who could potentially

improve schools’ position in the league tables. This showed that a public accountability

approach such as through publishing of league tables as a single proxy indicator of school

performance may have unintended negative implications for teaching and student learning.

The authors argue that these single proxy indicators do not work in isolation in schools.

Various other educational processes such as the pay system, the testing method, and various

cultural elements may interact with each other in complex ways thereby giving rise to

sometimes unwanted outcomes in an otherwise well-intended accountability system. In

other words, it can be imagined that a given system of accountability may or may not be

effective contingent upon the type of incentives (high-stakes) involved and particular social

and cultural contexts in which the schools are operating.

In brief, a significant amount of empirical evidence supports the view that attaching

high-stakes consequences to teacher performance may raise student achievement chiefly in

the form of student test scores. However, high-stakes approaches can only serve limited

purposes of evaluation without offering much leeway to schools to improve teachers’

professional practice. Also, as it turns out, simplistic conceptions of measuring teacher

effectiveness using student test scores are not free of pitfalls. Scholars (e.g.,

Ravitch, 2010; Darling-Hammond, Amrein-Beardsley, Haertel, & Rothstein, 2012) caution

against relying exclusively on summative outcomes such as student test scores in assessing and

judging teacher effectiveness. According to such perspectives, attaching only high-stakes

consequences to teacher evaluations may create challenges for improving student

achievement. In this connection, some scholars suggest using summative evidence in the

form of student test scores as only one measure in the overall evaluation systems (Baker et

al., 2010; Glazerman et al., 2010; Goe et al., 2008; Mathis, 2012; Rothstein, 2010). These

scholars propose using additional measures in teacher evaluations such as standards-based

approaches including standard instruments of classroom observations, rubrics, and artifacts

of teacher work.

Interactions and student achievement. As stated earlier, high-stakes (and

accountability) and developmental purposes and practices do not always function in

isolation (Isoré, 2009; Looney, 2011). It so happens that in the complex interplay of

processes within schools, different aspects and purposes of teacher evaluation interact with

each other and with the other aspects of schooling in complex ways to generate significant

associations with student achievement.

For example, Wößmann et al. (2007) found significant interactions in which some aspects

of school governance, such as “autonomy in formulating budget,” which were not positive on

their own, turned significantly positive when combined with the accountability aspects

of schools. Also, autonomy in establishing

salaries standing alone showed negative relations with student achievement while it showed

positive relations when combined with external exit exams. Similar results appeared in their analysis

when accountability was combined with autonomy in determining course content but the

relationship remained statistically insignificant. Wößmann et al. (2007) attributed these

interactions to schools’ opportunistic behavior where an absence of accountability and

presence of autonomy led to negative implications for student achievement. In contrast,

when school autonomy was combined with some form of incentive through an

accountability system, schools avoided opportunistic behavior and took steps that led to

improved student achievement. On the part of students, standardized examinations standing

alone led to negative relations with student achievement depicting an absence of incentive

for students to perform. However, when standardized examinations were combined with

external exit exams, the interaction turned significantly positive on student achievement.

This leads one to imagine that placement of an incentive along with an accountability

measure in an evaluation system bears positive associations with student achievement.

Schütz et al. (2007) in their study on autonomy, choice, and equity in student

performance showed that some of the accountability measures such as external exit exams

interacted with certain other features such as the socioeconomic status (SES) of students.

They found that external exit exams were associated with an increase (beta coefficient of 37.42) in

student achievement. This coefficient decreased significantly but still remained significant

(beta coefficient of 3.66) when external exit exams were combined with students’ SES, thereby

suggesting relations of other accountability measures with student achievement (Schütz et

al., 2007). Thus, while there is limited evidence on how various evaluation instruments,

approaches, and purposes interact with each other and with the other facets of schooling, it

can be argued, as does this study, that the different purposes of teacher evaluation do not

work in isolation. Therefore, this study also looks at various interactions as they relate to

different purposes and approaches of teacher evaluations in target countries.

Conceptual Framework and Research Hypotheses

Figure 1 presents the conceptual framework for the study as well as the hypotheses

therein. This framework draws heavily from the past literature as well as my own

understanding and experience of working as a teacher and principal for well over a decade

in a developing world context. It is proposed in the framework that student achievement is a

composite outcome of a mix of many direct and indirect factors at the student, school, and

country levels.

First, Wenglinsky (2002) in his study shows positive effects of classroom practices

on student achievement. This should mean that evaluating classroom practices with the

purpose of identifying and augmenting best practices should relate to improved student

achievement. In this regard, studies such as Taylor and Tyler (2011) show that teacher

evaluation systems with a focus on identifying best practices and enabling teachers to

improve their practice relate to improved student achievement in schools. Based on this, I

hypothesize that similar dynamics will operate in the 21 OECD countries included in this

study. These countries have a varying national focus on the developmental aspects of

teacher evaluation (OECD, 2009b). At the same time, we can expect that not all schools

follow national policies with the same level of fidelity and rigor during implementation.

Therefore, there should be a significant variation in terms of the relationships of

developmental aspects with student achievement.

Hypothesis 1: Teacher evaluation and monitoring with a developmental focus

are associated with improved student achievement. This hypothesis is represented by

arrow 1 in Figure 1.

[Figure 1 diagram. Labels in the diagram: TE: High-stakes; TE (and monitoring in test language): Developmental; School Factors; School Resources; Country factors (Expenditure and Teacher perspectives); Background (Student and Family); Student Achievement. Arrows are numbered 1 through 6.]

Figure 1. Conceptual framework of the study. Arrows represent directions of

the relationships. Bold arrows 1, 2, and 3 represent hypotheses and the

direction of relationships.

Second, evidence (e.g., Wößmann et al., 2007) shows that high-stakes teacher

evaluation with accountability purposes is associated with improved student achievement.

Wößmann et al. (2007) found that external exit exams, observations by external observers

and principals, and interactions between external exit exams and standardized tests

were associated with improved student achievement in science and mathematics as measured in

the PISA 2003 survey. Evidence like this leads to the second hypothesis of the study. I

hypothesize that a focus on high stakes in teacher evaluation, with the purpose of making

teachers accountable and making critical decisions relating to, for example, financial

remuneration should lead teachers to work harder and perform better in schools. In other

words, teacher evaluation leading to high-stakes consequences should cause teachers to

work hard and raise student achievement.

Hypothesis 2: High-stakes teacher evaluation approaches are associated with better

student achievement. This is represented by arrow 2 in Figure 1.

Third, teacher evaluation in schools is a complex phenomenon where various

purposes of evaluation crisscross, overlap, and interact with the other aspects of schooling.

For example, a teacher evaluation conducted for accountability purposes may also have

implications for teachers’ professional development (Looney, 2011). Similarly, previous

studies (Schütz et al., 2007; Wößmann et al., 2007) have shown that the different aspects of

teacher evaluations interact positively with other schooling features thereby creating

positive associations with student achievement. In the light of this evidence, for example, it

can be imagined that the principal’s role in teacher evaluation and appraisal may be

influenced by the tensions generated by external demands for accountability and

internal dynamics of teacher quality and improvement. In sum, we can expect interactions

between various aspects of teacher evaluation and schooling features and their significant

implications for student achievement. This leads to my third hypothesis in the study:

Hypothesis 3: Teacher evaluation practices interact with other schooling aspects

thereby producing positive associations with student achievement. This is

represented by arrow 3 in Figure 1.

Arrows 4, 5 and 6 show relationships between student achievement and different

control variables at student, school and country levels.

Chapter 3. DATA AND METHODS

This chapter discusses the datasets, the variables, and the methods of the study. It

describes the two datasets that I have used in this study. It presents and explains the

variables included in different models. It also gives the rationale and the empirical and theoretical

bases for the variables included in the various regression models. The chapter gives

descriptive statistics such as means, standard deviations, and percentages of the independent

and dependent variables of the study. The methods include discussions of how I have

managed the data such as missing cases and data reduction. The chapter closes with a

discussion of the methods and models employed.

Datasets

This study uses two sources of data. First, it uses part of the PISA survey conducted

by the OECD in 2009 in 65 countries. The PISA is a cross-national, large scale survey

instituted for the first time by the OECD in 2000. The survey is conducted every three years

and includes a paper-pencil test in the three subject areas of Mathematics, Science, and

Reading. In a given cycle, the PISA gives additional emphasis to one of the three subjects

by capturing supplementary information on the subject in the survey. The focus of the PISA

2009 survey was reading. In addition to the paper-pencil tests in the three subjects,

questionnaires on student background, climate, resources and management of the school,

and home learning environment including parental background are administered in the

survey. The student tests are given to a sample of 15-year-olds in the sampled schools in

participating countries. An administrator in each test location ensures accurate distribution

of the test kits to the sampled students.

The student tests consist of two parts with a first two-hour session on cognitive skills

and knowledge in the three subjects and a half-hour session on background information on

students’ learning habits, attitudes, and motivation to learn. The knowledge and skills tests

explore how well students are able to connect their learning to real-life situations in living

environments. Thus, the tests are holistic: students must not only reproduce

knowledge but also apply that knowledge in their lives. Therefore, being

holistic, extensive in its scope on educational outcomes, and being cross-national, the PISA

survey gives a detailed and comparative snapshot of how 15-year-olds in participating

countries at the end of compulsory education are ready to face the real world (OECD,

2012). This way, the PISA datasets make it possible for policymakers to identify optimal

factors that can garner higher standards of student performance in schools in cross-national

perspectives.

Second, the study uses information from the OECD (2009b) report on the TALIS 2008.

Like the PISA survey, the TALIS is also a cross-sectional and cross-national survey

administered in 2008 to teachers and principals in 22 OECD and 2 partner countries. This

survey consists of extensive information on teachers in target schools. The survey provides

data related to work environment, beliefs, attitudes, and practices of teachers in participating

countries. One of the significant aspects of the TALIS is its extensive coverage of teacher

appraisal and feedback practices in lower secondary schools in the 24 countries. Principals

and teachers in the sampled schools furnish information on how teachers are evaluated in

schools, the criteria used, and the outcomes of teacher appraisal practices. Thus, unlike the

PISA survey which gives only principals’ responses on teacher evaluation practices, the

TALIS has a greater outreach by capturing teachers’ perspectives on teacher appraisal

practices.

While the PISA gives only principals’ perspectives on teacher evaluation practices,

the TALIS gives teachers’ perspectives, in addition to principals’, on teacher evaluation

practices and the effects of these practices on teachers’ professional lives. However, both

surveys have their own downsides. One downside of the TALIS is the absence of student

level information especially student achievement in the survey. This absence of student

achievement data in the survey makes it difficult to draw inferences on how teacher

evaluation practices relate to student achievement in schools. On the other hand, PISA is

limited by an absence of teacher perspectives on how their lives are affected by teacher

evaluation practices and purposes. Therefore, in order to compensate to some degree for

these shortcomings in each survey’s data, I have ventured in this study to use the PISA and

TALIS in conjunction so as to create a robust dataset that can provide an enriched picture of

teacher evaluation practices and purposes in schools. It needs to be noted here that I am

using secondary information on the TALIS that is provided by an OECD report. The

OECD (2009b) report furnishes information in the form of teacher responses captured as

percentages against various items on teacher appraisal and feedback in the TALIS survey

(See Appendices C-E). I am using this information as part of country level variables in the

regression models to control for teachers’ perspectives on teacher appraisal and feedback

practices and purposes.

Sample and Sampling Strategy

The PISA and TALIS surveys consist of complex survey designs which are

multilevel in structure. Except for the Russian Federation, students in the PISA 2009 survey

were sampled through a two-stage sampling process with the first stage being the school and

the second stage being students within schools (OECD, 2012). In the case of Russia, the first

stage was region rather than school. Further stratification of the schools was based on

characteristics such as school type, funding, and location (urban/rural). A model assuming

bivariate normal distribution for propensity was used to ensure maximum inclusivity with

exclusion limited to between 2% and 5% in the participating countries (OECD, 2012). A total

of 475,460 students participated in the survey representing around 26 million young

children in schools in the participating countries and economies.

The study uses part of the PISA 2009 data consisting of 21 countries that make up

the bulk of the sample in the TALIS 2008. The remaining PISA countries have been

dropped from the sample for this study since they were not part of the TALIS 2008 survey

and hence the teacher perspectives could not be used for these countries. Malaysia, Malta,

and the Netherlands, which were originally part of the TALIS 2008 survey, have also been

dropped from the analysis for various reasons. Malaysia and Malta were not part of the

PISA 2009 survey and have been dropped from the analysis. Furthermore, according to

OECD (2010b), participation in the Netherlands was too low with an un-weighted

participation rate of 16.7% thereby making it impossible to draw parametric inferences

about the target population in the Netherlands. Table 3.1 gives the number of cases

(212,955) in 8,116 schools in the sample, with Iceland having the fewest (3,646) and Mexico

the most (38,250) cases.

Table 3.1

Countries and Cases

Country           Country ID   No. of Cases   No. of Schools   Mean Student Weight
Australia                 36         14,251              353                 16.93
Austria                   40          6,590              282                 13.28
Belgium                   56          8,501              278                 14.04
Brazil                    76         20,127              947                103.49
Bulgaria                 100          4,507              178                 12.87
Denmark                  208          5,924              285                 10.35
Estonia                  233          4,727              175                  2.75
Hungary                  348          4,605              187                 22.94
Iceland                  352          3,646              131                  1.21
Ireland                  372          3,937              144                 13.41
Italy                    380         30,905            1,097                 16.40
Korea                    410          4,989              157                126.31
Lithuania                440          4,528              196                  8.95
Mexico                   484         38,250            1,535                 34.00
Norway                   578          4,660              197                 12.31
Poland                   616          4,917              185                 91.25
Portugal                 620          6,298              214                 15.34
Slovak Republic          703          4,555              189                 15.21
Slovenia                 705          6,155              341                  3.06
Spain                    724         25,887              889                 14.99
Turkey                   792          4,996              170                151.61
N                         --        212,955            8,116                    --

As expected of a survey, missing cases were encountered in the sample. These

missing cases were dealt with by list-wise deletion in some cases and dummy coding along

with country mean substitutions in the others. List-wise deletion resulted in about a

1.24% reduction in the original sample size by reducing the number of cases to 210,307.

More details on missing data management are given in the section on variables below.
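The two missing-data treatments named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually run for the study: the field names (`country`, `escs`, `score`) and the toy records are hypothetical, and the text does not say here which variables received which treatment. The last line only checks the reported reduction from 212,955 to 210,307 cases.

```python
from collections import defaultdict


def listwise_delete(rows, required):
    """Drop every record that is missing (None) on any required field."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


def country_mean_substitute(rows, field):
    """Fill missing values with the country mean and add a 0/1 missing flag."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        if r[field] is not None:
            sums[r["country"]] += r[field]
            counts[r["country"]] += 1
    out = []
    for r in rows:
        new = dict(r)
        missing = r[field] is None
        # Dummy variable marking records whose value was imputed.
        new[field + "_missing"] = 1 if missing else 0
        # Substitute only when the country has at least one observed value.
        if missing and counts[r["country"]] > 0:
            new[field] = sums[r["country"]] / counts[r["country"]]
        out.append(new)
    return out


def percent_reduction(before, after):
    """Percentage of cases lost when going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical toy records (not PISA data).
rows = [
    {"country": "ISL", "escs": 0.5, "score": 500.0},
    {"country": "ISL", "escs": None, "score": 480.0},
    {"country": "ISL", "escs": 1.5, "score": None},
    {"country": "MEX", "escs": -1.0, "score": 420.0},
]
complete = listwise_delete(rows, ["score"])          # drops the record with no score
filled = country_mean_substitute(complete, "escs")   # fills the missing escs value
reduction = round(percent_reduction(212955, 210307), 2)
```

Keeping the missing-value flag alongside the substituted mean lets a regression model estimate whether records with imputed values differ systematically from the rest.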

Though the students in the PISA 2009 survey were randomly sampled from within

schools, there may be factors that lead to selection bias. For example, a student who was

selected but did not participate in the survey or students in schools with greater enrollments

having greater chance of being selected than students in small schools may add to biases in

estimations of standard errors and coefficients. In order to offset such selection biases and

other sampling errors, the student level weights are introduced into the data files. These

student weights produce representative estimates for coefficients on continuous and

categorical variables (OECD, 2012). In the absence of these weights, the results will be

applicable only to the students who are part of the PISA 2009 survey and not to the entire

target population of students. Therefore, this study uses weights to ensure representativeness

to make meaningful and accurate inferences about the target population.
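To illustrate why the weights matter, the sketch below compares an unweighted and a weighted mean. It assumes, as is conventional in survey sampling, that a student's weight approximates the number of students in the target population that the sampled student represents. The scores and weights are invented for illustration and are not PISA values.

```python
def weighted_mean(values, weights):
    """Mean in which each observation counts in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Two hypothetical students: the first stands for 30 students in the
# population, the second for 10.
scores = [450.0, 550.0]
weights = [30.0, 10.0]

unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)
```

In this toy case the unweighted mean treats both students equally, while the weighted mean is pulled toward the student who represents more of the population.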

Variables and Missing Data Management

To make the models as parsimonious yet as rigorous as possible, I have selected only

those variables from the literature that had a strong relevance to student achievement. With

the outcome variables being student achievement in science, mathematics, and reading,

predictors are taken from three levels.

The outcome variables in this study are the student test scores in the three subjects of

mathematics, science, and reading. These test scores are based on knowledge and cognitive

tests administered to 15-year-olds in the participating countries. Student performance in

these tests is reported as plausible values (PVs) in the PISA survey. Each student in a given

subject has a set of five PVs. Reading has additional PVs to assess students’ digital reading

abilities known as Digital Reading Assessment (DRA). Since not all countries in the study

sample participated in the DRA, this study uses only five PVs in reading which are based on

paper-pencil tests.

PVs are not the actual test scores of students. Instead, through standard procedures

of multiple imputations, a range of scores is generated that depicts the range of points that a

student can possibly score. Put simply, PVs are:

…a representation of the range of abilities that a student might reasonably have.

(…). Instead of directly estimating a student’s ability θ, a probability distribution for

a student’s θ, is estimated. That is, instead of obtaining a point estimate for θ…a

range of possible values for a student’s θ, with an associated probability for each of

these values is estimated. Plausible values are random draws from this (estimated)

distribution for a student’s θ. (Wu & Adams, 2002, quoted in OECD, 2009c, p. 43)
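Because each student has five PVs per subject, a common way to use them (which may differ in detail from the procedure followed in this study) is to run the same analysis once per PV and then combine the five results with the standard multiple-imputation rules: the point estimate is the average across PVs, and the total variance is the average sampling variance plus (1 + 1/M) times the variance between the M estimates. The sketch below assumes the per-PV estimates and their sampling variances have already been computed; the numbers are invented.

```python
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Combine per-PV estimates with the multiple-imputation (Rubin) rules."""
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need at least two estimates and matching variances")
    point = mean(estimates)              # average of the per-PV estimates
    within = mean(sampling_variances)    # average sampling variance
    between = variance(estimates)        # variance between estimates (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return point, total


# Hypothetical estimates of one coefficient, one per plausible value.
pv_estimates = [500.0, 502.0, 498.0, 501.0, 499.0]
pv_variances = [4.0, 4.0, 4.0, 4.0, 4.0]

estimate, total_variance = combine_plausible_values(pv_estimates, pv_variances)
standard_error = total_variance ** 0.5
```

Averaging the five PVs for each student before analysis would not be equivalent, since it would understate the uncertainty that the between-PV term captures.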

Thus, using these PVs as outcome variables, this study predicts student achievement

using main predictors that include principals’ responses to questions in the PISA survey

relating to teacher monitoring, evaluation, and accountability. I have grouped these

predictors into three main categories with the first category having two sub-categories. I will

first describe these categories followed by an account of the empirical bases for selecting

these variables as predictors in my study.

Developmental. This category is further divided into two sub-categories: monitoring

in test language, and principals’ pedagogical role and use of student assessments for

instructional improvement.

Monitoring in test language. The PISA 2009 survey gathered supplementary

information by placing additional emphasis on reading (mentioned here as test language). In

the context of this study, test language is considered to be the same as the language in

reading (see footnote 5 on page 18). The PISA 2009 survey asked principals about their

approaches to monitoring teachers of test language in their schools. They were asked about

the kind of evidence that their schools used to monitor teachers in test language. The set of

variables consisting of principals’ responses to this question have been categorized as

“monitoring in test language.” Principals recorded their responses in “yes/no” format to indicate whether they

used “student achievement,” “peer reviews,” “principal and staff observations,”

and “observations by an external authority” as sources and tools to monitor teachers in test

language. Principals’ “yes” response has been coded as 1.

Earlier studies show that these sources of evidence on teacher performance bear

significant relations with student achievement. Classroom observations by principals and

senior staff were found to relate positively with student achievement (Schütz et al., 2007).

Observations by principals, peers, and external observers have also been found to relate

positively but insignificantly in teacher monitoring practices with accountability purposes

(Wößmann et al., 2007). Wößmann et al. (2007), however, showed that monitoring by

principals was more pronounced than external monitoring where principals’ observations of

teachers became significant at an alpha level of 10.5%. Only in a few instances did external

observations show positive relations with student achievement after controlling for

observations by principals. While Wößmann et al. (2007) looked at teacher monitoring in

mathematics with an accountability lens, I have looked at this group of variables with a

“developmental” lens. As defined in the introductory chapter, “monitoring” is on-going

tracking of the progress towards set goals (DAC, n.d.; UNDP, 2009). This means that the

objective of monitoring is to assess the direction and mode of progress so as to make

necessary adjustments down the line to ensure goal achievement. In this sense, monitoring

has a developmental purpose more than accountability.

Principals’ pedagogical role and the use of student assessments for instructional

improvement. This category includes items that seek information from principals on

developmental approaches to assessing teachers and finding ways to improve teacher

practice. These items capture principals’ categorical responses to the question of whether student

assessments are used to improve instruction in their schools. Principals’ “yes” response is

coded as 1. It also consists of items that ask principals whether they observe teachers in

classrooms and whether they suggest that teachers develop professionally. In this category, the

predominant tools that are of interest are use of student assessments to improve instruction

and classroom observations by principals. Additionally, since classroom observations

should lead to an action by principals such as arranging for professional development of

teachers, two additional items have been included regarding principals’ role in teachers’

professional development. These include principals making suggestions to teachers for improvement

and principals informing teachers about possibilities for updating their knowledge and skills.

Principals responded on a 4-point Likert scale as: 1=Never, 2=Seldom, 3=Quite often,

4=Very often. Responses 3 and 4 are linearly coded as 1.
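The recoding of the 4-point scale into a binary indicator can be sketched as below. The text states only that responses 3 and 4 are coded as 1; coding responses 1 and 2 as 0 and leaving invalid or missing responses as `None` are assumptions made for this illustration.

```python
LIKERT_LABELS = {1: "Never", 2: "Seldom", 3: "Quite often", 4: "Very often"}


def dichotomize(response):
    """Return 1 for 'Quite often'/'Very often', 0 for 'Never'/'Seldom'."""
    if response not in LIKERT_LABELS:
        return None  # assumption: invalid or missing responses stay missing
    return 1 if response >= 3 else 0


coded = [dichotomize(r) for r in [1, 2, 3, 4, None]]
```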

Previous studies show that the items included in the developmental category have

been found to relate significantly to student achievement. For example, Feldman and Tung

(2001) studied Data Based Inquiry and Decision Making (DBDM) in six schools. They

found that the use of student data by teachers and schools related positively with student

achievement in these schools. Wayman and Stringfield (2006) also found that the use of

technology in making sense of student data resulted in improved student performance.

These studies indicate the educational utility of using student achievement data in

educational decision making. This utility can also be construed as important in evaluating

teacher performance in schools. Furthermore, classroom observations in standards-based

teacher evaluations have been found to relate positively to student achievement. For

example, Kimball et al. (2004) in their study found positive and significant relationships

between teacher evaluation ratings and student achievement. However, such relationships

were not significant in all grades. Coefficients were positive only in reading in grade 4 and

in all tests in grade 5. Other studies found moderately positive relations between student

achievement and teacher evaluation using standard measures such as classroom

observations based on pre-defined rubrics (e.g., Kimball et al., 2004; Milanowski, 2004;

Milanowski et al., 2004; Sartain et al., 2011; Taylor & Tyler, 2011; Tyler et al., 2010).

While these studies showed positive links between standard measures of teacher

performance and student achievement, Holtzapple (2003) highlights that teacher evaluations

involving classroom observations through standard rubrics successfully predicted

performance only at the extremes (unsatisfactory and distinguished) of the ratings. In the

middle ranges of the performance ratings the approach was not as effective (Holtzapple,

2003). Thus, there is a significant empirical base that identifies positive, though at best mixed,

relations between student achievement and the developmental approaches to evaluating

teachers.

High-stakes. This category includes items such as the use of student assessments to

evaluate teachers and to judge their effectiveness, whether student assessments are tracked by an

external authority, and whether such assessments are posted publicly. Principals’ “yes” response is

coded as 1. Theoretically, and in most cases practically, such uses of student assessments

carry high-stakes outcomes for teachers in their evaluations. Therefore, this category has

been named as “high-stakes.” Studies show that the use of student achievement and

assessments as evidence of teacher performance in teacher evaluations is associated positively

with student achievement (Bingham, Heywood, & White, 1991; Goldhaber & Hansen,

2010; OECD, 2010a; Wößmann et al., 2007). There is also evidence that such use of student

assessments does not always lead to positive relations (OECD, 2010a). According to OECD

(2010a), external exams had positive effects on student test scores whereas standardized

tests conducted internally by schools had no noticeable relationship with student test scores.

In their study of students of grades 4 and 5, Goldhaber and Hansen (2010) find that student

test scores used as evidence of teacher performance in high-stakes teacher evaluations

showed significantly positive relations with student achievement. Similarly, accountability,

including public accountability as an approach in evaluations in schools has been found to

have mixed effects (OECD, 2010a). Furthermore, mixed or counter evidence and arguments

have also been suggested in other studies where the use of accountability in general does not

always lead to improved student achievement or expected outcomes (e.g., West & Peterson,

2006; Wiggins & Tymms, 2002).

Interactions. As Looney (2011) mentions, the high-stakes purposes of teacher

evaluation crisscross with its developmental purposes. Also,

various schooling features have been found to interact with each other to produce positive

relations with student achievement (Schütz et al., 2007; Wößmann et al., 2007). Thus, in the

light of such empirical evidence, I have produced three interaction terms with the

assumption that teacher evaluation practices will create meaningful interactions with other

schooling features.

The first interaction term is created between the relative importance of classroom

observations as a tool in teacher appraisal and feedback and parents being informed about

the progress of their children. Research (e.g., Fan & Chen, 2001; Hoover-Dempsey &

Sandler, 1997; Ingram, Wolfe, & Lieberman, 2007; Jeynes, 2012; Sui-Chu & Willms, 1996)

suggests that parents play a significant role in the education of their children. Parents

involve themselves in schooling of their children for different reasons. These reasons may

include their belief about parental role in children’s education, parents’ sense of efficacy in

enabling children to succeed, parents’ belief that parental involvement will improve

children’s performance, and parents’ perceptions of the way children and schools like them

to get involved in schooling (Hoover-Dempsey & Sandler, 1997). Empirical evidence also

suggests that parental role establishes positive relationships with student achievement in

different school settings (e.g., Fan & Chen, 2001; Ingram et al., 2007; Jeynes, 2012; Sui-

Chu & Willms, 1996). It is with this theoretical and empirical grounding that this interaction

term scrutinizes if informing parents, as one form of accountability of teacher performance,

had any link with classroom observations becoming more important in teacher appraisal

practices in schools. Classroom observations are a predominant tool in teacher evaluations

across countries (Isoré, 2009). Classroom observations are also considered to be one of the

most effective tools in probing within-classroom processes and interactions (Danielson &

McGreal, 2000). Therefore, I assume that when parents are involved in the education of

their children and when such an involvement also includes aspects of accountability, it

should lead principals to be mindful of the quality of teaching practices in classrooms.

Principals would be closely monitoring teachers so that their schools are able to present to

parents quality reports regarding the performance of their children.

The second interaction term is created between importance of classroom observation

as a tool in teacher appraisal and feedback and principal being responsible for making salary

changes. Wößmann et al. (2007) found positive interactions between “autonomy in

formulating budget” and accountability practices in schools. They also found that principals’

autonomy in schools interacted positively with external exit exams. This interaction points

towards a possible underlying dimension where principals’ authority to make changes in

teachers’ salaries may convey a “high-stakes” message to the teachers. This “high-stakes” message

may play out in the form of classroom observations becoming a tool to push teachers to

work hard to produce better student achievement. Accordingly, teachers may resort to

classroom practices that can produce better student achievement and hence a favorable

outcome in their evaluations. Therefore, this interaction explores any relationship between

classroom observations being significant in terms of student achievement when principals in

schools hold some level of authority involving high-stakes implications for teachers. In

other words, this term explores any effect(s) of principals having a significant authority in

making changes in teachers’ salaries and if this factor is important in making classroom

observations an effective teacher evaluation tool in relation to student achievement.

The third interaction term consists of principals observing classes “quite often” or “very

often” and school type being “private.” OECD (2010a) shows that while there is no

significant relationship between reading performance and governance type after controlling

for socioeconomic factors, there is a significant difference in the index of school principals’

leadership between public and private schools. The index of school leadership evaluates

principal’s pedagogical role in improving quality of instruction and overall learning

environment in schools. Therefore, the aim of this interaction is to uncover any relationship

between principals’ observations of teachers in classes as a teacher development tool and

school type being private with the underlying assumption that principals are more assertive

as reflected in a higher index for their leadership in private schools compared to the public

schools.
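
The three interaction terms described above are products of coded indicators. The following is a minimal Python sketch of that construction, not the procedure actually run for this study. All field names (`parents_informed`, `obs_important`, `principal_sets_salary`, `principal_observes_often`, `public`) are hypothetical stand-ins for the coded survey items, and the two school records are invented.

```python
# Hypothetical school records; every field name and value is illustrative.
schools = [
    {"parents_informed": 1, "obs_important": 1, "principal_sets_salary": 0,
     "principal_observes_often": 1, "public": 1},
    {"parents_informed": 0, "obs_important": 1, "principal_sets_salary": 1,
     "principal_observes_often": 1, "public": 0},
]

def add_interactions(row):
    """Return a copy of the record with three product terms appended."""
    out = dict(row)
    # School type is coded with "public" as 1, so private is the complement.
    private = 1 - row["public"]
    out["parents_x_obs"] = row["parents_informed"] * row["obs_important"]
    out["obs_x_salary"] = row["obs_important"] * row["principal_sets_salary"]
    out["observes_x_private"] = row["principal_observes_often"] * private
    return out

with_terms = [add_interactions(r) for r in schools]
```

Because each component is a 0/1 indicator, a product term equals 1 only for schools where both conditions hold, which is what lets the coefficient on the term capture the joint condition.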

Control variables at student, school, and country levels. This study uses a number

of control variables that are significant with respect to student achievement. At the student

level, student’s family background such as socioeconomic status, individual background


such as gender, grade (Fuchs & Wößmann, 2007), home language if it is other than test

language, and immigration status whether the student is a first generation immigrant (Zhang

& Lee, 2011) have been included in the study models. School level control variables include

school type, student-teacher ratio (Demir, Kılıç, & Ünal, 2010; Zhang & Lee, 2011),

percentage of girls, proportion of qualified teachers, proportion of computers connected

to the Internet, and shortage of teachers (Zhang & Lee, 2011). At the country level,

dollars spent on education (obtained as a product of GDP and percent expenditure on

education) and teacher evaluation criteria and outcomes have been included in the models.

The teacher evaluation criteria and outcomes include 8 and 6 variables respectively that are

based on information from the OECD (2009b) report on the TALIS 2008 (see Appendices

C, D, & E). These criteria and outcomes are converted into three components through

principal component analysis. Indices are created through regression scores (see the

section on data reduction below). Complete details on measurement, definitions, and coding

schemes can be found in Appendix F for all the variables included in the study.

Missing data management. Like any survey, the PISA survey also suffers from

issues of missing cases. The PISA 2009 being a representative survey of the target

population, this study assumes these missing cases to be MCAR (Missing Completely at

Random). With this assumption, this study approached missing cases in two ways. First,

missing cases in small proportions in control variables were dropped from the analysis

through a list-wise deletion to keep the balance in the sample design. All other missing cases

were first managed through Multiple Imputations (MIs). While every missing data

management approach has its own pros and cons, MI has benefits over other approaches

such as list-wise deletions. MIs give a number of plausible values for missing cases based


on the number of imputations. These plausible values carry the uncertainty and errors

associated with the missing values thereby giving more stable estimations (Rubin, 1987).

With these benefits of the MIs, I ran five imputations for each missing variable followed by

running the models containing five sets of multiply imputed datasets and recorded the

results. However, the MI approach did not work out for technical reasons. Running the

regression analyses on multiply imputed datasets using the standard procedures

recommended by the OECD left the average R-squared values unreported in the

Stata® outputs. To deal with this issue, I followed a second approach to manage the

missing data. First, a total of 2,648 cases were dropped from the analysis. These missing

cases came from only two control variables: student grade (663 cases) and the index of

socioeconomic and cultural status of students (1,985 cases). This list-wise deletion of cases

reduced the original sample by 1.24%. In all other instances, dummy variables were created

for missing values, followed by country mean substitutions in the original missing cases. The

models were re-run using the new datasets. Both approaches (MI, and the

dummy variables with country mean substitutions) returned similar results in terms of the

direction and significance of the coefficients. However, the advantage of the dummy

variables and country mean substitutions was that the R-squared statistic could be retrieved to assess the fit

of the models. Therefore, only the results produced through dummy variables and country

mean substitution have been reported and discussed in this dissertation.
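
The second approach (a missing-value dummy plus country mean substitution) can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the analysis: the function name `flag_and_fill`, the field names, and the data are hypothetical, and an unweighted country mean is assumed because the text does not say whether weights were applied at this step.

```python
from collections import defaultdict

def flag_and_fill(rows, var, group="country"):
    """Add a 0/1 missing-value dummy for `var`, then replace missing
    values with the mean of the observed values in the same group.
    Assumes every group has at least one observed value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        if r[var] is not None:
            sums[r[group]] += r[var]
            counts[r[group]] += 1
    filled = []
    for r in rows:
        new = dict(r)
        new[var + "_missing"] = 1 if r[var] is None else 0
        if r[var] is None:
            new[var] = sums[r[group]] / counts[r[group]]
        filled.append(new)
    return filled

# Invented records for two hypothetical countries.
rows = [
    {"country": "A", "teacher_shortage": 0.5},
    {"country": "A", "teacher_shortage": None},
    {"country": "A", "teacher_shortage": 1.5},
    {"country": "B", "teacher_shortage": -0.2},
]
filled = flag_and_fill(rows, "teacher_shortage")
```

Both the filled variable and its dummy then enter the regression, so the dummy absorbs any systematic difference between cases with observed and substituted values.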

Descriptive statistics. Table 3.2 gives weighted descriptive statistics such as

means and standard deviations for the dependent, main independent, and control variables.

Table 3.2 shows in the first block the dependent variables, which are student test scores

in the knowledge and cognitive tests administered in the PISA 2009 survey. According to


Table 3.2

Descriptive Statistics for Main and Control Variables

| Variable | M | SD | Min | Max |
|---|---|---|---|---|
| Dependent | | | | |
| Aggregate plausible value in Math | 447.83 | 98.20 | 21.00 | 802.31 |
| Aggregate plausible value in Science | 455.70 | 94.60 | 37.71 | 839.74 |
| Aggregate plausible value in Reading | 458.00 | 96.34 | 67.60 | 847.10 |
| Developmental | | | | |
| Teacher monitoring in test language | | | | |
| Student achievement (“Yes” coded as 1) | 0.70 | 0.46 | 0.00 | 1.00 |
| Peer reviews (“Yes” coded as 1) | 0.71 | 0.45 | 0.00 | 1.00 |
| Principal and staff observations (“Yes” coded as 1) | 0.62 | 0.49 | 0.00 | 1.00 |
| Observations by external authority (“Yes” coded as 1) | 0.30 | 0.46 | 0.00 | 1.00 |
| Principals’ pedagogical role and use of student assessments for instructional improvement | | | | |
| Classroom observations by school principal (“Quite often” and “very often” coded as 1) | 0.58 | 0.49 | 0.00 | 1.00 |
| Principals suggesting teachers for improvement (“Quite often” and “very often” coded as 1) | 0.81 | 0.38 | 0.00 | 1.00 |
| Principals informing teachers for updating knowledge and skills (“Quite often” and “very often” coded as 1) | 0.91 | 0.29 | 0.00 | 1.00 |
| Assessments used for instructional improvement (“Yes” coded as 1) | 0.81 | 0.39 | 0.00 | 1.00 |
| High-Stakes Teacher Evaluation | | | | |
| Public accountability for student performance (“Yes” coded as 1) | 0.34 | 0.47 | 0.00 | 1.00 |
| Student assessments used for evaluating teachers (“Yes” coded as 1) | 0.61 | 0.49 | 0.00 | 1.00 |
| Student assessments used for judging teacher effectiveness (“Yes” coded as 1) | 0.63 | 0.48 | 0.00 | 1.00 |
| Student assessments tracked by an administrative authority (“Yes” coded as 1) | 0.73 | 0.44 | 0.00 | 1.00 |
| Student | | | | |
| Age | 15.78 | 0.29 | 15.25 | 16.33 |
| Girl (coded as 1) | 0.51 | 0.50 | 0.00 | 1.00 |
| Grade compared to modal grade in the country | -0.17 | 0.75 | -3.00 | 3.00 |
| First generation immigrant (coded as 1) | 0.02 | 0.13 | 0.00 | 1.00 |
| Second generation immigrant (coded as 1) | 0.01 | 0.11 | 0.00 | 1.00 |
| Home language other than test language (coded as 1) | 0.04 | 0.20 | 0.00 | 1.00 |
| Index of socioeconomic and cultural status | -0.73 | 1.25 | -5.71 | 3.55 |
| School | | | | |
| Principal’s sex (“Female” coded as 1) | 0.40 | 0.49 | 0.00 | 1.00 |
| School type (“Public” coded as 1) | 0.84 | 0.37 | 0.00 | 1.00 |
| School size | 890.17 | 756.24 | 2.00 | 11268.00 |
| Teacher shortage | 0.23 | 1.17 | -1.02 | 3.34 |
| Proportion of qualified teachers | 0.87 | 0.26 | 0.00 | 1.00 |
| Percent girls | 50.19 | 17.00 | 0.00 | 100 |
| Student teacher ratio | 21.56 | 16.07 | 0.27 | 723.00 |
| Proportion of computers connected to web | 0.88 | 0.25 | 0.00 | 1.00 |
| Country | | | | |
| Professional outcomes (e.g., student test scores, retention and pass rates) as teacher evaluation criteria | -9.18e-09 | 2.43 | -8.12 | 2.75 |
| Others (e.g., parental feedback and relations with colleagues) as teacher evaluation criteria | -1.55e-08 | 1.09 | -1.66 | 2.36 |
| Outcomes and impact of teacher evaluation | -1.57e-09 | 2.10 | -4.83 | 4.63 |
| Dollars spent on education | 883.44 | 560.51 | 336.40 | 3912.80 |

N = 210307

this table, the mean score for mathematics in the 21 countries is 447.83 (SD = 98.20).

Similarly, mean scores for science and reading are 455.70 (SD = 94.60), and 458.00 (SD =


96.34) respectively. These descriptive statistics have been obtained on the aggregate means

of all five plausible values for all students in each of the tested subjects.
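
As a rough illustration of how such weighted descriptive statistics can be obtained, the sketch below averages the five plausible values for each student and then takes a weighted mean and standard deviation. The student records and weights are invented, and the variance here divides by the sum of the weights; this is an assumption, since statistical packages may apply a different denominator.

```python
import math

def weighted_mean_sd(values, weights):
    """Weighted mean and standard deviation (variance divided by the
    sum of the weights)."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)

# Invented students: five plausible values each, plus a sampling weight.
students = [
    {"pvs": [400.0, 410.0, 390.0, 405.0, 395.0], "weight": 2.0},
    {"pvs": [500.0, 510.0, 490.0, 505.0, 495.0], "weight": 1.0},
]

# Aggregate score per student: the simple mean of the five plausible values.
aggregate = [sum(s["pvs"]) / len(s["pvs"]) for s in students]
mean, sd = weighted_mean_sd(aggregate, [s["weight"] for s in students])
```

The student with weight 2.0 counts twice as much as the other, which pulls the weighted mean toward 400.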

The first sub-category under the “developmental” block in Tables 3.2 and 3.3

consists of the independent variables that capture the sources of evidence used in

monitoring the practice of teachers in test language. A good majority of students were

enrolled in schools where principals (M = 0.70, SD = 0.46) reported having used

student achievement data over the last year to monitor teachers in test language. The least

used approach (M = 0.30, SD = 0.46) was observations by external authority. As Table

3.3 shows, 69.96% of students were enrolled in schools that used student achievement data

in teacher evaluations as against only 30.26% of students enrolled in schools with observations

by an external authority as a means to monitor teachers in test language. Schools using peer

reviews and principal and staff observations had 71.30% and 62.07% (SD = 0.49) of students

enrolled respectively (see Table 3.3).

The second sub-category under the “developmental” blocks in Tables 3.2 and 3.3

shows the descriptive statistics on principals’ pedagogical role as it relates to teacher

evaluation and use of student assessments for instructional improvement. On average,

57.65% of the students studied in schools where principals observed teachers in classes

“quite often” or “very often” (Table 3.3). In contrast, only 6.94% of students were enrolled in

schools where principals never observed teachers in classes. This suggests that classroom

observation is a fairly common means by which principals assess teachers’ performance

in classes, and that information from such observations may be used to evaluate teachers and possibly

to arrange or suggest professional development for them. Principals were also found to

often suggest improvements to teachers’ practice, with a mean response of 0.81


Table 3.3

Frequencies and Percentages of Main Categorical Variables

| Main Variable | Categorical Response | Freq. | Percent |
|---|---|---|---|
| Developmental | | | |
| Teacher monitoring in test language | | | |
| Student achievement used to assess teachers in test language | Yes | 147,1278 | 69.96 |
| | No | 59,788 | 28.43 |
| Peer reviews used to assess teachers in test language | Yes | 149,958 | 71.30 |
| | No | 57,310 | 27.25 |
| Principal and staff observations used to assess teachers in test language | Yes | 130,547 | 62.07 |
| | No | 75,981 | 36.13 |
| Observations by external authority used to assess teachers in test language | Yes | 63,636 | 30.26 |
| | No | 142,555 | 67.78 |
| Principals’ pedagogical role and use of student assessments for instructional improvement | | | |
| Classroom observations by school principal | Never | 14,597 | 6.94 |
| | Seldom | 72,097 | 34.28 |
| | Quite often | 91,790 | 43.65 |
| | Very often | 29,437 | 14.00 |
| Principals’ suggestions to teachers for improvement | Never | 2,707 | 1.29 |
| | Seldom | 34,101 | 16.21 |
| | Quite often | 103,665 | 49.29 |
| | Very often | 67,468 | 32.08 |
| Principals informing teachers for updating knowledge and skills | Never | 1,217 | 0.58 |
| | Seldom | 16,508 | 7.85 |
| | Quite often | 88,619 | 42.14 |
| | Very often | 101,690 | 48.35 |
| Assessments used for instructional improvement | Yes | 169,864 | 80.77 |
| | No | 30,748 | 14.62 |
| High-stakes teacher evaluation | | | |
| Public accountability for student performance | Yes | 71,075 | 33.80 |
| | No | 134,822 | 64.11 |
| Student assessments used for evaluating teachers | Yes | 129,214 | 61.44 |
| | No | 76,887 | 36.56 |
| Student assessments used for judging teacher effectiveness | Yes | 132,391 | 62.95 |
| | No | 67,064 | 31.89 |
| Student assessments tracked by an administrative authority | Yes | 153,685 | 73.08 |
| | No | 51,277 | 24.38 |

Note: Frequencies for missing cases have been omitted from Table 3.3.

(SD = 0.38), as shown in Table 3.2. This mean corresponds to 81.37% of the students being enrolled

in schools where principals often or very often made suggestions to teachers for improvement.

Similarly, 90.49% (M = 0.91, SD = 0.29) of students were enrolled in schools where principals

often or very often informed teachers about possibilities for updating their knowledge and

skills. Regarding the use of student assessments for instructional purposes, which may

essentially have a developmental focus from a teacher evaluation perspective, 80.77% (M =

0.81, SD = 0.39) of the students studied in schools where principals claimed that they used

student assessments for instructional purposes (see Tables 3.2 and 3.3). This highlights a

predominance of developmental approaches to teacher evaluation in schools in the 21 countries.

The third blocks in Tables 3.2 and 3.3 give descriptive information on high-stakes

approaches to teacher evaluation in the 21 countries. Relatively fewer students (33.80%)

were enrolled in schools with public accountability in place. Accordingly, the mean

response for this categorical variable stood at 0.34 (SD = 0.47). In contrast, the

predominant modes of high-stakes approaches to teacher evaluation were through the use of

assessments for teacher evaluation (M = 0.61, SD = 0.49), tracking of assessments by an


administrative authority (M = 0.73, SD = 0.44), and judging teacher effectiveness (M =

0.63, SD = 0.48). Student enrollment in schools with these practices stood at 61.44%,

73.08%, and 62.95% respectively for each of the high-stakes teacher evaluation approaches.

The student, school, and country blocks in Table 3.2 show means and standard deviations for the

independent control variables at student, school, and country levels. As Table 3.2 shows,

mean student age in years in the 21 countries is 15.78 (SD = 0.29). Girls, coded as 1,

slightly outnumber boys (M = 0.51, SD = 0.50). Mean modal grade

for these countries comes out to be -0.17. Modal grade was computed as an index to capture

between-country variations (OECD, 2012). A mean of -0.17 (SD = 0.75) shows that the

average modal grade for students in the 21 countries was below the expected modal grade

which is given a value of 0 on the modal grade index. The index of socioeconomic and

cultural status shows a mean of -0.73 (SD = 1.25). Immigration status of students was coded

as 1 if they were first generation immigrants. Similarly, second generation immigrant status

was also coded as 1. Regarding immigration status, 2% (SD = 0.13) of students were first

generation immigrants in the 21 countries. Both these immigrant students and their parents

were born outside the country in which students took the PISA tests. In addition, 1% (SD =

0.11) of students had second generation immigrant status, meaning that these students, but

not their parents, were born in the country of assessment. Home language, if it was other

than test language, was coded as 1. For 4% (SD = 0.20) of students, home language was other

than the test language.

As Table 3.2 shows, the large majority (84%, SD = 0.37) of students are enrolled

in public schools. Schools with a female principal enrolled 40% (SD = 0.49) of students.

Sex of the principal when it was female was coded as 1. Similarly, school type being public


was also coded as 1. School size shows huge variation, with a mean enrollment of 890

students (SD = 756.24). The average student-teacher ratio stands at 21.56 (SD =

16.07). The PISA 2009 survey asked principals about other aspects of schooling as well

such as teacher shortage. Teacher shortage measured on an index showed a mean of 0.23

(SD = 1.17) suggesting only a moderate shortage of teachers in the sampled schools. Eighty-

seven percent (SD = 0.26) of the students were enrolled in schools where teachers had

qualifications equivalent to International Standard Classification of Education (ISCED) 5A

level.7 With respect to the technological resources, 88% (SD = 0.25) of students studied in

schools where computers were connected to the Internet.

As Table 3.2 shows, there are four country control variables. Three of the four

variables are derived as components through exploratory Principal Component Analysis (see

the next section on data reduction). These components are formed using findings on teacher

responses on items related to teacher appraisal criteria and outcomes in the TALIS 2008 as

reported in OECD (2009b). Indices have been generated for all three components through

regression after running the component analysis. The first component has been named

“professional outcomes” and includes such teacher evaluation criteria as student test scores

and retention and pass rates. The second component has been named “others.” This

7 Proportion of qualified teachers is defined as teachers having ISCED 5A level of

qualifications. ISCED 5A qualifications correspond to bachelor’s, master’s, or equivalent

qualifications designed to provide theoretical grounding to students in subjects of their

interest so as to enable them to gain entry to more advanced, research oriented tertiary level

of education.


component carries two teacher evaluation criteria, namely feedback from parents and

relations with colleagues. The third component has been named “teacher evaluation

outcomes and impact” that captures information on such “high-stakes” consequences as

changes in teachers’ salaries and advancement in their careers. The last variable shows that

an average of 883.44 (SD = 560.51) dollars per capita income per child was spent on

education in these countries.

Table 3.4 gives correlations among the main categorical predictors of the study.

The table shows low to moderate correlations among different predictors. The two variables

that are least correlated (r = -.03) are principals informing teachers for updating knowledge

and skills and observations by external authority in monitoring teachers of test language. On

the contrary, the two most correlated (r = .41) variables are the student assessments used for

evaluating teachers and judging teacher effectiveness. These two variables also moderately

correlate with student achievement used as evidence in monitoring of teachers in test

language with correlations of .33 and .31 respectively. This is intuitive since in all these

approaches of monitoring and evaluation, the primary source of evidence for teacher

performance is student performance in different types of assessments. With these moderate

values we can still keep these predictors in the models for data analysis.

Similarly, principals observing teachers in classes shows moderate positive

correlations with principals suggesting teachers for improvement (r = .33) and principals

informing teachers about updating knowledge and skills (r = .22). Likewise, principals

suggesting teachers for improvement and informing them about possibilities for updating

their knowledge also have a moderately positive correlation (r = .38). Another moderate

correlation (r = .40) can be seen between principals observing classes often and very often


Table 3.4

Correlations among Main Predictors

| Predictor | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. Student achievement | 1 | | | | | | | | | | | |
| 2. Peer reviews | .292 | 1 | | | | | | | | | | |
| 3. Principal and staff observations | .338 | .228 | 1 | | | | | | | | | |
| 4. Observations by external authority | .206 | .076 | .346 | 1 | | | | | | | | |
| 5. Classroom observations by school principal | .212 | .161 | .403 | .160 | 1 | | | | | | | |
| 6. Principals suggesting teachers for improvement | .152 | .206 | .157 | .059 | .332 | 1 | | | | | | |
| 7. Student assessments used for instructional improvement | .161 | .247 | .091 | .079 | .071 | .160 | 1 | | | | | |
| 8. Principals informing teachers for updating knowledge and skills | .096 | .140 | .049 | -.026 | .224 | .379 | .134 | 1 | | | | |
| 9. Public accountability for student performance | .152 | .086 | .121 | .043 | .077 | .087 | .078 | .086 | 1 | | | |
| 10. Student assessments used for evaluating teachers | .331 | .185 | .261 | .121 | .263 | .226 | .168 | .136 | .205 | 1 | | |
| 11. Student assessments tracked by an administrative authority | .270 | .158 | .224 | .153 | .165 | .170 | .161 | .119 | .174 | .320 | 1 | |
| 12. Student assessments used for judging teacher effectiveness | .307 | .195 | .239 | .107 | .195 | .210 | .343 | .104 | .119 | .410 | .270 | 1 |

Note: Column numbers correspond to the numbered predictors in the rows.


and principals and staff observing classes in monitoring practice of test language teachers.

This suggests that principals who frequently observe classes may do so in general for all

subjects including reading. Another important moderate correlation is found between

student assessments used for instructional improvement and student assessments used for

judging teacher effectiveness. These two variables establish a correlation of .34 with one

another. The tracking of student assessments also establishes moderate correlations with the

use of student assessments for evaluating teachers (r = .32) and student assessments used for

judging teacher effectiveness (r = .27).

Based on these moderate correlations, it is logical to include these variables in the

regression models of this study. All these predictors were checked for multicollinearity

using the Variance Inflation Factor (VIF). No significant

multicollinearity issues were noted, with a mean VIF of 2.21.
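
For readers unfamiliar with the VIF, the sketch below shows the standard definition, 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The example is a two-predictor simplification only: with just two predictors, R² equals the squared pairwise correlation, so the largest correlation in Table 3.4 (r = .410) is used as input. It does not reproduce the reported mean VIF of 2.21, which comes from the full models.

```python
def vif_from_r2(r_squared):
    """Variance inflation factor for a predictor whose regression on the
    remaining predictors yields the given R-squared."""
    return 1.0 / (1.0 - r_squared)

# Two-predictor illustration: R-squared equals the squared correlation.
r = 0.410
vif_pair = vif_from_r2(r ** 2)
```

A correlation of .410 on its own therefore implies only a small inflation of variance (a VIF of about 1.2), which is consistent with the judgment that moderate correlations of this size do not rule out keeping the predictors together.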

Data Reduction

As can be seen in Appendices C-E, the information taken from the TALIS 2008

to represent country level constructs of teacher evaluation falls into some 30 variables. It

can be imagined that a number of variables may be measuring the same phenomenon as

regards teachers’ appraisals and feedback suggesting underlying dimensions that cut across

this full range of items. Such an assumption holds valid in the real life of a school where

various aspects of teacher evaluation in schools often do not work in isolation. Complexity

of the school environment allows us to imagine many of these aspects to be correlated in

significant ways. Such underlying dimensions can be uncovered through a factor or

component analysis (Thomson, 2004). This study uses principal component analysis to see

any theme(s) cutting across this long list of variables coming from the TALIS 2008. The


purpose of the component analysis was also to reduce the number of variables into viable

components that can then be used within the available degrees of freedom in the regression

analyses so as to make more logical connections between teacher appraisal practices and

student achievement. Two separate principal component analyses were carried out followed

by score generation for each component.

The first component analysis and score generation was run on teacher appraisal

criteria wherein 15 teacher evaluation criteria (see Appendix C) were subjected to this

procedure. The TALIS 2008 originally asked teachers about 17 criteria used for their appraisal

and feedback. However, I dropped two criteria “teaching students with special learning

needs,” and “teaching in a multicultural setting” from the component analysis for the reason

that OECD (2009b) reported that these two variables received relatively low importance in

teacher appraisal and feedback. The TALIS 2008 asked teachers how important the 15

criteria were when they received their appraisal and/or feedback. Teachers responded on a scale

of 1-5 with 1 representing “I do not know if it was considered” and 5 representing

“considered with high importance.” OECD (2009b) gives the last two responses “considered

with moderate importance,” and “considered with high importance” as percentages of

teachers who reported so.

Before carrying out the component analysis, I ran a simple correlation on these

variables which showed that some items were highly correlated with one another. This

meant that the information captured by one variable was essentially the same as captured by

the second variable in the highly correlated pair. This also meant that keeping both the

highly correlated variables in the analysis would lead to inflation of variance explained by

these variables as well as redundancy of the information contained in the resulting


components. Therefore, one variable from each pair of variables with a

correlation greater than .95 was dropped from further component analysis. This left

a total of 8 variables in the criteria on teacher appraisal, and the component analysis was run

accordingly.
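
The screening step described above can be sketched as follows. This is an illustration under assumptions: the source does not state which member of each highly correlated pair was dropped, so the sketch keeps the first variable encountered; the function name, variable names, and correlation values are all invented.

```python
def drop_highly_correlated(names, corr, threshold=0.95):
    """Keep variables in their listed order, skipping any variable whose
    absolute correlation with an already kept variable exceeds the
    threshold."""
    kept = []
    for j, name in enumerate(names):
        if all(abs(corr[j][names.index(k)]) <= threshold for k in kept):
            kept.append(name)
    return kept

# Invented correlation matrix for three hypothetical criteria.
names = ["test_scores", "pass_rates", "parent_feedback"]
corr = [
    [1.00, 0.97, 0.40],
    [0.97, 1.00, 0.35],
    [0.40, 0.35, 1.00],
]
kept = drop_highly_correlated(names, corr)
```

Here `pass_rates` is dropped because it correlates at .97 with `test_scores`, while `parent_feedback` is retained.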

Appendix G gives the results of component analysis and score generation for

these variables. As Table G1 in Appendix G shows, two components gave eigenvalues

(EV) greater than 1. The first component carried an EV of 5.97 while the second carried an

EV of 1.15. Cumulatively, these two components explained 89% of the variance attributed

to the 8 variables in teacher appraisal criteria. These components were

subjected to promax factor rotation. Table G2 shows the factor rotations of the 8 criteria

that teachers rated as important or highly important in their evaluations. Six of these criteria

loaded onto the first component with component loadings varying between 0.34 and

0.39. These criteria included student test scores, retention and pass rates, other student

learning outcomes, direct appraisal of classes, innovation in teaching, and professional

development undertaken by the teachers. A close scrutiny of these criteria reveals that these

are mostly counted as professional outcomes that teachers are supposed to show in their

performance appraisals. Therefore, I have named this component “professional

outcomes” as evidence of teacher performance in teacher appraisals and feedback.

The second component consists of parental feedback on teaching and relations with

colleagues. Since these criteria are not directly related to professional outcomes expected of

a teacher, I have named this component “others.” The component loadings were 0.56 and 0.51

for the two criteria respectively. Table G3 in Appendix G gives scores for each variable

in the two components. These scores were predicted through regression.
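
The reported share of variance can be checked with simple arithmetic, assuming the component analysis was run on the correlation matrix (an assumption; in that case each standardized variable contributes a variance of 1, so the total variance equals the number of variables). The function name below is hypothetical.

```python
def share_explained(eigenvalues_kept, n_variables):
    """Share of total variance carried by the retained components when
    the analysis is based on a correlation matrix."""
    return sum(eigenvalues_kept) / n_variables

# Eigenvalues reported for the two retained components, 8 input variables.
criteria_share = share_explained([5.97, 1.15], 8)
```

The result, (5.97 + 1.15) / 8 = 0.89, agrees with the 89% of variance reported for the teacher appraisal criteria.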


The second component analysis was run on outcomes of teacher evaluation reported

as teacher percentages in OECD (2009b). A total of 13 variables (see Appendices D-E) were

subjected to this procedure. I dropped two variables—teaching students with special

learning needs and teaching in a multicultural setting—for the reason that I mentioned in the

first component analysis on teacher appraisal criteria and feedback. The TALIS 2008 asked

teachers to what extent their appraisal and feedback directly led to or involved changes in

the 13 aspects of their professional lives. These included such aspects as their salary,

financial rewards, and teaching knowledge and skills. Teachers responded on a scale of 1-5

with 1 representing “no change” and 5 representing “a large change.” OECD (2009b) gives

the last two responses, “a moderate change” and “a large change” as percentages of teachers

who reported so. Before running the component analysis, a simple correlation was run on

these variables. Like the high correlations among some variables in the teacher appraisal

criteria, results on teacher evaluation outcomes also showed some of the variables as highly

correlated with each other. Therefore, 7 of the 13 variables were dropped and component

analysis was run on only six variables capturing information on teacher appraisal outcomes

and impact.

Table H1 in Appendix H gives results of the principal component analysis for

these variables. One component was retained; it carried an EV of 4.41 and explained 74% of the

variance in the six variables. Table H2 in Appendix H gives factor

rotations for the retained component. Table H2 shows that the variables loaded

differently, with loadings ranging between 0.34 and 0.45. These variables include how

teacher appraisal and feedback impacted teachers with respect to the emphasis placed on

improving student test scores, a change in salary, career advancement, public recognition,


professional development opportunities, and teachers’ role in school development. This

component has been named “outcomes and impacts of teacher evaluation.” Table H3

gives scores predicted through regression.

Methods

This study employs Ordinary Least Squares (OLS) as the method of analyzing the

data. Originally, the study was conceived as two- and three-level modeling using the statistical

software package Stata®. However, some logistical issues related to computing resources

and technicalities hindered the use of multilevel modeling in the software in which the

researcher was trained. One of the alternatives was to use the OLS but it had its own

challenges.

PISA being a large scale international survey has a sample with multilevel structure,

and therefore poses challenges as regards meeting the basic assumption of the independence

of observations. The OLS can give unbiased estimates with correct standard errors only

when we have a truly random sample and the observations are independent. Thus, one

possible caveat of using OLS could have been the dependence of observations due to the

multilevel structure of the data where the observations within a stratum (e.g., school) may be

dependent in some aspects. This would have violated the independence of observations.

However, the OLS can still give unbiased estimates by applying special procedures

suggested by OECD (2009c). In their study that makes use of the PISA 2006 dataset, Zhang

and Lee (2011) also propose that the OLS can give unbiased estimates by using weights.

Therefore, OLS with the necessary weights was a logical choice for this dissertation

within the limitations of the study, while still providing the needed robustness to the

models employed.


Since the PISA 2009 has a two-level sampling structure, the student sample is not

proportional to the population of the same age group in the sample countries. The Balanced

Repeated Replication (BRR) method accounts for such technical issues by including

weights at the student and school levels. According to OECD (2009c), “a replicate sample is

formed simply through a transformation of the full sample weights according to an

algorithm specific to the replication method. These methods therefore can be applied to any

estimators – means, medians, percentiles, correlations, regression coefficients…” (p. 74).8

With this provision of BRR, I ran the OLS models using the plausible values (PVs) within the standard guidelines provided by the OECD (2009c). As mentioned previously, the PVs are five scores representing five possible values for student performance in each of the three tested subjects. Because PVs are not actual student scores, running a single regression on the mean of the five PVs could bias the statistical parameter of interest. To avoid this, the statistics of interest, which in this case are regression coefficients, are calculated separately for each of the five PVs in each subject. The reported coefficients are the averages of the five individual regressions. This gives unbiased estimators of population variance and statistics (OECD, 2009c).
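The procedure described above can be sketched in code. The following is a minimal illustration in Python (the dissertation itself used Stata), not the author's actual routine. It assumes the Fay-adjusted variant of BRR with a factor of 0.5 and 80 replicate weights, and it uses synthetic data; the function names (`wls`, `pv_regression`) and all simulated values are hypothetical. In particular, the random replicate weights below are placeholders for the replicate weights that ship with the PISA student file.

```python
import random
import statistics

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def wls(X, y, w):
    """Weighted least squares: solve (X'WX) b = X'Wy."""
    p = len(X[0])
    xtwx = [[sum(wi * r[j] * r[k] for r, wi in zip(X, w)) for k in range(p)]
            for j in range(p)]
    xtwy = [sum(wi * r[j] * yi for r, yi, wi in zip(X, y, w)) for j in range(p)]
    return solve(xtwx, xtwy)

def pv_regression(X, pvs, w_final, w_reps, fay_k=0.5):
    """Run one regression per plausible value and combine the results.

    pvs: list of M outcome lists (one per plausible value)
    w_reps: list of G replicate-weight lists
    Returns averaged coefficients and standard errors that combine
    sampling variance (replicates) and imputation variance (across PVs).
    """
    n_pv, n_rep, p = len(pvs), len(w_reps), len(X[0])
    coefs, samp_vars = [], []
    for y in pvs:
        b = wls(X, y, w_final)
        reps = [wls(X, y, w) for w in w_reps]
        # Fay-adjusted BRR sampling variance for each coefficient
        sv = [sum((r[j] - b[j]) ** 2 for r in reps) / (n_rep * (1 - fay_k) ** 2)
              for j in range(p)]
        coefs.append(b)
        samp_vars.append(sv)
    beta = [statistics.fmean([c[j] for c in coefs]) for j in range(p)]
    se = []
    for j in range(p):
        sampling = statistics.fmean([sv[j] for sv in samp_vars])
        imputation = statistics.variance([c[j] for c in coefs])
        se.append((sampling + (1 + 1 / n_pv) * imputation) ** 0.5)
    return beta, se

# Synthetic data (hypothetical): one predictor, five PVs, 80 placeholder replicates
random.seed(0)
n = 400
x = [random.gauss(0, 1) for _ in range(n)]
X = [[1.0, xi] for xi in x]
latent = [500 + 10 * xi + random.gauss(0, 20) for xi in x]
pvs = [[s + random.gauss(0, 5) for s in latent] for _ in range(5)]
w_final = [random.uniform(0.5, 2.0) for _ in range(n)]
w_reps = [[w * random.choice((0.5, 1.5)) for w in w_final] for _ in range(80)]

beta, se = pv_regression(X, pvs, w_final, w_reps)
print("coefficients:", [round(b, 3) for b in beta])
print("standard errors:", [round(s, 3) for s in se])
```

The point of the sketch is the order of operations: the regression is estimated once per plausible value and the five sets of coefficients are averaged afterwards, rather than averaging the plausible values first and running a single regression.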

With all the predictors and interactions described earlier, models were run using

standard procedures recommended by the OECD (2009c). Equations 1-3 below represent

the models that I ran for the three subjects separately:

$$y_i = \alpha_0 + \alpha_1[\text{Monitoring in Test Language}] + \alpha_2[\text{Developmental}] + \alpha_3[\text{High-stakes}] + e_i \tag{1}$$

$$y_i = \alpha_0 + \alpha_1[\text{Monitoring in Test Language}] + \alpha_2[\text{Developmental}] + \alpha_3[\text{High-stakes}] + \sum \beta_i X_i + \sum \delta_i Y_i + \sum \eta_i Z_i + e_i \tag{2}$$

$$y_i = \alpha_0 + \alpha_1[\text{Monitoring in Test Language}] + \alpha_2[\text{Developmental}] + \alpha_3[\text{High-stakes}] + \alpha_4[\text{Interactions}] + \sum \beta_i X_i + \sum \delta_i Y_i + \sum \eta_i Z_i + e_i \tag{3}$$

8 For more on this, read OECD (2009c).

In the equations above, yi is the score of student i in mathematics, science, or reading. The bracketed terms, labeled “Monitoring in Test Language,” “Developmental,” and “High-stakes,” represent the main variables that capture teacher monitoring and evaluation practices and purposes in schools. The variables covering monitoring in test language are included only in the reading models and are not part of the models for mathematics and science. The term ei is the error term associated with the model at the outcome level.

Equation 2 represents model 2, which adds control variables at the student, school, and country levels to the main variables. The terms ∑βiXi, ∑δiYi, and ∑ηiZi are the summed contributions (coefficient times predictor) of the control variables from the three levels. One significant feature of the models employed is the use of interaction terms. Equation 3 represents model 3, which adds a set of interaction terms to the main and control variables.
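The nesting of the three models can be illustrated with a small sketch. The Python code below is an illustration on synthetic data, not the dissertation's estimation: the variable names (`developmental`, `high_stakes`, `escs`), the simulated effects, and the choice of interaction are all hypothetical. It builds the design matrices for a main-predictors model, a model with one control added, and a model with one interaction added, and shows that the share of explained variance cannot fall as terms are added.

```python
import random

def ols(X, y):
    """OLS coefficients from the normal equations (Gauss-Jordan elimination)."""
    p = len(X[0])
    a = [[sum(r[j] * r[k] for r in X) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(X, y))]
         for j in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    return [a[j][p] / a[j][j] for j in range(p)]

def r_squared(X, y, b):
    """Share of outcome variance explained by the fitted model."""
    mean_y = sum(y) / len(y)
    fitted = [sum(bj * xj for bj, xj in zip(b, r)) for r in X]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Synthetic data; every name and effect size here is hypothetical
random.seed(1)
n = 1000
developmental = [random.choice((0, 1)) for _ in range(n)]  # school practice dummy
high_stakes = [random.choice((0, 1)) for _ in range(n)]    # school practice dummy
escs = [random.gauss(0, 1) for _ in range(n)]              # student-level control
score = [500 + 5 * d + 8 * h + 25 * e + 4 * d * e + random.gauss(0, 40)
         for d, h, e in zip(developmental, high_stakes, escs)]

m1 = [[1.0, d, h] for d, h in zip(developmental, high_stakes)]  # main predictors
m2 = [row + [e] for row, e in zip(m1, escs)]                    # plus control
m3 = [row + [row[1] * row[3]] for row in m2]                    # plus interaction

r2 = [r_squared(X, score, ols(X, score)) for X in (m1, m2, m3)]
print([round(v, 3) for v in r2])
```

An interaction term is simply the product of two predictors entered as an additional column, which is why model 3 contains model 2, and model 2 contains model 1.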


Chapter 4. RESULTS AND ANALYSES

In this chapter, I will present results of the study which are based on various

regression models used for the three subjects of mathematics, science, and reading. There

are three different models in each subject. The first model carries only the main predictors.

Model 2 consists of control variables for student, school, and country characteristics in

addition to the main predictors. Model 3 in each subject carries an additional set of

interaction terms along with the main predictors and control variables.

Determinants of Student Achievement in Mathematics

Table 4.1 gives regression results for all the three models in mathematics.

Developmental and high-stakes approaches to teacher evaluation.

Developmental and high-stakes approaches to teacher evaluation in mathematics consisted of eight variables. According to Table 4.1, in the absence of control variables, teacher evaluation with a developmental focus showed a largely negative association with student achievement in mathematics. Principals observing teachers in their classrooms related negatively (b = -1.456, p = .704) to student achievement; however, this relationship was not significant. Principals suggesting teachers for improvement showed a large negative coefficient (b = -21.235, p < .001). Similarly, principals informing teachers about possibilities for updating their knowledge and skills also showed a relatively large and significant negative coefficient (b = -14.179, p < .05). Assessments used for improvement in instruction related positively and significantly to student achievement in mathematics: in the absence of interactions and control variables, attending a school that used assessments for instructional improvement was associated with a 12.393 (p < .01) score point increase in individual student test scores in mathematics.

Table 4.1

Determinants of Student Achievement in Mathematics

| | (1) Main predictors | (2) With control variables | (3) With interactions |
|---|---|---|---|
| Developmental (Principals’ pedagogical role and use of student assessments for instructional improvement) | | | |
| Classroom observations by school principal | -1.456 (-0.38) | 1.345 (0.51) | 1.658 (0.61) |
| Principals suggesting teachers for improvement | -21.235*** (-4.98) | -3.070 (-1.26) | -3.520 (-1.45) |
| Principals informing teachers for updating knowledge and skills | -14.179* (-2.20) | -8.329* (-2.41) | -8.271* (-2.43) |
| Assessments used for instructional improvement | 12.393** (2.81) | 2.871 (0.81) | 2.350 (0.66) |
| High-Stakes | | | |
| Public accountability for student performance | 17.474*** (5.00) | 9.595*** (4.22) | 9.652*** (4.38) |
| Student assessments used for evaluating teachers | -21.199*** (-5.77) | -3.403 (-1.50) | -3.686 (-1.63) |
| Student assessments tracked by an administrative authority | -7.470* (-2.17) | -1.707 (-0.76) | -1.772 (-0.80) |
| Student assessments used for judging teacher effectiveness | -8.194* (-2.08) | 2.482 (0.83) | 2.104 (0.70) |
| Interactions | | | |
| Classroom observations given moderate to high importance in teacher evaluation x parents are informed about their children’s progress | | | 0.221*** (4.24) |
| Classroom observations given moderate to high importance in teacher evaluation x principal is responsible for making salary changes | | | 0.079 (1.89) |
| Principal observes classes x independent private school | | | -7.526 (-1.25) |
| Student Controls | | | |
| Student age | | -11.251*** (-7.91) | -11.045*** (-7.89) |
| Girl | | -18.465*** (-24.84) | -18.509*** (-24.94) |
| Grade | | 31.580*** (31.67) | 31.430*** (31.82) |
| Index of social, cultural and economic status | | 23.483*** (36.60) | 23.380*** (36.16) |
| First generation immigrant | | -30.774*** (-12.00) | -30.714*** (-11.90) |
| Second generation immigrant | | -22.456*** (-5.95) | -22.564*** (-5.94) |
| Home language other than test language | | -6.683*** (-3.78) | -7.303*** (-4.13) |
| School Controls | | | |
| Principal’s sex (female) | | -12.379*** (-5.37) | -11.979*** (-5.24) |
| Public school | | -15.906*** (-5.97) | -16.843*** (-5.15) |
| School size | | 0.000 (0.31) | 0.001 (0.45) |
| Teacher shortage | | -4.014*** (-3.88) | -3.875*** (-3.70) |
| Proportion of qualified teachers | | 1.784 (0.54) | 1.664 (0.50) |
| Proportion of girls | | 0.166** (2.82) | 0.166** (2.84) |
| Student teacher ratio | | -0.498*** (-5.66) | -0.490*** (-5.58) |
| Proportions of computers connected to Web | | 17.840*** (4.25) | 17.648*** (4.22) |
| Country Controls | | | |
| Professional outcomes (e.g., student test scores, retention and pass rates) as teacher evaluation criteria | | -13.499*** (-17.62) | -14.003*** (-17.09) |
| Others (Feedback from parents, relations with colleagues) as teacher evaluation criteria | | -5.624*** (-6.00) | -5.855*** (-6.34) |
| Outcomes and impact of teacher evaluation | | 3.891*** (5.53) | 3.255*** (4.30) |
| Dollars spent on education | | -0.366 (-1.74) | -0.393 (-1.88) |
| _cons | 491.487*** (81.87) | 674.023*** (28.03) | 654.823*** (27.07) |
| N | 210307 | 210307 | 210307 |
| Average R2 | 0.079 | 0.422 | 0.424 |

t statistics in parentheses

* p < 0.05, ** p < 0.01, *** p < 0.001
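The link between the t statistics in parentheses and the significance markers can be sketched as follows. This is an illustrative Python snippet, not the procedure used in the dissertation: the reference distribution behind the reported p-values is not stated, so a standard normal approximation is assumed here, and the results may differ slightly from the p-values quoted in the text. The helper names `two_sided_p` and `stars` are hypothetical.

```python
import math

def two_sided_p(t):
    """Two-sided p-value for a t statistic under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2))

def stars(p):
    """Significance markers as defined in the table notes."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

# (coefficient, t statistic) pairs taken from column (1) of Table 4.1
rows = {
    "Classroom observations by school principal": (-1.456, -0.38),
    "Principals suggesting teachers for improvement": (-21.235, -4.98),
    "Assessments used for instructional improvement": (12.393, 2.81),
}
for label, (b, t) in rows.items():
    p = two_sided_p(t)
    print(f"{label}: b = {b}, t = {t}, p = {p:.3f} {stars(p)}")
```

Under this approximation, a t statistic of about 1.96 in absolute value corresponds to the 5% threshold marked with a single asterisk.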

With the introduction of control variables at the student, school, and country levels and the interaction terms, the main predictors in the developmental category behaved differently than in the absence of such variables. The direction of the relationship for principals observing classes changed to positive with a coefficient of 1.345, though it remained insignificant (p = .613). With the interaction terms in model 3, this variable remained almost unchanged (b = 1.658, p = .543). Principals suggesting teachers for improvement reduced to an insignificant negative coefficient in the presence of control variables in model 2 (b = -3.070, p = .211) and with the interactions in model 3 (b = -3.520, p = .152). The use of assessments for instructional improvement also reduced to a small, insignificant positive coefficient of 2.871 (p = .420) in model 2 with control variables and 2.350 (p = .511) with the interaction terms in model 3. The relationship for principals informing teachers for updating their knowledge and skills remained negative and significant, with a smaller coefficient (b = -8.329, p < .05) than in model 1. With the interaction terms in model 3, the coefficient remained significant at the 5% level (b = -8.271, p < .05).

With regard to high-stakes approaches to teacher evaluation, all but one variable related negatively to student achievement in mathematics in the absence of control variables and interaction terms. Table 4.1 shows that public accountability, such as publishing student performance in the media and other outlets, was associated with higher student achievement: an increase of 17.474 (p < .001) score points in mathematics. However, using student assessments to evaluate teachers (as in formal evaluations), tracking of student assessments by administrative authorities, and using student assessments to judge teacher effectiveness did not relate positively to student achievement. Using student assessments for teacher evaluation showed a negative relation (b = -21.199, p < .001), as did tracking of assessments by an administrative authority (b = -7.470, p < .05) and using student assessments for judging teacher effectiveness (b = -8.194, p < .05).

Like the variables in the developmental category, the high-stakes approaches to teacher evaluation also changed when control variables and interaction terms were added successively in models 2 and 3. After controlling for background factors at the student, school, and country levels, public accountability retained a significant and positive relation with student achievement in mathematics, but the coefficient shrank considerably to 9.595 (p < .001). The negative coefficients for the use of student assessments for evaluating teachers (b = -3.403) and tracking of student assessments by an administrative authority (b = -1.707) reduced in size and turned insignificant, with p-values of .136 and .447 respectively. Using student assessments for judging teacher effectiveness turned positive (b = 2.482) and insignificant (p = .411) when background factors were controlled for.

With the further introduction of the interaction terms, and after controlling for the background factors, public accountability persisted as a significant predictor with a coefficient of 9.652 (p < .001). The use of student assessments for evaluating teachers showed a negative but insignificant relation with student achievement in mathematics, with a coefficient of -3.686 (p = .110). Similarly, tracking of student assessments by an administrative authority returned a statistically insignificant negative coefficient of -1.772 (p = .424). Student assessments used for judging teacher effectiveness showed a positive (b = 2.104) but insignificant (p = .489) coefficient.

Model 3 also explored three interaction terms. It explored how a higher importance given to classroom observations in teacher evaluation interacted with informing parents about the progress of their children, and with the principal being able to make changes in teachers’ salaries. These two interactions were cross-level interactions, with one level being the school and the other being the country. The model also analyzed the interaction between principals’ observation of classes and the school being an independent private school. Results show that a higher importance placed on classroom observations in combination with informing parents about the progress of their children carried a significant coefficient of 0.221 (p < .001) after controlling for factors at the student, school, and country levels. Higher importance given to classroom observations in teacher appraisals interacted positively with principals’ authority to make changes in teachers’ salaries, with a coefficient of 0.079 (p = .062). However, given the large size of the sample, this coefficient, which falls short of the 5% level, is treated as statistically insignificant in the context of this study. Principals observing teachers in their classes interacted negatively (b = -7.526) with the school type being independent private, but this relationship was insignificant (p = .216).

Control variables in models 2 and 3 in mathematics. As stated earlier, models 2 and 3 carried control variables at the student, school, and country levels in addition to the main predictors. Table 4.1 gives coefficients for these control variables in mathematics. As expected, all control variables except age behaved as in previous studies (e.g., Demir, Kılıç, & Ünal, 2010; Fuchs & Wößmann, 2007; Zhang & Lee, 2011). Age related negatively to student achievement in mathematics, with coefficients of -11.251 (p < .001) in model 2 and -11.045 (p < .001) in model 3. Age turned negative only when grade and other control variables were added to the model. This anomalous behavior requires further probing of the relationship between age and student achievement. Being a girl appeared to be a disadvantage in mathematics. The negative relationship was consistent across the two models, with coefficients of about -18.5 (p < .001). Grade was positively associated with student achievement, with coefficients of about 31 (p < .001) in the two models. Similarly, socioeconomic status was strongly and positively associated with student achievement across the two models, with coefficients of 23.483 (p < .001) and 23.380 (p < .001) in models 2 and 3 respectively. First and second generation immigrant statuses, as well as a home language other than the test language, showed consistent negative relations with student achievement in mathematics. First generation immigrant status produced coefficients of -30.774 (p < .001) and -30.714 (p < .001) in models 2 and 3 respectively. Second generation immigrant status showed a similar, though somewhat smaller, disadvantage, with coefficients of around -22 (p < .001) in models 2 and 3. A home language different from the test language was associated with about 7 (p < .001) fewer score points in mathematics.
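One way a coefficient can change sign once a correlated control enters the model, as reported above for age and grade, can be shown with a small simulation. The Python sketch below uses entirely synthetic data with hypothetical effect sizes; it is not a reanalysis of the PISA sample and does not establish why age behaves this way in the dissertation's models. In the simulated data, older students tend to be in a higher grade, grade raises scores, and age lowers scores within a grade.

```python
import random

def ols(X, y):
    """OLS coefficients from the normal equations (Gauss-Jordan elimination)."""
    p = len(X[0])
    a = [[sum(r[j] * r[k] for r in X) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(X, y))]
         for j in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    return [a[j][p] / a[j][j] for j in range(p)]

# Synthetic students; all numbers below are hypothetical
random.seed(2)
n = 4000
grade = [random.choice((9, 10)) for _ in range(n)]
age = [15.0 + 0.6 * (g - 9) + random.gauss(0, 0.25) for g in grade]
score = [500 + 30 * (g - 9) - 11 * (a - 15) + random.gauss(0, 30)
         for g, a in zip(grade, age)]

# Predictors are centered (age - 15, grade - 9) to keep the system well conditioned
b_age_only = ols([[1.0, a - 15] for a in age], score)
b_with_grade = ols([[1.0, a - 15, g - 9] for a, g in zip(age, grade)], score)

print("age coefficient without grade:", round(b_age_only[1], 2))
print("age coefficient with grade:   ", round(b_with_grade[1], 2))
```

Without grade in the model, age picks up part of the positive grade effect; once grade is held constant, the remaining variation in age compares older and younger students within the same grade.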

Results on various school attributes were consistent with earlier findings (Demir, Kılıç, & Ünal, 2010; Fuchs & Wößmann, 2007; Zhang & Lee, 2011). Being in a public school appeared as a disadvantage in model 2 (b = -15.906, p < .001) and model 3 (b = -16.843, p < .001). Similarly, having a female principal was associated with lower student achievement, with coefficients of -12.379 (p < .001) and -11.979 (p < .001) in models 2 and 3 respectively. School size showed a negligible association with student achievement: a unit increase in school enrollment was associated with a change of 0.001 score points or less in mathematics, and Table 4.1 reports no significance marker for this coefficient in either model (t = 0.31 and t = 0.45). Teacher shortage showed negative and significant coefficients in both models; a shortage of teachers was associated with a decrease of about 4 (p < .001) score points in student achievement in mathematics. The proportion of qualified teachers was positively associated with achievement, but with insignificant coefficients of 1.784 (p = .592) and 1.664 (p = .617) in models 2 and 3 respectively. This finding, positive though insignificant, somewhat supports the previous evidence on the positive association of teacher quality with student achievement (Darling-Hammond, 2000). However, some evidence also shows that teacher quality in the form of observable characteristics is not associated with higher student achievement (Buddin & Zamarro, 2009; Hanushek et al., 2005; Harris & Sass, 2011). These studies suggest that while teacher quality is an important determinant of student achievement, observable characteristics such as an advanced diploma do not relate positively to student achievement. Thus, this coefficient is somewhat in line with the former evidence (e.g., Darling-Hammond, 2000), suggesting that having an ISCED level 5A qualification is associated positively, though statistically insignificantly, with better student achievement. A higher proportion of girls carried a positive coefficient of 0.166 (p < .01) in both models 2 and 3. This runs counter to the earlier coefficient showing that girls scored lower than boys in mathematics, from which one would expect a higher proportion of girls to relate negatively to student achievement in mathematics. The result may be construed as the outcome of an environment in which boys and girls compete with each other for better scores, resulting in an overall increase in student achievement in mathematics. A higher student-teacher ratio was associated with lower student achievement (b = -0.498 and b = -0.490, p < .001), suggesting a somewhat negative effect of large class sizes. A school with a higher proportion of computers connected to the Internet had higher student achievement, with coefficients of about 18 (p < .001) after controlling for background factors at the student, school, and country levels.

Models 2 and 3 also carried four control variables at the country level in addition to the main predictors and the interactions. Dollars spent on education was associated with a decrease of about 0.4 (p < .1) score points in student achievement for every 100 dollar increase in spending on education per capita income. However, this relationship was not significant within the context of this study. This could be a result of the non-random sample of countries, which were selected on a pre-defined criterion based on their participation in the TALIS 2008 survey. As described in the sections on the variables and data reduction in chapter 3, three country variables were created using information on teacher appraisal and feedback practices from the OECD (2009b) report on TALIS 2008. The first component of teacher evaluation, named “professional outcomes,” was associated negatively with student achievement, with coefficients of -13.499 (p < .001) in model 2 and -14.003 (p < .001) in model 3. This negative association raises important questions and concerns about the use of student test scores and retention and pass rates as evidence of teacher performance in teacher evaluations. The second component, named “others,” showed significant negative coefficients of -5.624 (p < .001) and -5.855 (p < .001) in models 2 and 3 respectively. This negative association may be due to a mismatch between the relative emphases that teachers and evaluators place on these criteria in teacher appraisals. For example, teachers may value their relations with colleagues as a highly important criterion in their appraisals, but principals and other evaluators may hold a different opinion. This difference in the relative importance of teacher evaluation criteria between teachers and evaluators may give rise to conflicts of interest, leading to an overall negative impact on student test scores. However, the underlying dynamics may be much more complex than such a straightforward explanation.

Along with these two components as control variables for teacher evaluation criteria, one component on teacher evaluation outcomes was used as a country variable to control for teacher perspectives on the subject. This component is named “outcomes and impact of teacher evaluation.” It showed a significant positive association with student achievement in mathematics, with coefficients of 3.891 (p < .001) in model 2 and 3.255 (p < .001) in model 3. This positive association suggests a “high-stakes” effect at play in mathematics, where teachers see that their salaries are at stake in their evaluations. It could also mean that better performance in the form of student test scores could secure career advancement and a place in a professional development opportunity, and hence act as a positive incentive for teachers to work harder to produce better student test scores.

Determinants of Student Achievement in Science

As with mathematics, student achievement in science was subjected to the same regression analyses using the three models specified for mathematics. The results are similar across the two subjects, with some exceptions. Table 4.2 gives regression results for all three models in science.

Developmental and high-stakes approaches to teacher evaluation.

Developmental approaches to teacher evaluation behaved much as in mathematics. Principals observing teachers in their classrooms related negatively though insignificantly (b = -1.999, p = .555) to student achievement in science. The coefficient for principals suggesting teachers for improvement was negative and significant (b = -17.897, p < .001), though smaller in magnitude than in mathematics. Similarly, principals informing teachers about possibilities for updating their knowledge and skills also showed a negative relation with student achievement, though with a smaller and insignificant coefficient (b = -7.287, p = .232). The fourth variable, assessments used for improvement in instruction, related positively and significantly to achievement in science with a coefficient of 8.910 (p < .05).

Table 4.2

Determinants of Student Achievement in Science

| | (1) Main predictors | (2) With control variables | (3) With interactions |
|---|---|---|---|
| Developmental (Principals’ pedagogical role and use of student assessments for instructional improvement) | | | |
| Classroom observations by school principal | -1.999 (-0.59) | 0.769 (0.33) | 0.894 (0.38) |
| Principals suggesting teachers for improvement | -17.897*** (-4.65) | -2.348 (-1.08) | -2.658 (-1.22) |
| Principals informing teachers for updating knowledge and skills | -7.287 (-1.20) | -5.444 (-1.82) | -5.390 (-1.81) |
| Assessments used for instructional improvement | 8.910* (2.37) | -0.176 (-0.06) | -0.517 (-0.18) |
| High-Stakes | | | |
| Public accountability for student performance | 16.521*** (5.30) | 8.710*** (4.40) | 8.794*** (4.46) |
| Student assessments used for evaluating teachers | -20.531*** (-6.37) | -3.903 (-1.96) | -4.248* (-2.14) |
| Student assessments tracked by an administrative authority | -5.537 (-1.86) | 0.721 (0.36) | 0.657 (0.33) |
| Student assessments used for judging teacher effectiveness | -6.471 (-1.81) | 2.797 (1.00) | 2.481 (0.88) |
| Interactions | | | |
| Classroom observations given moderate to high importance in teacher evaluation x parents are informed about their children’s progress | | | 0.178** (3.58) |
| Classroom observations given moderate to high importance in teacher evaluation x principal is responsible for making salary changes | | | 0.107** (2.64) |
| Principal observes classes x independent private school | | | -6.023 (-1.20) |
| Student Controls | | | |
| Student age | | -11.384*** (-8.58) | -11.243*** (-8.53) |
| Girl | | -7.257*** (-10.34) | -7.308*** (-10.44) |
| Grade | | 31.963*** (35.82) | 31.828*** (35.36) |
| Index of social, cultural and economic status | | 22.198*** (41.36) | 22.036*** (40.35) |
| First generation immigrant | | -28.570*** (-11.06) | -28.589*** (-11.00) |
| Second generation immigrant | | -21.166*** (-6.07) | -21.220*** (-6.06) |
| Home language other than test language | | -13.620*** (-6.71) | -14.260*** (-7.04) |
| School Controls | | | |
| Principal’s sex (female) | | -7.747*** (-4.03) | -7.330*** (-3.79) |
| Public school | | -15.754*** (-7.07) | -15.492*** (-5.74) |
| School size | | 0.001 (0.62) | 0.001 (0.85) |
| Teacher shortage | | -4.721*** (-5.23) | -4.555*** (-4.97) |
| Proportion of qualified teachers | | 10.431** (3.10) | 10.588** (3.16) |
| Percent girls | | 0.190*** (4.42) | 0.189*** (4.48) |
| Student teacher ratio | | -0.527*** (-5.75) | -0.515*** (-5.66) |
| Proportions of computers connected to Web | | 23.850*** (6.26) | 23.531*** (6.21) |
| Country Controls | | | |
| Professional outcomes (e.g., student test scores, retention and pass rates) as teacher evaluation criteria | | -10.290*** (-15.33) | -10.524*** (-14.51) |
| Others (Feedback from parents, relations with colleagues) as teacher evaluation criteria | | -0.553 (-0.60) | -0.853 (-0.93) |
| Outcomes and impact of teacher evaluation | | 1.613* (2.53) | 0.840 (1.20) |
| Dollars spent on education | | -0.351* (-2.01) | -0.335 (-1.86) |
| _cons | 490.640*** (89.54) | 661.368*** (29.81) | 644.463*** (28.05) |
| N | 210307 | 210307 | 210307 |
| Average R2 | 0.072 | 0.405 | 0.406 |

t statistics in parentheses

* p < 0.05, ** p < 0.01, *** p < 0.001

The change in the main predictors in model 2 was similar to that in mathematics when control variables were introduced into the model specification. In the developmental category of teacher evaluation, the direction of the relationship for principals observing classes shifted to positive but remained insignificant, with a coefficient of 0.769 (p = .741). Principals suggesting teachers for improvement showed a negative and statistically insignificant (b = -2.348, p = .283) association with student achievement in science. Principals informing teachers for updating their knowledge and skills remained negative and insignificant (b = -5.444, p = .072). Furthermore, the use of assessments for instructional improvement turned negative and insignificant (b = -0.176, p = .950). Introduction of interaction terms in model 3 resulted in no major difference in the coefficients of the main predictors in developmental teacher evaluation. Classroom observations by principals remained insignificant (b = 0.894, p = .703). Principals informing teachers for updating their knowledge and skills remained negative (b = -5.390) and statistically insignificant (p = .074). Similarly, principals suggesting teachers for improvement (b = -2.658, p = .225) and using assessments for instructional improvement (b = -0.517, p = .854) showed negative and insignificant associations with student achievement in science.

In the absence of control variables, the high-stakes approaches to teacher evaluation in model 1 behaved much as in mathematics. As Table 4.2 shows, public accountability related positively to student achievement in science with a coefficient of 16.521 (p < .001), which is almost the same as in model 1 in mathematics. The different uses of student assessments showed negative associations with achievement in science. The use of student assessments for teacher evaluation showed a significant negative relation (b = -20.531, p < .001) with student achievement. In contrast, tracking of student assessments by an administrative authority (b = -5.537, p = .067) and using student assessments for judging teacher effectiveness (b = -6.471, p = .074) were statistically insignificant.

Upon introduction of control variables in model 2, public accountability persisted as a significant and positive correlate of student achievement in science and, as in mathematics, had a smaller coefficient of 8.710 (p < .001). A negative and insignificant coefficient was observed for the use of student assessments for evaluating teachers (b = -3.903, p = .054). Tracking of student assessments by an administrative authority (b = 0.721, p = .717) and using student assessments for judging teacher effectiveness (b = 2.797, p = .321) turned positive but remained insignificant.

Interaction terms did not greatly affect the direction or size of any relationship in science. With the introduction of interactions in model 3, public accountability remained significantly related to student achievement in science with a coefficient of 8.794 (p < .001). The use of student assessments for evaluating teachers showed a negative relation with a significant coefficient of -4.248 (p < .05). Administrative tracking of student assessments produced a coefficient of 0.657 (p = .741), and judging teacher effectiveness through student assessments produced a coefficient of 2.481 (p = .381); both were positive but insignificant in model 3.

With regard to the interaction terms, results indicated that the three interaction terms behaved broadly as in model 3 in mathematics. The importance of classroom observations as a criterion in teacher appraisal carried a significant coefficient of 0.178 (p < .01) when interacted with parents being informed about the progress of their children. In a similar vein, higher importance given to classroom observations in teacher evaluation criteria showed a positive interaction with principals’ ability to make changes in teachers’ salaries, with a coefficient of 0.107 (p < .01); unlike in mathematics, this interaction was statistically significant. In contrast, principals’ observation of teachers showed a negative but insignificant interaction with the school type being private, with a coefficient of -6.023 (p = .235).

Control variables in models 2 and 3 in science. Table 4.2 gives coefficients for control variables in science. All control variables behaved in a similar fashion as in mathematics. Coefficient sizes were in the same range as in mathematics, with only a few exceptions. Being a girl remained a disadvantage, but the negative coefficient was considerably smaller than in mathematics. It was consistent across the two models at about -7 (p < .001). First and second generation immigrant statuses, as well as a home language other than the test language, showed consistent negative relations with student achievement in science. A second exception in terms of coefficient size was observed for home language other than the test language. A student who spoke a language at home different from the test language was at a disadvantage, as reflected in a coefficient of -14.260 (p < .001) in model 3. This coefficient was almost twice as large in science as in mathematics.

The school level control variables also behaved similarly as in mathematics with few

exceptions. Having a female as a school principal showed as somewhat less strongly as a

negative association compared to mathematics. The coefficient showed as -7.330 (p < .001)

in model 3 which was about 5 points less than in mathematics. Proportion of qualified

teachers associated positively with student achievement with significant coefficients unlike

95

mathematics where the coefficients were positive but insignificant. This variable associated

with a 10.588 (p < .01) score point increase in individual student test scores in science. A

higher proportion of girls showed a positive coefficient of about 0.19 (p < .001) in both

models. The coefficient for proportion of computers was about 4 points larger than in

mathematics and was significant at the 0.1% level.

In terms of the country control variables, the relationships were similar to those in

mathematics. “Professional outcomes” as a country control variable for teacher evaluation

criteria was negatively associated with student achievement, but with a smaller

coefficient than in mathematics. This coefficient was

-10.290 (p < .001) in model 2 and -10.524 (p < .001) in model 3, about 4 points

smaller in magnitude than the same coefficients in mathematics. The second variable consisting of parental

feedback on teaching and relations with colleagues showed an insignificant (b = -0.853, p =

.357) association with student achievement in science in the model with the control

variables and the interaction terms. The third country control variable “outcomes and impact

of teacher evaluation” resulted in a positive but insignificant association (b = 0.840, p =

.235) with student achievement in science. As in mathematics, dollars spent on education

also showed a negative but insignificant association with student achievement in science, with a

coefficient of -0.335 (p = .066).

Determinants of Student Achievement in Reading

The only stark difference in terms of model specifications between reading and the

prior two subjects was the inclusion of a set of main predictors covering teacher monitoring

in test language. Given the additional emphasis placed in the PISA 2009 survey on reading,

the first model in reading included four additional variables on practices in teacher

monitoring in test language. Regression analyses in reading were similar to those in mathematics

and science, with some exceptions. Table 4.3 gives regression results for all three

models in reading.

Developmental and high-stakes approaches to teacher evaluation. The first block

in Table 4.3 shows four approaches to gathering and using evidence in monitoring teachers

in test language. These approaches included student achievement, peer reviews, principal

and staff observations, and observations by an external authority.

Results indicated that using student achievement as evidence of teacher

performance in teacher monitoring established a positive relation with student achievement

in reading. The significance level of the coefficient became progressively stronger across the

three models. However, the size of the coefficient dropped slightly between models 1 and 3

as control variables and interaction terms were introduced in successive models. In model 1

without control variables, it showed a value of 8.411 (p < .05). This value dropped to 7.455

(p < .01) when background factors were controlled in model 2. The coefficient registered a

further marginal drop in model 3 when interactions were introduced (b = 7.421, p < .01) into

the model. Similarly, teacher peer reviews also carried a positive relation with student

achievement in reading with coefficients of 6.841 (p = .065), 2.549 (p = .392), and 2.603 (p

= .388) in models 1, 2, and 3 respectively. This relationship showed as statistically

insignificant across the three models. Principal and staff observations showed a large

positive relationship with student achievement in reading (b = 13.478, p < .001),

significant at the 0.1% α-level, in the absence of control variables and interactions. With the

introduction of control variables in model 2, the coefficient dropped considerably in size

to 6.931 (p < .01). When the interaction

terms were added in model 3, the coefficient remained significant, but only

at the 5% α-level (b = 6.601, p < .05). Observations by an external

authority showed no significant relationship with student achievement in reading after taking

into account other factors at the student, school, and country levels and introducing

the interaction terms into the model. In the final model, it produced a coefficient of 1.752 (p

= .401).

Table 4.3

Determinants of Student Achievement in Reading

| Variable | (1) Main predictors | (2) With control variables | (3) With interactions |
|---|---|---|---|
| Developmental | | | |
| Teacher evaluation in test language | | | |
| Student achievement | 8.411* (2.40) | 7.455** (3.23) | 7.421** (3.21) |
| Peer reviews | 6.841 (1.87) | 2.549 (0.86) | 2.603 (0.87) |
| Principal and staff observations | 13.478*** (3.81) | 6.931** (2.62) | 6.601* (2.50) |
| Observations by external authority | -3.761 (-1.21) | 1.771 (0.85) | 1.752 (0.84) |
| Principals’ pedagogical role and use of student assessments for instructional improvement | | | |
| Classroom observations by school principal | -8.411* (-2.56) | -2.443 (-1.02) | -2.242 (-0.92) |
| Principals suggesting teachers for improvement | -13.259*** (-3.58) | -0.962 (-0.43) | -1.160 (-0.52) |
| Principals informing teachers for updating knowledge and skills | -6.612 (-1.15) | -5.310 (-1.92) | -5.288 (-1.92) |
| Assessments used for instructional improvement | 5.106 (1.31) | -1.316 (-0.43) | -1.538 (-0.51) |
| High-Stakes | | | |
| Public accountability for student performance | 14.832*** (4.83) | 8.621*** (4.31) | 8.690*** (4.35) |
| Student assessments used for evaluating teachers | -21.498*** (-6.30) | -4.800* (-2.27) | -5.012* (-2.37) |
| Student assessments tracked by an administrative authority | -7.226* (-2.44) | -1.467 (-0.72) | -1.483 (-0.73) |
| Student assessments used for judging teacher effectiveness | -5.966 (-1.73) | 0.406 (0.16) | 0.211 (0.08) |
| Interactions | | | |
| Classroom observations given moderate to high importance in teacher evaluation x parents are informed about their children’s progress | | | 0.121** (2.47) |
| Classroom observations given moderate to high importance in teacher evaluation x principal is responsible for making salary changes | | | 0.078* (2.15) |
| Principal observes classes x independent private school | | | -4.399 (-0.90) |
| Student Controls | | | |
| Student age | | -11.669*** (-7.77) | -11.595*** (-7.77) |
| Girl | | 26.132*** (39.37) | 26.095*** (39.22) |
| Grade | | 37.181*** (37.90) | 37.074*** (37.63) |
| Index of social, cultural and economic status | | 22.223*** (39.34) | 22.109*** (38.85) |
| First generation immigrant | | -29.860*** (-11.33) | -29.904*** (-11.33) |
| Second generation immigrant | | -24.622*** (-7.05) | -24.655*** (-7.04) |
| Home language other than test language | | -17.143*** (-8.86) | -17.585*** (-9.20) |
| School Controls | | | |
| Principal’s sex (female) | | -7.368*** (-3.76) | -7.089*** (-3.61) |
| Public school | | -16.885*** (-7.44) | -16.705*** (-5.87) |
| School size | | 0.004** (2.97) | 0.004** (3.14) |
| Teacher shortage | | -3.474*** (-4.09) | -3.340*** (-3.91) |
| Proportion of qualified teachers | | 6.395* (2.12) | 6.516* (2.17) |
| Percent girls | | 0.226*** (5.17) | 0.225*** (5.21) |
| Student teacher ratio | | -0.448*** (-5.61) | -0.440*** (-5.53) |
| Proportions of computers connected to Web | | 20.880*** (5.25) | 20.658*** (5.22) |
| Country Controls | | | |
| Professional outcomes (e.g., relations with students, parental feedback) as teacher evaluation criteria | | -7.332*** (-10.72) | -7.511*** (-10.46) |
| Others (Feedback from parents, relations with colleagues) as teacher evaluation criteria | | -4.346*** (-4.22) | -4.559*** (-4.45) |
| Outcomes and impact of teacher evaluation | | -0.770 (-1.18) | -1.296* (-1.87) |
| Dollars spent on education | | -0.401* (-2.09) | -0.393* (-1.98) |
| _cons | 479.500*** (86.79) | 646.321*** (26.21) | 635.225*** (25.11) |
| N | 210307 | 210307 | 210307 |
| Average R² | 0.078 | 0.419 | 0.420 |

t statistics in parentheses. * p < 0.05, ** p < 0.01, *** p < 0.001
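The link between the t statistics printed in Table 4.3 and the p-values quoted in the text can be illustrated with a short sketch. It assumes a standard normal reference distribution, which is only an approximation: the study's own p-values may rest on a different degrees-of-freedom correction, and the printed t statistics are rounded to two decimals. The function names `two_sided_p` and `stars` are hypothetical helpers introduced here for illustration.

```python
import math


def two_sided_p(t_stat: float) -> float:
    """Two-sided p-value for a t statistic under a standard normal approximation."""
    # erfc(|t| / sqrt(2)) equals 2 * (1 - Phi(|t|)) for the standard normal CDF Phi.
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


def stars(p: float) -> str:
    """Significance stars following the convention in the note to Table 4.3."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Model 3 t statistics as printed in Table 4.3 (rounded to two decimals).
model3_t = {
    "Observations by external authority": 0.84,
    "Peer reviews": 0.87,
    "Principal and staff observations": 2.50,
    "Public accountability for student performance": 4.35,
}

for name, t in model3_t.items():
    p = two_sided_p(t)
    print(f"{name}: t = {t:.2f}, approx. p = {p:.3f} {stars(p)}")
```

Under this approximation, the t statistic of 0.84 for observations by an external authority corresponds to a p-value of roughly .40, in line with the value reported above. Small discrepancies for other rows are to be expected given the rounding of the printed t statistics.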

Developmental approaches to teacher evaluation behaved much as they did in

mathematics and science. All variables showed insignificant negative associations with

student achievement in reading in models 2 and 3. In the high-stakes approaches to teacher

evaluation, public accountability related positively with student achievement with a

coefficient of 14.832 (p < .001) without control variables in model 1. As in mathematics and

science, it registered a consistent positive coefficient across models 2 and 3 when

background factors were taken into account. In model 3 it reported a coefficient of 8.690 (p

< .001) after controlling for student, school, and country factors and with the inclusion of

the interactions. Similarly, different uses of student assessments resulted in negative

correlations with achievement in reading at different significance levels. Using student

assessment for teacher evaluation developed a negative association with student

achievement in reading in all three models. In model 3, it showed a coefficient of -5.012

(p < .05). Administrative tracking of student assessments also showed a negative coefficient

in all three models but it remained insignificant with a p-value of .469 in the presence of

control variables and interaction terms in the model. The use of student assessments for

judging teacher effectiveness also remained statistically insignificant in model 3.

Relationships between the interaction terms and student achievement in reading

remained similar to those in mathematics and science. Classroom observations being important as

criteria in teacher appraisal interacted positively with parents being informed about the

progress of their children. This interaction showed a significant coefficient of 0.121 (p <

.01). Similarly, classroom observations given moderate to high

importance in teacher evaluation criteria interacted positively (b = 0.078, p <

.05) with principals’ authority in making changes in teachers’ salaries in the reading

model. As in mathematics and science, the third interaction

term consisting of principals’ observation of classes and school type being private returned

an insignificant negative association.

Control variables in models 2 and 3 in reading. Table 4.3 gives coefficients

for the control variables in reading. All control variables at different levels behaved in a

similar fashion as in mathematics and science with coefficient sizes in similar ranges and

directions, with only a few exceptions. One exception was student sex, where being a

girl turned into a large advantage, with strong positive coefficients in both models 2 and 3.

In model 3, the coefficient was 26.095 (p < .001). Student grade also showed a

considerably larger association with achievement in reading compared to science and

mathematics. The coefficient for grade (b = 37.074, p < .001) in reading was about 6 points

higher than in mathematics and science. Immigration statuses and home language being

other than test language returned results similar to those in mathematics and science. Student

achievement related negatively to these student attributes, with large negative

coefficients significant at the 0.1% level for all three variables.

Among school-level control variables, having a female principal persisted as a disadvantage for

students with a coefficient of -7.089 (p < .001) in model 3. Other school characteristics such

as proportion of qualified teachers, proportion of girls, student teacher ratio, and proportion

of computers connected to the Internet all behaved much as in mathematics and

science with slight variations in the sizes of the coefficients.

Country control variables also followed patterns similar to those in mathematics and

science. For example, dollars spent on education returned a coefficient of

-0.393 (p < .05) in model 3. However, unlike in mathematics and science, this association

was significant at the 5% level. Similarly, “professional outcomes” showed a

coefficient of -7.332 (p < .001) in model 2 and -7.511 (p < .001) in model 3. As in

mathematics, the second component, “others,” as a country control variable was negative

and significant in models 2 and 3. However, unlike in mathematics and science, “outcomes and

impact of teacher evaluation” as a country variable was negative (b = -1.296, p < .05)

and significant at the 5% α-level in model 3.

In conclusion, these findings have implications for the hypotheses that this study

aimed to examine. The study did not find sufficient evidence to clearly refute any of the

three null hypotheses. As findings show, hypothesis 1 received mixed results.

Developmental approaches to teacher evaluation that included principals’ observation of

classes, principals suggesting teachers for improvement, principals informing teachers about

possibilities for updating their knowledge, and the use of student assessments for

instructional improvement did not relate significantly with student achievement in all three

subjects. At the same time, use of student achievement and principal and staff observations

to monitor teachers in test language were positively associated with student

achievement in reading.

Second, with regard to hypothesis 2, only public accountability showed a significant

positive association with student achievement in all three subjects after controlling for

socioeconomic and other background factors. The other three approaches to teacher

evaluation in this category did not relate significantly with student achievement in

mathematics. In science and reading, using student assessments for evaluating teachers

showed a significant negative association with student achievement at 5% significance level.

Similarly, hypothesis 3 is also only partially supported by the findings of the study.

Observations of classes being important in teacher appraisals interacted

positively and significantly with parents being informed of the progress of their children

across all three subjects. In contrast, in reading and science only, classroom observations

being important in teacher appraisals interacted positively and significantly with principals’

authority in making changes in teachers’ salaries. This interaction remained statistically

insignificant in mathematics. Classroom observations by principals showed an insignificant

negative interaction with school type being private across all three subjects.

Chapter 5. DISCUSSION, IMPLICATIONS, AND CONCLUSIONS

The issue of how best to monitor and evaluate teachers to achieve optimal student

learning outcomes for all students has been a subject of sustained and heated debates and

policy-making around the world. The intensity of such debates only gains momentum when

conflicting evidence on alternative approaches to monitor and evaluate teachers comes forth

in different studies and from varied schools of thought. For example, objective measures of

teacher performance in the form of summative evidence such as student test scores in VAM

approaches are considered to offer better tradeoffs in terms of their objectivity (Goldhaber &

Hansen, 2010; Sanders & Horn, 1994; Stronge & Tucker, 2000). In contrast, evidence

on the subjective as well as standardized approaches to measuring teacher effectiveness

highlights the importance of quality of classroom processes as plausible measures of teacher

performance in schools (Kimball et al., 2004; Holtzapple, 2003; Milanowski, 2004; Sartain

et al., 2011; Wenglinsky, 2002; White, 2004). This latter body of evidence proposes that the

developmental approaches to monitoring and evaluation enable educators to deeply reflect

on their practice, identify areas that need improvement, and hence produce better learning

outcomes for students. This study has only added to the growing body of evidence on the

subject without presenting any conclusive standpoint on the efficacy of either approach to

monitoring and evaluating teachers.

In this chapter, I will discuss the findings of the study in the light of prior evidence. I

will also discuss policy implications for current debates on alternative forms of teacher

monitoring and evaluation, explain limitations of the study, and present recommendations

for further research.

Before discussing the findings of the study, it is important to state at the outset

some limitations of the study to enable readers to make meaningful generalizations

beyond the scope of this study. While I will discuss these limitations in detail near the end

of the chapter, I will briefly mention here that there are two major limitations of the study.

First, the study looks at student achievement only in the form of student test scores as

reflected in the PISA tests on reading, mathematics, and science. Since student learning is

an all-encompassing concept, student achievement only in the form of student test scores

presents a limited view of student learning. Student test scores preclude a holistic view of

student learning by focusing only on the cognitive domains and ignoring others such as

social and emotional domains. Therefore, any relation between student achievement and

teacher monitoring and evaluation practices and purposes should be looked at only in the

form of student test scores as reflected in the PISA tests in the three subjects of

mathematics, science, and reading. The second major limitation of the study stems from the

study models. The study has explored relationships between student achievement and

teacher monitoring and evaluation purposes and practices in a pooled sample of 21 countries

at one level, that of the student. It should be noted that using an aggregate sample may be problematic

given the complex sampling structure in the PISA 2009 survey. The one-level model

explores variation only among students and overlooks variation among schools and

countries. It is with these major limitations that this study hopes to add to the increasing

evidence on the subject by discussing the pay-offs of alternative forms of teacher evaluation

in cross-national perspectives.
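To make the second limitation concrete, the contrast between the pooled one-level specification and a multilevel alternative can be sketched as follows. The notation is illustrative and is not taken from the study's own model statement: $y_{isc}$ stands for the test score of student $i$ in school $s$ and country $c$, $\mathbf{x}_{isc}$ for the predictors and controls, and $u_{sc}$ and $v_c$ for hypothetical school and country random effects.

```latex
% One-level (pooled) specification: a single student-level error term
y_{isc} = \beta_0 + \mathbf{x}_{isc}^{\top}\boldsymbol{\beta} + \varepsilon_{isc}

% Three-level alternative: separate country and school components
y_{isc} = \beta_0 + \mathbf{x}_{isc}^{\top}\boldsymbol{\beta} + v_c + u_{sc} + \varepsilon_{isc},
\qquad v_c \sim N(0, \sigma_v^2),\quad u_{sc} \sim N(0, \sigma_u^2),\quad
\varepsilon_{isc} \sim N(0, \sigma_\varepsilon^2)
```

In the first form, any variation shared by students in the same school or country is absorbed into $\varepsilon_{isc}$, which is the sense in which the one-level model overlooks variation among schools and countries.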

Developmental Approaches to Teacher Evaluation

This study examined teacher monitoring and evaluation with developmental

purposes in three main dimensions. First, it analyzed how teacher monitoring in test

language relates to student achievement in reading. Second, it explored how a principal’s

evaluative focus as a pedagogical leader relates to student achievement in the three subjects.

Third, the study looked at how the use of student assessments for instructional improvement

relates to student achievement in the three subjects.

Monitoring in test language. This study analyzed monitoring practices and their

relations with student achievement in reading as a developmental approach in the larger

framework of teacher evaluation. As defined in the introductory chapter, monitoring is an

on-going process of collecting and analyzing information to assess the progress being made

towards set goals and objectives and to take remedial steps. It also allows the stakeholders to

see how best to optimize progress towards achieving set goals and objectives.

The PISA 2009 survey asked school principals if teachers in test language were monitored

using student achievement, peer reviews, principal and staff observations, and observations

by an external authority. The findings showed that the use of student achievement and

observations by principals and staff for monitoring purposes in test language related

significantly and positively with student achievement. Peer reviews and observations by an

external authority remained positive but statistically insignificant when controlled for other

factors at the student, school, and country levels.

The positive relations between student achievement and the two approaches to

monitoring teachers in test language—student achievement and principal and staff

observations—show developmental utility of at least some of the monitoring activities in

reading. These findings are in line with the body of literature that emphasizes the importance

of developmental approaches to teacher monitoring and evaluation (e.g., Rockoff &

Speroni, 2010; Sartain et al., 2011; Taylor & Tyler, 2011; Wenglinsky, 2002). The positive

associations in this study suggest that if the information obtained through monitoring activities

is utilized for informing teacher practice and making necessary adjustments in instructional

strategies, the implications for student achievement become significantly positive. This

takes the discussion back to the UNDP’s emphasis on the ‘feedback’ aspect of monitoring.

The UNDP states that through ‘monitoring’ activities, stakeholders, who in this case are

teachers, receive regular feedback on their practice so as to align their efforts to achieve

their teaching goals and objectives. A positive relation of monitoring practices with student

achievement shows that, at least in reading, teachers and principals are able to effectively

respond to the questions, “Are we taking the actions we said we would take?” (UNDP,

2002, p. 8), and “Are we making progress on achieving the results that we said we wanted to

achieve?” This on-going analysis of teacher practice and student learning provides teachers,

principals, and students the opportunity to reflect upon the outcomes and make necessary

adjustments in strategies accordingly. It means that principals and schools, who set a

developmental objective in monitoring the practice of teachers in test language, and for that

matter in any subject, may experience improved student achievement by identifying and

working on aspects of instruction that need to be improved. In this sense, monitoring should

not aim at penalizing any teacher for a lack of ability to show better student test scores. It

should aim at enabling teachers to identify their professional skills that need improvement

and also to identify individual student learning needs that teachers would want to address in

their pedagogical approaches. Such approaches to teacher monitoring will not only create an

environment for reflection and collaboration but will also lead to an improvement in the

overall instructional quality in schools and hence improved student learning including

improved student test scores.

Principals’ pedagogical role. Principals have a central role in ensuring quality in

teaching in schools. Principals fulfill this role through their pedagogical leadership wherein

they work with teachers to improve instructional environments in schools. Principals apply

various tools and strategies to meet this objective of improving teacher quality. Teacher

evaluation for developmental purposes is one approach that principals adopt to improve the

quality of their teaching workforce. In the words of Bossert, Dwyer, Rowan, and Lee

(1982), “One instructional management strategy that a principal can use…is to work directly

with a teacher in order to analyze classroom problems and prescribe specific changes in

features of the instructional organization that will improve student learning” (p. 41).

Furthermore, as per the findings in a recent study conducted by Donaldson (2011), teacher

evaluation seemed to be the only tool through which principals can identify strengths and

weaknesses in teaching and accordingly plan for improving instruction in classes. It was

with this perspective that in the study models employed here, three variables with a

developmental focus in teacher evaluations consisted of principal’s pedagogical role of

observing teachers, suggesting teachers for professional improvement, and informing

teachers about the opportunities for updating their knowledge and skills. These roles of the

principals can be construed as ‘evaluative’ in the sense that they set forth for the principal

the logic of evaluating teachers and taking remedial steps to improve their practice.

There are consistent results across the models and subjects that the principals’

observation of teachers in classes and principals suggesting teachers how to improve their

practice bear insignificant relationships with student achievement in mathematics, science,

and reading. Principals informing teachers about possibilities for updating their knowledge

and skills also bear insignificant relationships with student achievement in reading and

science but a significant negative relationship in mathematics. This is rather counter-intuitive

given the evidence from earlier studies where classroom observations and other

standardized as well as subjective approaches to assessing teachers were found to have

positive associations with student achievement (e.g., Rockoff & Speroni, 2010; Sartain et

al., 2011; Taylor & Tyler, 2011; Tyler et al., 2010).

This anomaly can be looked at from different dimensions. It needs to be noted that

the data sources in my study were different from most of the cited evidence. None of these

studies used the PISA 2009 dataset to explore the relationships between principals’

pedagogical roles and student achievement. Even the studies that used the previous PISA

datasets did not specifically explore principals’ pedagogical roles related to teacher

monitoring and evaluation. For example, Sartain et al. (2011) used data from a pilot

program named Chicago’s Excellence in Teaching, a program launched in 2008. The

specific purpose of this program was to improve instructional quality through a process of

evaluating teachers followed by feedback and a program for teachers’ professional

development. Other studies (Taylor & Tyler, 2011; Tyler et al., 2010) also used evidence

from teacher evaluation programs that were in place and functioning in different educational

settings. Thus, the educational settings and hence the nature of the data in my study were not

the same as other cited studies. Therefore, it can be expected that findings in my study may

or may not concur with the prior evidence. But, what does this anomaly between findings in

my study and prior evidence suggest about principals’ pedagogical role in appraising teachers

with a developmental focus?

First, the insignificance of the results may be due to insufficient variation in the

frequency with which principals observed teachers in classes. This could also mean that

principals’ approaches to observing teachers and associated activities in the larger teacher

evaluation frameworks in schools may need careful planning and preparation. In the studies

cited above, classroom observations as instruments were used by evaluators including

principals in a highly structured fashion. In some studies, these evaluators had undergone

intensive training in conducting evaluations before actually doing any teacher evaluations

(e.g., in Taylor & Tyler, 2011). This means that an effective teacher evaluation conducted

for developmental purposes needs to establish clear standards and procedures as well as

rigorously trained evaluators (principals in the context of this discussion) in order to identify

nuances of teaching quality that are important in improving teachers’ professional practice.

In this sense, school principals must use classroom observations in a professional manner

such that they are able to work with individual teachers in their classrooms, identify their

professional development needs, and develop and implement plans that will help teachers

improve their practice.

Second, the anomaly also points towards the complex world of principals who are in

the midst of a plethora of tasks that they are supposed to carry out on a daily basis as school

leaders. In other words, in the real world of schooling, principals are not normally able to

spend time with every teacher in the classroom and give feedback and follow up on

professional improvement of individual teachers (Bossert, Dwyer, Rowan, & Lee, 1982). If

we look at the descriptive statistics on variables that capture principals’ pedagogical role in

schools, it appears that principals are not able to spend sufficient time with all teachers in

classrooms in the pooled sample of 21 countries. This is reflected in about 41% of students

enrolled in schools where principals only “seldom” observe teachers in classes. In contrast,

the majority of the principals appear to be suggesting teachers for professional improvement

and informing teachers about possibilities for updating their knowledge and skills. This is

reflected in 75% of students enrolled in schools where principals reported that they suggested

teachers for improvement and over 90% of students enrolled in schools where principals

informed teachers about possibilities for updating their knowledge and skills. Thus, while a

large majority of principals were suggesting teachers for improvement and informing them

about possibilities for updating their knowledge and skills, they seemed to do so with a

limited amount of time that they spent with teachers in classrooms as around 50% of

students were enrolled in schools where principals seldom or never observed teachers in

classes. However, because principals are heavily loaded with many of their administrative

and other tasks in schools (Bossert, Dwyer, Rowan & Lee, 1982), finding quality time to

spend with individual teachers in classes appears to be a challenge for principals across the

pooled sample of 21 countries. While all these explanations are at best speculative in nature,

the insignificant findings in this category suggest that principals’ pedagogical role as related

to teacher evaluation is not associated with improved student achievement in the form of

student test scores.

Use of student assessment for instructional improvement. Using student

assessments for instructional improvement as part of the developmental approaches to

teacher evaluation was found to be insignificant in all three subjects. This finding runs

counter to the evidence from earlier empirical studies where positive correlations have been

found between the use of student assessments for instructional improvement and student

achievement. This insignificance of the use of student data for instructional improvement

and hence student achievement may result from many possible scenarios. First, there may

just not be enough variation associated with this practice and hence insignificant results.

This could also mean that in the pooled sample of 21 countries, the use of student

information for the purposes of improving instructional practice may be suffering from

issues of focus across schools.

A discussion of focus on using student assessments for improving instruction takes

us back to Wenglinsky (2002) who emphasized the importance of classroom dynamics in

relation to student learning and achievement. Wenglinsky (2002) found that teaching quality

was as strong a factor for student achievement as any other important school level factor.

Furthermore, using authentic student assessment can uncover dynamics in student learning

that are important for developing higher-order critical thinking approaches among students.

Information from such authentic assessments can be used in a deliberative fashion where

teachers are engaged in reflection and collaboration, as was the case in the study by

Wayman and Stringfield (2006), leading to improved student learning outcomes. Wayman

and Stringfield (2006) suggest a holistic focus of the use of student data where the purpose

is to explore and improve factors in classroom practices and processes that aim at promoting

higher order thinking skills of students.

Since a large majority (83.25%) of students was enrolled in schools where student

assessments (standardized tests, teacher developed tests, teachers’ judgmental ratings,

student portfolios, and student assignments/project/homework) were used for the purposes

of improving aspects of instruction and curriculum, an insignificant relation between this

practice and student achievement may be attributed to a narrow focus on the objectives of

this practice in these schools. A ‘teaching to the test’ effect may also come into play when

high-stakes are attached to student assessments. In order to maneuver around those high-

stakes, schools and teachers may resort to practices that result in narrowing of the

curriculum (e.g., Berliner, 2011; Crocco & Costigan, 2007; Klein, Hamilton, McCaffrey, &

Stecher, 2000; Jerald, 2006; Koretz, 2002, 2008; Linn, 2000; Menken, 2006; Reid, 2012). If

the schools have a narrow focus on just improving student test scores in a given

subject on a regional, state or national test, this may have negative implications for

student learning when the outcome variable in my study—standardized PISA tests with a

holistic approach to assessment of student learning—demands a holistic coverage of the

taught content and student learning objectives. The PISA is comprehensive in its coverage

of the taught curriculum where the purpose is to assess if a student is able to apply learning

in his/her real life. The negative and insignificant findings hint at the possibility that schools

are using student information including achievement data to prepare them for specific

content, test format, or both.

The issue could also be a result of lack of training of teachers in the proper use of

student data for making instructional decisions. Sharkey and Murnane (2005) highlight the

need to develop necessary skills in meaningful handling of the student data. They emphasize

that such use of the student data should focus on meaningful, correct assessment, and

understanding of the data. Teachers should be able to effectively use technological support,

participate in group conversations and collaborate on tackling sensitive issues and adopt a

developmental approach to the use of student data. In order for schools to be able to make

effective and constructive use of data to improve instruction, Sharkey and Murnane (2005)

argue that the authorities in central offices need to play a major role in providing necessary

support and training to teachers and schools. Furthermore, in their discussion of effective

and efficient use of the student data, Boudett, Murnane, City, and Moody (2005) suggest

that teachers should be able to “1) identify patterns in data, 2) choose pattern to explore, 3)

dig deeper, 4) agree on problem, 5) ask why, 6) examine current practices, 7) develop action

plan, 8) implement action plan, and 9) assess action plan” (Boudett, Murnane, City, &

Moody, 2005, p. 701). They further suggest using technological tools to assist educators in

thinking “…about data in structured ways…” and collaborate on action plans based on the

findings from the data. Thus, Boudett and colleagues emphasize enabling educators to

get the real feel of the structure of the data, contextualize the information that the data

carries, and think and collaborate to come up with effective instructional alternatives. They

stress the importance of generating deep conversations among colleagues in ways that

such conversations lead to improved instructional effectiveness for all teachers. Seen in this

perspective, the use of student data for instructional improvement should not just focus on

how best to teach curriculum for better student test scores. Such use of student data should

also involve critical analysis of student performance to identify student learning needs and

develop strategies to meet those needs through careful planning of instruction.

Feldman and Tung (2001), who studied six schools that practiced Data Based Inquiry

and Decision Making (DBDM), found that teachers in these schools used student results to

reflect upon their practices and make recommendations to the wider school on how to

improve student performance. The reflective and collaborative culture associated with the

use of student data was important for improving teachers’ professional competencies and

skills and hence carried potential for improving student achievement in the six schools that

they studied. Similarly, Wayman and Stringfield (2006) explored the use of technology in

making sense of student data and found that the use of student data through efficient

technology and proper administrative support resulted in teacher collaboration and improved

classroom practices. Such studies indicate the instructional utility of the developmental use

of student assessments. It can be proposed based on these prior studies that the use of

student data for improving instructional quality in schools should lead to improved student

achievement. The insignificance of the findings in this study may stem from the one-level

analysis while the variation may become significant only when multi-level analyses are

carried out.
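
The contrast between one-level and multi-level analyses can be illustrated with a minimal sketch. It uses hypothetical toy data (three invented countries, an invented `practice` score, and an assumed within-country effect of 5 points), not the PISA sample, and it substitutes a simple within-country (group-demeaned) regression for a full multi-level model. Under these assumptions, a pooled one-level slope can hide a relationship that becomes visible once country membership is taken into account.

```python
from collections import defaultdict


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def demean_by_group(values, groups):
    """Subtract each group's mean from its members (the 'within' transformation)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        totals[group] += value
        counts[group] += 1
    return [value - totals[group] / counts[group]
            for value, group in zip(values, groups)]


# Hypothetical toy data: three countries with three schools each.
country = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
practice = [0, 1, 2, 2, 3, 4, 4, 5, 6]          # invented intensity of data use
baseline = {"A": 520.0, "B": 500.0, "C": 480.0}  # invented country baselines
assumed_effect = 5.0                             # assumed within-country effect

score = [baseline[c] + assumed_effect * p for c, p in zip(country, practice)]

# One-level analysis: ignore country membership entirely.
pooled_slope = ols_slope(practice, score)

# Country-aware analysis: compare schools only with others in the same country.
within_slope = ols_slope(demean_by_group(practice, country),
                         demean_by_group(score, country))

print("pooled slope:", pooled_slope)
print("within-country slope:", within_slope)
```

In this toy case the pooled slope is negative even though the relationship is positive inside every country, because the countries with higher use of the practice were assigned lower baselines. Whether anything similar holds in the actual data is an open question that only a multi-level analysis of those data could answer.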

High-Stakes Approaches to Teacher Evaluation

This study explored relations between high-stakes approaches to evaluating teachers

and student achievement. As explained in the chapter on literature review, some of the

frequently used approaches in high-stakes teacher evaluation systems are public

accountability, use of student assessments to evaluate and judge teachers, and tracking of

student assessments by an administrative authority.

Public accountability. Public disclosure of teacher performance in the form of

student test scores and control through government is considered to be public accountability

with high-stakes outcomes. In this approach, student performance is shared with students,

teachers, administrators, parents, and the larger public through various means (Hooge,

Burns, Wilkoszewski, & Harald, 2012). Results in this study indicate that public

accountability related positively and significantly with student achievement in all three

subjects. These results are consistent with findings in prior studies that explored relations

between student achievement and accountability (e.g., Hanushek & Raymond, 2005; Jürges,

Richter, & Schneider, 2005; Levacic, 2004; West & Peterson, 2006).

According to Hanushek and Raymond (2005) public accountability just through

posting of student achievement data for public use without any consequence attached does

not yield improved student achievement. Based on their findings, they posit that

accountability may lead to improved student achievement when a high-stakes outcome is

attached to the process. Similarly, Jürges, Richter, and Schneider (2005) in their

comparative study of the states in Germany found that using Central Exit Exams (CEEs) for

benchmarking purposes raised student achievement. They attributed this rise in student

achievement to non-monetary aspects of public accountability where teachers put in extra

effort to safeguard their reputation. Thus, findings in my study highlight the aspects of

public accountability wherein different high stakes, such as a change in salary or

professional reputation, become important extrinsic motivating factors that lead teachers

to put in additional effort to raise student achievement in the form of student test

scores.

However, this positive relation of public accountability with high-stakes purposes

needs to be interpreted with caution since it also entails other consequences that are

unintended and in some instances detrimental to the overall educational goals of schools.

The unintended consequences may come in the form of dissipation of teacher morale and

deterioration of a culture of collaboration among teachers (Farrell & Morris, 2004), a

narrowing of focus in content and curriculum (Berliner, 2011), and harmful effects such as

dropouts for students particularly from disadvantaged backgrounds (McNeil & Valenzuela,

2001; McNeil, Coppola, Radigan, & Vasquez Heilig, 2008).

Use of student assessments to evaluate and judge teachers, and administrative

tracking. The findings show that the use of student assessments to evaluate and judge

teachers and administrative tracking of student assessments bear overall negative though in

most cases insignificant relationships with student achievement in the three subjects. In the

case of science and reading, the use of student assessments for evaluating teachers showed a

significant negative association with student achievement.

These results are largely consistent with findings from OECD (2010a) that showed

that internal student assessments carried out by schools did not bear a discernible connection

with student achievement. OECD (2010a) found that tracking of achievement data by an

administrative authority led to a statistically insignificant and negative change in score (Δ=

-1.4) in reading performance when there were no control variables in the models. The

change in score turned positive but still remained statistically insignificant when additional

measures were placed in the models at the student, schools, and country levels.

In contrast, my study’s findings conflict with what Schütz et al. (2007) and

Wößmann et al. (2007) found in their analyses. They found that achievement tracked by

administrative authority remained positive and statistically significant in their analyses.

These researchers, however, used the PISA 2003 dataset to explore cross-country variations

with multilevel weighted least square regressions in their analyses. This difference in

models points towards the possibility that the tracking of student assessments by an

administrative authority as a national policy may have a fixed effect on all schools within

the country which is not strong enough to be observed in student level analyses. The

variance appears to become effective and significant only when multilevel modeling

approaches are used to analyze the PISA dataset which is multistage and complex in nature.

All in all, results in this study challenge the proposition wherein student test scores

are offered as effective measures of teacher performance in high-stakes teacher evaluation

systems (e.g., Goldhaber & Hansen, 2010; Sanders & Horn, 1994; Stronge & Tucker, 2000;

Wright et al., 1997). In essence, the overall findings of the study on the use of student

assessments in teacher evaluations are in line with the assertions from scholars who caution

over using student assessments as the sole measures of teacher performance (e.g., Baker et

al., 2010; Darling-Hammond, Amrein-Beardsley, Haertel, & Rothstein, 2012; Mathis, 2012;

Rosenkvist, 2010). The findings also raise flags in terms of negative consequences of high-

stakes teacher evaluations. Suen and Yu (2006) in their analysis of the Chinese examination

system Keju emphatically assert that any testing and assessment system with high-stakes

attached to it will lead to “social consequences” that will render validity of such measures

problematic. The social consequences, which could be long lasting, include “rote

memorization,” “cheating,” “focus on test-taking skills,” and “psychopathological effects on

the examinee.” While they studied these social consequences from the perspective of students

subjected to high-stakes testing, the “social consequences” could well apply to teachers,

who may experience similar consequences as a result of the high stakes in

their evaluations. Suen and Yu (2006) suggest that the issues of assessment validity

emanating from these social consequences can most effectively be addressed by detaching

some of the high stakes from testing and assessments. The same suggestion may hold true

for teacher evaluation with high-stakes as well. Thus, it can be suggested that using student

assessments for evaluating and judging teacher effectiveness is a strategy with potential

negative fall-outs for teachers’ practice and student learning.

Interactions

This study looked at three interactions as products of school level factors and teacher

evaluation practices. The study hypothesized that classroom observations should become

important and significant when parents are informed about the progress of their children,

when principals have authority to make salary changes, and when the school is private. I will

discuss here only two interactions that appeared significant in the analyses.

Though the classroom observations by principals as a main predictor showed a

negative association with student achievement in the models consisting of principals’

pedagogical roles, teachers reporting classroom observations as “important” and “highly

important” in their appraisals showed a positive and significant association with student

achievement when it interacted with parents being informed about the progress of their

children. Similarly, classroom observations being “important” and “highly important” as

criteria in teacher appraisals showed positive and significant interactions with principals’

authority in making changes in teachers’ salaries in science and reading.
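
The logic of an interaction term of this kind can be sketched with a small example. The cell means below are invented for illustration and are not estimates from this study; the function and variable names are likewise hypothetical. With two binary indicators, the interaction coefficient in a saturated regression equals the difference between the two conditional effects (a difference in differences).

```python
def interaction_decomposition(cell_means):
    """Decompose 2x2 cell means into intercept, main effects, and interaction.

    cell_means maps (observation_important, parents_informed) -> mean score,
    with each indicator coded 0 or 1.
    """
    intercept = cell_means[(0, 0)]
    # Effect of observations being important when parents are NOT informed.
    observation = cell_means[(1, 0)] - cell_means[(0, 0)]
    # Effect of informing parents when observations are NOT important.
    informed = cell_means[(0, 1)] - cell_means[(0, 0)]
    # Extra change when both conditions hold together.
    interaction = cell_means[(1, 1)] - intercept - observation - informed
    return {
        "intercept": intercept,
        "observation": observation,
        "informed": informed,
        "interaction": interaction,
    }


# Hypothetical mean scores, chosen only to mirror the pattern described:
# a negative main effect of observations and a positive interaction.
hypothetical_means = {
    (0, 0): 500.0,
    (1, 0): 495.0,
    (0, 1): 505.0,
    (1, 1): 515.0,
}

effects = interaction_decomposition(hypothetical_means)

# Effect of observations among schools that inform parents.
observation_effect_when_informed = effects["observation"] + effects["interaction"]

print(effects)
print("observation effect when parents are informed:",
      observation_effect_when_informed)
```

In this invented pattern, the association of classroom observations with scores is negative where parents are not informed and positive where they are. A main effect and an interaction can therefore carry opposite signs without contradiction.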

This is in consonance with the findings from Wößmann et al. (2007) wherein

interactions between “autonomy in formulating school budget” and “accountability” aspects

of schools showed as positive and significant in relation to student achievement. In this

sense, looking at the phenomenon closely, informing parents about the progress of their

children makes principals and teachers accountable in front of parents in terms of their

personal professional reputation and hence career prospects in schools. The positive

interaction between classroom observations as important criteria in teacher evaluations and

informing parents about the progress of their children also confirms previous research and

scholarly evidence on the important and significant role of parents in schools (e.g.,

Henderson, 1987, 1988; Fan & Chen, 2001). The earlier evidence on the effects and

relations of parental involvement in schools overwhelmingly suggests a positive association

with student achievement (Fan & Chen, 2001; Ingram, Wolfe, & Lieberman, 2007; Jeynes,

2012; Sui-Chu & Willms, 1996). While parents have different reasons to involve themselves

in the schools where their children are enrolled, the findings in this study show that parental

involvement plays over and above those reasons in relation to student achievement. Parental

involvement through the disclosure of their children’s progress appears to be leading the

principals to make their classroom observations effective at raising student achievement.

This suggests that making teachers more accountable for their performance while at the

same time having a developmental purpose attached to teacher evaluation can have

significantly positive implications for student achievement.

The same logic can be extended to the significant positive interaction between

classroom observations being “important” and “highly important” as criteria in teacher

appraisals and principals’ authority in making changes in teachers’ salaries. When teachers

are held accountable for the quality of their practice through principals’ authority to make

changes in their salaries, it appears to make classroom observations effective at raising

student achievement. Teachers and principals appear to be producing meaningful

interactions through classroom observations that lead both to work towards the intended

goals of improving classroom practices and hence student achievement. This could also

mean that principals are able to assert their authority and push teachers to follow the school

goals of raising student achievement in the form of student test scores.

Teacher Evaluation: Country Variables

As described in the methods section, 8 variables at the country level were reduced to

two components namely “professional outcomes” and “others.” The first component showed

a negative and significant association with student achievement in all three subjects. The

negative association with student achievement goes in line with the body of literature that

cautions against using such measures as student test scores as the sole measures of

teachers’ performance (e.g., Darling-Hammond, Amrein-Beardsley, Haertel, & Rothstein,

2012; Mathis, 2012; Rosenkvist, 2010). At the same time, while the relationship is found to

be negative for this component which also consists of some developmental criteria in

teacher evaluations, earlier studies such as Wenglinsky (2002) and Taylor and Tyler (2011)

show the efficacy of developmental approaches to teacher evaluation such as direct

appraisal of classes, innovation in teaching, and professional development undertaken. This

discrepancy in findings of this study necessitates further probing of the effects of individual

country level variables as used in this study. The mismatch between findings in my study

and earlier studies also opens up avenues for exploring opportunities and challenges of

combining the two important international datasets—PISA and TALIS—for statistical

analyses on complex classroom processes.

The second component on criteria for teacher appraisals, “others,” included feedback

from parents and relations with colleagues. The results showed that this component related

negatively to student achievement in mathematics and reading. The component showed a

negative but insignificant association with student achievement in science. This suggests

that while parental involvement is a strategy in the right direction as regards raising student

achievement for all students, using parental feedback as a criterion in teacher appraisals is

not associated with positive outcomes for student achievement. Similarly, although positive relations

with colleagues can contribute significantly in generating a synergy and an atmosphere of

collaboration among teachers, using this as a criterion for teacher appraisal appears to be an

ineffective strategy in teacher evaluations.

The third component consisted of outcomes and impact of teacher evaluation on

various aspects of teachers’ professional lives. This component showed a positive

significant association with student achievement in mathematics. In science it showed as

insignificant and positive while in reading it delivered a significant negative association

with student achievement; thus, this component at best showed mixed results. A closer look

at these variables shows that many of these are high-stakes in nature involving significant

effects on teachers’ career and employability. In the case of mathematics, these findings

point towards the assertions from earlier studies (e.g., Schütz et al., 2007; Wößmann et al.,

2007) that highlight the importance of attaching high-stakes outcomes to teacher evaluation.

Furthermore, positive association in mathematics, an insignificant positive

association in science, and a significant negative association in reading point toward a

complex dynamic that appears to be at play with regard to the relative importance of these

subjects in schools. Given the general observation that mathematics and science receive

greater attention in schools as “key” or “important” subjects, in the case of mathematics, this

might well be a result of a disproportionate importance accorded to teachers of mathematics

and to some extent to the teachers of science. Teachers of mathematics may be expected to

place a greater emphasis on student test scores and accordingly they may be able to produce

better results compared to teachers in reading thereby leading to a differential award in

terms of monetary incentives, public recognition, a greater role in school development, and

more opportunities for professional development for teachers of mathematics. In reading,

these aspects may work against the subject teachers, who may not receive, for example, as large a

share of the incentives as teachers in mathematics and science, which may act as a

demotivating factor for teachers in this subject.

Following this line of argument, it may be relevant to propose to offer differential

incentives to teachers based on their performance. However, such a proposition suffers

challenges in the light of evidence from other studies where incentives such as pay-for-

performance approaches suggest negative (e.g., Fryer, 2013), weak, or at best mixed effects

on student achievement (e.g., Springer et al., 2012). Also, as stated earlier, teacher

evaluations with high-stakes consequences have been found to have other negative social

effects such as a decline in teacher collaboration (e.g., Farrell & Morris, 2004), narrowing of

curriculum (Menken, 2006), and other social consequences which are detrimental in nature

(Suen & Yu, 2006). As far as findings of this study go, a high-stakes effect is visible in

teacher evaluation practices in relation to student achievement in mathematics, and to some

extent in science. Teachers of mathematics appear to be making additional efforts in helping

students to score better in their assessments. This association appears to be a result of

extrinsic incentives to safeguard professional reputation, secure a positive change in salary,

and/or a change in work responsibilities in schools.

Policy Implications and Recommendations

In the light of the findings of the study and how these findings are situated within the

larger body of prior evidence and literature on the subject, some key policy implications and

recommendations are suggested here.

Given that the schools across OECD countries rely heavily on classroom

observations to assess and evaluate teachers (Isoré, 2009), when it comes to evaluating and

developing quality of teaching workforce, it becomes paramount to look at ways through

which this tool can be made effective in the hands of principals and other observers in

schools. As Danielson and McGreal (2000) liken classroom observation to teacher

evaluation, counting it as the most effective way to witness the interactions between teachers

and students, schools need to identify ways through which this tool can be effectively

utilized in classrooms. Because teacher evaluation mostly happens internally in schools

(OECD, 2009b), classroom observations by principals need to be looked into for their

efficacy. It is also necessary to improve the efficacy of classroom observations to

improve teachers and teaching since 73% of the teachers accorded highest importance to

classroom observations as criteria in their appraisals (OECD, 2009b). In contrast to this high

majority of teachers according high importance to classroom observations as criteria in their

appraisals, about 50% of the students were enrolled in schools where principals seldom or

never observed teachers in classes. This exposes something of a mismatch between what

teachers consider as important in their appraisal criteria and principals’ administrative and

pedagogical focus in schools. In the light of this mismatch in the focus on the part of school

principals and teachers’ expectations of their appraisal criteria, it would be important for the

former to assess their roles as pedagogical leaders in schools. Such an assessment should

lead principals to give quality time to individual teachers in their classrooms to guide them

towards improving instructional practices. Principals should develop an enriched picture of

teachers’ practices in classes, assess their skills based on this enriched picture, and devise

effective strategies for teachers’ development tailored to the individual needs of teachers. In

this regard, it would be a wise policy to develop principals as effective pedagogical leaders

who are using classroom observations and other developmental approaches to assess

teachers’ effectiveness, and who are working closely with individual teachers to improve

instruction for the optimal learning outcomes for all students.

Second, teacher monitoring in test language and for that matter in any other subject

has the potential to raise student achievement. If schools are able to effectively utilize this

approach to identify strengths and weaknesses in teachers’ practice and accordingly plan for

adjustment of progress towards achieving instructional goals and objectives, teacher

monitoring can be an effective tool for raising student achievement.

Third, using student assessments for making high-stakes decisions in teacher

evaluations has important policy implications. Student assessments as used for teacher

evaluation and for administrative tracking, which often come with high-stakes

consequences, show negative and in some cases significant associations with student

achievement. In the light of this, it can be suggested that student assessments as criteria for

teacher evaluation and tracking of the same by an administrative authority need careful

examination. It will be in the hands of the schools to decide if and how much of a share

should be given to student assessments as criteria in teacher evaluation mechanisms.

Schools will have to carefully note that the use of student assessments in high-stakes teacher

evaluations entails the danger of spiraling the instructional processes into what is commonly

known as the “teaching to the test” effect while producing other negative social consequences as

prior evidence suggests (Berliner, 2011; Crocco & Costigan, 2007; Klein, Hamilton,

McCaffrey, & Stecher, 2000; Jerald, 2006; Koretz, 2002, 2008; Menken, 2006; Suen & Yu,

2006). Therefore, it appears a relevant and important strategy and policy for schools to cut

down on the share of student assessments in teacher evaluations with high-stakes

consequences. Schools can use other valid measures such as standards-based classroom

observations and rubrics that have greater developmental potential. Also, schools may

seriously reconsider their practice of using student test scores as the sole measures of

teacher performance. Rosenkvist (2010) asserts, using evidence from earlier studies on the

subject, that using student test results as the sole measure of teacher performance for making

high-stakes decisions about teachers is inadequate. Mathis (2012) also warns against using

student test scores as the only measures for assessing teachers in high-stakes evaluations. He

posits:

While such summative evaluations can be useful, lawmakers should be wary of

approaches based in large part on test scores: the error in the measurements is

large—which results in many teachers being incorrectly labeled as effective or

ineffective; relevant test scores are not available for the students taught by most

teachers, given that only certain grade levels and subject areas are tested; and the

incentives created by high-stakes use of test scores drive undesirable teaching

practices such as curriculum narrowing and teaching to the test. (p. 1)

In the light of this and the fact that across TALIS countries, with few exceptions, more than

50% of the criteria for teacher appraisal is in the form of student test scores (OECD, 2009b),

schools and policymakers may need to review their strategies on using student performance

as evidence of teacher performance in their teacher evaluation systems.

Fourth, public accountability offers potential pay-offs with regard to raising student

achievement in the form of student test scores. Schools can assess their local situations and

accountability environments and devise strategies on how effective it could be for them to

make student performance public. However, caution should be exercised in the use of this

approach since attaching a high-stakes consequence almost invariably produces unintended

consequences. Teacher morale, collaboration, teacher-student relations, and a number of

other contextual and cultural factors may be negatively affected because of this practice

leading to long term harmful effects. Schools will have to carefully analyze the tradeoffs

between having this practice and the potential long-term gains. As far as findings of this

study go, public accountability establishes a positive link with student test scores in

mathematics, science, and reading.

Fifth, the positive interactions between classroom observations and informing

parents about their children’s progress and principals having authority in making changes in

teachers’ salaries lead to two important policy recommendations. First, parental involvement

in schools shows as important with reference to one key process—classroom observations.

It is widely understood that parental involvement is important if the purpose is to enhance

student learning outcomes (Fan & Chen, 2001; Ingram, Wolfe, & Lieberman, 2007; Jeynes,

2012). Similarly, when schools and teachers are made accountable to parents, it reflects

positively in enhancing efficacy of within-school processes such as classroom observations.

Therefore, it would be important for policymakers and educators to enhance parental

involvement in schools. Parental involvement undoubtedly brings its own challenges.

However, the potential pay-offs seem to far outweigh the challenges. Parents can bring in

aspects of student learning that schools and educators may not independently grasp. Parents

can assist educators to identify and meet individual student needs which otherwise may go

undetected when there is a gap between educators and parents. All these positive aspects of

parental involvement have significant potential to ultimately raise student achievement for

all students. At the same time, while parental involvement should be promoted, their

feedback on teaching as a criterion for teacher appraisal should be avoided given that this

variable returned a negative relation with student achievement in this study. Secondly,

principals’ authority to make changes in teachers’ salaries may be promoted as part of

reforms aimed at school-based management. However, once again, while attaching high-stakes

consequences may show short-term gains in student achievement, they may not be effective

in the long run, and student learning may suffer from a watering-down of the

curriculum leading to a “teaching to the test” effect.

Last but not least, evidence on the variation in student and school level

constructs shows that student achievement is overwhelmingly influenced by issues

surrounding students’ socioeconomic backgrounds, educational resources in schools, and

other demographic factors. Policies of teacher evaluation and other educational processes

would not likely succeed when there is an inadequate supply of direly needed educational

resources, when there are huge income disparities across different socioeconomic strata, and

when there are dichotomies in educational systems that lead to different outcomes for

children from different socioeconomic backgrounds. It will be paramount for effective

policy development in schools to equalize all these key background and school level factors

thereby providing a level playing field for all students in all classes.

Limitations of the Study

As I stated at the outset of this chapter, the study suffers from some limitations that

need to be considered before generalizing the findings beyond the target population of

countries from which the sample is drawn. First, the study looks at student achievement only

in the form of student test scores as reflected in the PISA tests on mathematics, science, and

reading. Many scholars have singled out the limitations of using student achievement in the

form of student test scores in a monolithic fashion to assess teacher effectiveness (Baker et

al., 2010; Darling-Hammond et al., 2012; Klein et al., 2000; Koretz, 2008; Kornhaber,

2004a; Mathis, 2012; Rosenkvist, 2010). They rightly assert that the student test scores

present a limited view of student learning and that using this limited information to make

consequential decisions about tenure, teacher salaries, and other key matters related to

teachers is likely to lead to inflated scores without commensurate gains in students’

knowledge and skills. Thus, findings in this study should be looked at in relation to

the effectiveness or otherwise of teacher evaluation practices and purposes in raising student

achievement only in the form of student test scores.

The second major limitation of the study stems from the specifications of the study

models. The study has explored relationships between student achievement and teacher

monitoring and evaluation at one level, i.e., the student level, in a pooled sample of 21 countries.

It needs to be noted that having an aggregate sample and employing one-level cross-country

analysis may be problematic given the complex sampling structure in the PISA 2009 survey.

Such a cross-country analysis at one level may distort or curtail the true picture of the

variation across different levels. Some of the variation that can be attributed to country level

may be making its way down to student level thereby giving an inflated picture of the

between-subjects variance. Also, some of the relationships that appear non-significant in an

aggregate sample may appear significant in multilevel analyses, since one-level analyses at the

student level assume that effects are fixed across countries. Some of these limitations have been

offset by creating interaction terms between different levels of the data as described earlier

in the methods on variables. However, the issue may still persist since we know that

countries, and in many cases schools within a country, differ in terms of their evaluation

practices and purposes. It is expected that these between-country and between-school

variations, wherever significant and applicable, will not greatly compromise the findings in

this study since the OLS models employed here make use of student weights in combination

with all five plausible values using the standard procedures suggested by OECD (2009c).
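
To make the pooling step concrete, the following minimal Python sketch illustrates how a coefficient estimated separately for each of the five plausible values might be combined, following the multiple-imputation logic (Rubin, 1987) that underlies the OECD (2009c) procedure. The function name and the numbers in the usage example are hypothetical and are not results from this study. The sampling variance for each plausible value is assumed to have been computed beforehand, for example with the replicate weights supplied in the PISA data.

```python
import math
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Pool one statistic estimated once per plausible value.

    estimates          -- the statistic (e.g., a regression coefficient)
                          estimated separately with each plausible value
    sampling_variances -- the sampling variance of each of those estimates
                          (assumed to be computed elsewhere)
    Returns the pooled point estimate and its standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need one sampling variance per estimate, and at least two estimates")

    point = mean(estimates)              # final estimate: average over plausible values
    within = mean(sampling_variances)    # average sampling variance
    between = variance(estimates)        # imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return point, math.sqrt(total)


if __name__ == "__main__":
    # Hypothetical coefficients and sampling variances, for illustration only.
    coefs = [4.1, 3.8, 4.4, 4.0, 3.9]
    samp_vars = [1.2, 1.1, 1.3, 1.2, 1.2]
    estimate, std_error = combine_plausible_values(coefs, samp_vars)
    print(f"pooled estimate = {estimate:.3f}, standard error = {std_error:.3f}")
```

The pooled standard error is larger than any single-run standard error would suggest whenever the five estimates disagree, which is why analysing only one plausible value, or the average of the five, tends to understate uncertainty.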

A third possible limitation of the study arises from the very focus and content of the

PISA 2009 survey. PISA 2009 sought information from principals on teacher evaluation and

appraisal practices in a highly structured fashion. Except for the items covering principals’

pedagogical role in teacher evaluations, all main variables in this study that relate to

teacher evaluation and appraisal were structured on a “Yes/No” basis. This may have

resulted in a loss of important information as regards intricate details of teacher monitoring

and evaluation practices in schools. Thus, while teacher monitoring and evaluation are

complicated processes with huge variation across countries and even within countries across

schools and systems, the “Yes/No” format in the survey does not cover the full range of the

complex dynamics of the process. Some of this limitation has been mitigated by combining

information from TALIS 2008, which presents multidimensional and more detailed

information on teacher evaluation practices and purposes by incorporating teachers’ views

and experiences on the subject.

Recommendations for Further Research

The study offers some recommendations for future research around teacher

monitoring and evaluation. First, studies involving multilevel mixed (random and fixed)

models around different approaches and purposes of teacher monitoring and evaluation

using both the datasets—PISA and TALIS—in conjunction will be a valuable scholarly

pursuit. Since there is variation in teacher evaluation practices and purposes across schools

as well as across countries, the multilevel models should look at between-school and

cross-country variations separately to identify differences in practices at the level of schools and

countries. In such analyses, it would be relevant to use the primary TALIS dataset in

conjunction with the primary PISA dataset. However, combining both the datasets may pose

technical challenges since there is no common identifier in the datasets at the school and

student levels to enable researchers to merge the two datasets. Therefore, a more relevant

recommendation for the PISA and TALIS surveys would be to combine school, teacher, and

student information in one survey, thereby covering the whole range of teacher monitoring

and evaluation practices in schools.
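
As one illustration of the kind of specification recommended here, a three-level random-intercept model for student i in school j in country k could be written as follows. The notation is an assumption introduced for illustration only and does not reproduce the models estimated in this study.

```latex
y_{ijk} = \beta_0 + \beta_1 x_{ijk} + \beta_2 z_{jk} + v_k + u_{jk} + e_{ijk},
\qquad
v_k \sim N(0, \sigma_v^2), \quad
u_{jk} \sim N(0, \sigma_u^2), \quad
e_{ijk} \sim N(0, \sigma_e^2)

\rho_{\text{country}} = \frac{\sigma_v^2}{\sigma_v^2 + \sigma_u^2 + \sigma_e^2},
\qquad
\rho_{\text{school}} = \frac{\sigma_u^2}{\sigma_v^2 + \sigma_u^2 + \sigma_e^2}
```

Here y is a student's test score, x a student-level covariate, and z a school-level teacher evaluation indicator. The two ratios give the shares of total variance lying between countries and between schools within countries, which is the decomposition that a one-level pooled analysis cannot provide.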

The study also recommends exploring teacher evaluation practices in quasi-

experimental settings to uncover dynamics that are making classroom observations effective

when combined with parental involvement and principals’ authority in making changes in

teachers’ salaries. Given the finding that classroom observations accorded high importance in

schools show significant positive relations with student achievement when interacted with

parental involvement and principals’ authority in making changes in teachers’ salaries, it will

be an important scholarly pursuit to explore the dynamics undergirding such interactions. For

example, a study of how parents who are informed of their children’s progress may make

principals’ observations of teachers more effective would highlight the subtleties of this

relationship in schools.

Case studies of select groups of countries with radically different approaches in

teacher evaluation systems would also be a fruitful area to explore. For example, Finland is

a top performing PISA country yet it has minimal teacher evaluations in schools. On the

other hand, Chile has high activity around teacher evaluations but its performance in the

PISA assessments is not as promising. A significant portion of teacher evaluation in Chile

leads to high-stakes consequences including financial rewards and in some cases dismissal

from service. Thus, it would be a relevant pursuit, for both scholarship and policy, to single out

countries at the extremes of student performance and teacher evaluation approaches to see

what aspects of teacher evaluations are really important when it comes to raising student

achievement.

Conclusions

Teacher quality is one of the most significant determinants of student achievement in

schools. A quality teaching workforce delivering high-quality instruction for the benefit of

all students in schools has been found to be a key policy ingredient of some of the world’s

top-performing education systems. In their study of the world’s high-performing school

systems, Barber and Mourshed (2007) emphatically put forward that:

…high-performing school systems, though strikingly different in construct and

context, maintained a strong focus on improving instruction because of its direct

impact upon student achievement. To improve instruction, these high-performing

school systems consistently do three things well:

- They get the right people to become teachers (the quality of an education system

cannot exceed the quality of its teachers).

- They develop these people into effective instructors (the only way to improve

outcomes is to improve instruction).

- They put in place systems and targeted support to ensure that every child is able

to benefit from excellent instruction (the only way for the system to reach the

highest performance is to raise the standard for every student). (p. 13)

In the light of such convincing arguments on the critical place that instruction and teacher

quality assume in the world of schooling, reforming and improving teacher monitoring and

evaluation appear to be relevant policy pursuits. Schools can use teacher monitoring and

evaluation as tools to improve teacher effectiveness for improving student learning

outcomes including student test scores.

This study looked at teacher monitoring and evaluation practices and purposes and

their relationships with student achievement in the form of student test scores captured by

the PISA 2009 survey. As the findings of this study as well as prior evidence suggest,

teacher monitoring and evaluation are multifaceted approaches with different purposes and

outcomes. The evidence in this study has only confirmed the complexity of the process

while exploring its potential utility in raising achievement for all students. The study

suggests that while it is important to assess teachers in order to improve their quality, there

is no single approach that is universally effective in raising student

achievement in different educational contexts across the globe. Thus, devising a feasible

teacher monitoring and evaluation system for a specific educational context will largely

remain the task of policymakers at each level of governance, who must work in concert to arrive

at the design that is most effective in the local conditions of each school system.

However, some of the findings of this study that remained in conflict with prior evidence

suggest that policymakers, educators, and school leaders must base their policies on rigorous

research. The evidence should mostly come from studies conducted in the context in which

the teacher evaluation policies are meant to be applied. Findings from studies in other

educational contexts should only serve as spurs to generate meaningful conversations

among key stakeholders around viable policy alternatives.

The bottom line of all this discussion can be summed up in the words of Kornhaber

(2004b):

There are many purposes and forms of assessment. However, there should be just

one motivation: assessment should serve as a tool to enhance all students’

knowledge, skills and understanding so that they can function at the highest possible

level in the wider world. (p. 91)

Extrapolating Kornhaber’s argument to teacher monitoring and evaluation and paraphrasing

her statement to suit the subject matter, it would be logical to suggest that the purpose of any

teacher monitoring and evaluation system should have just one motivation: it should serve

to enhance teachers’ capacities, skills, and knowledge and understanding so that they can

function at the highest possible level of their professional capacity to enable all students to

function at the highest possible level in the wider world. Policymakers can achieve this goal

by striking a balance between developmental and high-stakes approaches to teacher

monitoring and evaluation. The study suggests that teacher monitoring and evaluation must

be used only as a means to an end, that end being improved quality of instruction and hence

optimal learning outcomes for all students. Teacher evaluation must not become an end in itself,

and certainly not a punitive one. Last but not least,

governments and policymakers need to equalize educational opportunity for all students by

removing the barriers associated with the socioeconomic disparities. If these barriers persist,

students will be limited in their ability to optimally benefit from otherwise well-intended

policy reforms including reforms to improve teacher monitoring and evaluation practices.

References

Astin, A.W. (1982). Excellence and equity in American education. Washington, DC:

National Commission on Excellence in Education. Retrieved from ERIC database.

(ED 227098)

Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., …

Shepard, L. A. (2010). Problems with the use of student test scores to evaluate

teachers (Briefing Paper No. 278). Washington, DC: Economic Policy Institute.

Retrieved from ERIC database. (ED516803)

Barber, M., & Mourshed, M. (2007). How the world’s best-performing school systems come out on top.

McKinsey & Company. Retrieved May 20, 2012, from

http://mckinseyonsociety.com/downloads/reports/Education/Worlds_School_Systems_Final.pdf

Beese, J., & Liang, X. (2010). Do resources matter? PISA science achievement comparisons

between students in the United States, Canada and Finland. Improving Schools, 13(3),

266–279. doi:10.1177/1365480210390554

Berliner, D. (2011). Rational responses to high-stakes testing: The case of curriculum

narrowing and the harm that follows. Cambridge Journal of Education, 41(3), 287–302.

Bingham, R. D., Heywood, J. S., & White, S. B. (1991). Evaluating schools and teachers

based on student performance: Testing an alternative methodology. Evaluation

Review, 15(2), 191–218.

Bishop, J. H. (1997). The effect of national standards and curriculum-based exams on

achievement. American Economic Review, 87(2), 260-264.

Bishop, J. H. (1999). Are national exit examinations important for educational efficiency?

Swedish Economic Policy Review, 6(2), 349–398.

Borman, G. D., & Kimball, S. M. (2005). Teacher quality and educational equality: Do

teachers with higher standards-based evaluation ratings close student achievement

gaps? The Elementary School Journal, 106(1), 3–20.

Bossert, S. T., Dwyer, D. C., Rowan, B., & Lee, G. V. (1982). The instructional

management role of the principal. Educational Administration Quarterly, 18(3), 34–64.

doi:10.1177/0013161X82018003004

Boudett, K. P., Murnane, R. J., City, E., & Moody, L. (2005). Teaching educators how to

use student assessment data to improve instruction. Phi Delta Kappan, 86(9), 700–706.

Bovens, M. (2005). Public accountability. In E. Ferlie, L. E. Lynn, & C. Pollitt (Eds.), The

Oxford Handbook of Public Management (pp. 182-208). Oxford: Oxford University

Press.

Buddin, R., & Zamarro, G. (2009). Teacher qualifications and student achievement in urban

elementary schools. Journal of Urban Economics, 66(2), 103–115.

doi:10.1016/j.jue.2009.05.001

Carlson, R. V., & Park, R. (1976). Teacher evaluation: Relevant concepts and procedures.

Retrieved from ERIC database. (ED129739)

Cohen, D. K., & Hill, H. C. (2000). Instructional policy and classroom performance: The

mathematics reform in California. Teachers College Record, 102(2), 294-343.

Crocco, M. S., & Costigan, A. T. (2007). The narrowing of curriculum and pedagogy in the

age of accountability: Urban educators speak out. Urban Education, 42(6), 512–535.

doi:10.1177/0042085907304964

Danielson, C. (1996). Enhancing professional practice: A framework for teaching.

Alexandria, VA: Association for Supervision and Curriculum Development.

Danielson, C., & McGreal, T. L. (2000). Teacher evaluation to enhance professional

practice. Alexandria, VA: Association for Supervision and Curriculum

Development.

Darling-Hammond, L., Amrein-Beardsley, A., Haertel, E., & Rothstein, J. (2012).

Evaluating teacher evaluation. Phi Delta Kappan, 93(6), 8–15.

Darling-Hammond, L. (2000). Teacher quality and student achievement: A review of state

policy evidence. Education Policy Analysis Archives, 8(1), 1–44.

Demir, İ., Kılıç, S., & Ünal, H. (2010). Effects of students’ and schools’ characteristics on

mathematics achievement: Findings from PISA 2006. Procedia - Social and

Behavioral Sciences, 2(2), 3099–3103. doi:10.1016/j.sbspro.2010.03.472

Demir, İ., Ünal, H., & Kılıç, S. (2010). The effect of quality of educational resources on

mathematics achievement: Turkish case from PISA-2006. Procedia - Social and

Behavioral Sciences, 2(2), 1855–1859. doi:10.1016/j.sbspro.2010.03.998

Development Assistance Committee [DAC] (n.d.). Glossary of key terms in evaluation and

results based management. Retrieved from Organization for Economic Cooperation

and Development [OECD] website: http://www.oecd.org/dac/evaluation/18074294.pdf

Donaldson, M. L. (2009). So long, Lake Wobegon? Using teacher evaluation to raise

teacher quality. Retrieved from Center for American Progress website:

http://www.americanprogress.org/issues/education/report/2009/06/25/6243/so-long-lake-wobegon/

Donaldson, M. L. (2011). Principals’ approaches to developing teacher quality: Constraints

and opportunities in hiring, assigning, evaluating, and developing teachers. Retrieved

from Center for American Progress website:

http://www.americanprogress.org/issues/2011/02/pdf/principal_report.pdf

Evertson, C. M., & Holley, F. M. (1981). Classroom observation. In J. Millman (Ed.),

Handbook of Teacher Evaluation (pp. 90-109). Beverly Hills: Sage Publications.

Fan, X., & Chen, M. (2001). Parental involvement and students’ academic achievement: A

meta-analysis. Educational Psychology Review, 13(1), 1–23.

Farrell, C., & Morris, J. (2004). Resigned compliance: Teacher attitudes towards

performance-related pay in schools. Educational Management Administration &

Leadership, 32(1), 81–104.

Faubert, V. (2009). School evaluation: Current practices in OECD countries and a

literature review, OECD Education Working Papers, No. 42, OECD Publishing.

http://dx.doi.org/10.1787/218816547156

Feldman, J., & Tung, R. (2001). Using data-based inquiry and decision making to improve

instruction. ERS Spectrum, 19(03), 10–19.

Fryer, R. G. (2013). Teacher incentives and student achievement: Evidence from New York

City Public Schools. Journal of Labor Economics, 31(2), 373–407.

doi:10.1086/667757

Fuchs, T., & Wößmann, L. (2007). What accounts for international differences in student

performance? A re-examination using PISA data. Empirical Economics, 32(2), 433–464.

doi:10.1007/s00181-006-0087-0

Gallagher, H. A. (2004). Vaughn Elementary’s Innovative Teacher Evaluation System: Are

teacher evaluation scores related to growth in student achievement? Peabody Journal

of Education, 79(4), 79–107.

Glazerman, S., Loeb, S., Goldhaber, D., Staiger, D., Raudenbush, S., & Whitehurst, G.

(2010). Evaluating teachers: The important role of value-added. Retrieved from

Brown Center of Education Policy at Brookings website:

http://www.brookings.edu/research/reports/2010/11/17-evaluating-teachers

Goe, L., Bell, C., & Little, O. (2008). Approaches to evaluating teacher effectiveness: A

research synthesis. Retrieved from National Comprehensive Center for Teacher

Quality website: www.tqsource.org/publications/EvaluatingTeachEffectiveness.pdf

Goldhaber, D., & Hansen, M. (2010). Using performance on the job to inform teacher tenure

decisions. American Economic Review, 100(2), 250–255.

Haefele, D. L. (1993). Evaluating teachers: A call for change. Journal of Personnel

Evaluation in Education, 7(1), 21–31. doi:10.1007/BF00972346

Hanushek, E. A. (1992). The trade-off between child quantity and quality. Journal of

Political Economy, 100(1), 84–117.

Hanushek, E. A. (2003). The failure of input-based schooling policies. The Economic

Journal, 113(485), F64–F98.

Hanushek, E. A., Kain, J. F., O’Brien, D. M., & Rivkin, S. G. (2005). The market for

teacher quality (Working Paper No. 11154). Retrieved from National Bureau of

Economic Research website: http://www.nber.org/papers/w11154

Hanushek, E. A., & Raymond, M. E. (2005). Does school accountability lead to improved

student performance? Journal of Policy Analysis and Management, 24(2), 297-328.

Harris, D. N., & Sass, T. R. (2011). Teacher training, teacher quality and student

achievement. Journal of Public Economics, 95(7-8), 798–812.

doi:10.1016/j.jpubeco.2010.11.009

Henderson, A. (1987). The evidence continues to grow: Parent involvement improves

student achievement. Columbia, MD: National Committee for Citizens in Education.

Henderson, A. T. (1988). Parents are a school’s best friends. Phi Delta Kappan, 70(2), 148–153.

Holtzapple, E. (2003). Criterion-related validity evidence for a standards-based teacher

evaluation system. Journal of Personnel Evaluation in Education, 17(03), 207–219.

Hooge, E., Burns, T., & Wilkoszewski, H. (2012). Looking beyond the numbers:

Stakeholders and multiple school accountability (Working Paper No. 85). Retrieved

from OECD website: http://dx.doi.org/10.1787/5k91dl7ct6q6-en

Hoover-Dempsey, K. V., & Sandler, H. M. (1997). Why do parents become involved in their

children’s education? Review of Educational Research, 67(1), 3–42.

Ingram, M., Wolfe, R. B., & Lieberman, J. M. (2007). The role of parents in high-achieving

schools serving low-income, at-risk populations. Education and Urban Society, 39(4),

479–497.

Isoré, M. (2009). Teacher Evaluation: Current Practices in OECD Countries and a

Literature Review (Working Paper, No. 23). Retrieved from OECD website:

http://dx.doi.org/10.1787/223283631428

Jerald, C. D. (2006, August). The hidden costs of curriculum narrowing. Washington,

DC: The Center for Comprehensive School Reform and Improvement. Retrieved

from ERIC database. (ED494088)

Jeynes, W. (2012). A meta-analysis of the efficacy of different types of parental

involvement programs for urban students. Urban Education, 47(4), 706–742.

doi:10.1177/0042085912445643

Jürges, H., Richter, W. F., & Schneider, K. (2005). Teacher quality and incentives:

Theoretical and empirical effects of standards on teacher quality.

FinanzArchiv/Public Finance Analysis, 61(3), 298–326.

Kimball, S. M., White, B., Milanowski, A. T., & Borman, G. (2004). Examining the

relationship between teacher evaluation and student assessment results in Washoe

County. Peabody Journal of Education, 79(4), 54–78.

Klein, S. P., Hamilton, L. S., McCaffrey, D. F., & Stecher, B. M. (2000). What do test scores

in Texas tell us? Education Policy Analysis Archives, 8(49), 1–22.

Koretz, D. M. (2002). Limitations in the use of achievement tests as measures of educators’

productivity. The Journal of Human Resources, 37(4), 752–777.

Koretz, D. M. (2008). Measuring up: What educational testing really tells us. Cambridge,

MA: Harvard University Press.

Kornhaber, M. L. (2004a). Appropriate and inappropriate forms of testing, assessment, and

accountability. Educational Policy, 18(1), 45–70. doi:10.1177/0895904803260024

Kornhaber, M. L. (2004b). Assessment, standards, and equity. In J. A. Banks & C. A. M.

Banks (Eds.), Handbook of research on multicultural education (2nd ed., pp. 91–109).

San Francisco, CA: Jossey-Bass.

Larsen, M. A. (2005). A critical analysis of teacher evaluation policy trends. Australian

Journal of Education, 49(3), 292–305.

Latham, G., & Wexley, K. (1982). Increasing productivity through performance appraisal.

Monterey, CA: Brooks/Cole.

Levacic, R. (2004). Competition and the performance of English secondary schools: Further

evidence. Education Economics, 12(2), 177-193.

Levin, H. M. (1974). A conceptual framework for accountability in education. The School

Review, 82(3), 363–391.

Levitt, R., Janta, B., & Wegrich, K. (2008). Accountability of teachers: Literature review.

Retrieved from RAND Corporation website:

http://www.rand.org/pubs/technical_reports/TR606.html

Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4–16.

Looney, J. (2011). Developing high-quality teachers: Teacher evaluation for improvement.

European Journal of Education, 46(4), 440–455.

Mathis, W. (2012). Research-based options for education policy making. Retrieved from

National Education Policy Center website: http://nepc.colorado.edu

McGreal, T. L. (1988). Evaluation for enhancing instruction: Linking teacher evaluation and

staff development. In S. J. Stanley & W. J. Popham (Eds.), Teacher evaluation: Six

prescriptions for success (pp. 1-29). Alexandria, VA: Association for Supervision

and Curriculum Development.

McNeil, L. M. & Valenzuela, A. (2001). The harmful impact of the TAAS system of testing

in Texas: Beneath the accountability rhetoric. In M. Kornhaber & G. Orfield (Eds.),

Raising standards or raising barriers? Inequity and high-stakes testing in public

education (pp. 127-150). New York: Century Foundation.

McNeil, L. M., Coppola, E., Radigan, J., & Vasquez Heilig, J. (2008). Avoidable losses:

High-stakes accountability and the dropout crisis. Education Policy Analysis Archives,

16(3). Retrieved from Education Policy Analysis Archives website:

http://epaa.asu.edu/epaa/v16n3/

Menken, K. (2006). Teaching to the test: How No Child Left Behind impacts language

policy, curriculum, and instruction for English language learners. Bilingual Research

Journal, 30(2), 521–546.

Milanowski, A. (2004). The relationship between teacher performance evaluation scores and

student achievement: Evidence from Cincinnati. Peabody Journal of Education,

79(4), 33-53.

Milanowski, A. T., Kimball, S. M., & White, B. (2004). The relationship between

standards-based teacher evaluation scores and student achievement: Replication and

extensions at three sites. Retrieved from Consortium for Policy Research in Education

website: www.cpre-wisconsin.org/papers/3site_long_TE_SA_AERA04TE.pdf

National Center for Education Statistics [NCES]. (1996). High school seniors' instructional

experiences in science and mathematics. Washington, DC: U.S. Government Printing

Office.

Nolan, J. F., & Hoover, L. A. (2008). Teacher supervision and evaluation: Theory into

practice (2nd ed.). Hoboken, NJ: John Wiley & Sons.

Organization for Economic Cooperation and Development. (2005). Teachers matter:

Attracting, developing and retaining effective teachers. Retrieved from Organization

for Economic Cooperation and Development website:

http://dx.doi.org/10.1787/9789264018044-en

Organization for Economic Cooperation and Development. (2009a). Evaluating and

rewarding the quality of teachers: International practices. Retrieved from

Organization for Economic Cooperation and Development website:

http://dx.doi.org/10.1787/9789264034358-en

Organization for Economic Cooperation and Development. (2009b). Creating effective

teaching and learning environments: First results from TALIS. Retrieved from

Organization for Economic Cooperation and Development website:

http://www.oecd.org/edu/school/43023606.pdf

Organization for Economic Cooperation and Development. (2009c). PISA data analysis

manual: SPSS (2nd ed.). Paris: Organization for Economic Cooperation and

Development.

Organization for Economic Cooperation and Development. (2010a). PISA 2009 results:

What makes a school successful? Resources, policies and practices (Volume IV).

Retrieved from Organization for Economic Cooperation and Development website:

http://dx.doi.org/10.1787/9789264091559-en

Organization for Economic Cooperation and Development. (2010b). TALIS 2008 technical

report. Retrieved from Organization for Economic Cooperation and Development

website: http://www.oecd-ilibrary.org/education/talis-2008-technical-report_9789264079861-en

Organization for Economic Cooperation and Development. (2012). PISA 2009 technical

report. Retrieved from Organization for Economic Cooperation and Development

website: http://dx.doi.org/10.1787/9789264167872-en

Peterson, K. D. (2000). Teacher evaluation: A comprehensive guide to new directions and

practices (2nd ed.). Thousand Oaks, CA: Corwin Press Inc.

Ravitch, D. (2010). The death and life of the great American school system: How testing

and choice are undermining education. New York, NY: Basic Books.

Reid, L. N. (2012). The unintended consequences of narrowing secondary curriculum in

response to low standardized test scores (Doctoral dissertation). Retrieved from

Dissertations and Theses database. (UMI No. 3535729)

Ribas, W. B. (2005). Teacher evaluation that works (2nd ed.). Westwood, MA: Ribas

Publications.

Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools, and academic

achievement. Econometrica, 73(2), 417–458.

Rockoff, J. E. (2004). The impact of individual teachers on student achievement: Evidence

from panel data. American Economic Review, 94(2), 247–252.

doi:10.1257/0002828041302244

Rockoff, J. E., & Speroni, C. (2010). Subjective and objective evaluations of teacher

effectiveness. American Economic Review, 100(2), 261–266.

Rosenkvist, M. A. (2010). Using student test results for accountability and improvement: A

literature review (Working Paper, No. 54). Retrieved from OECD website:

http://dx.doi.org/10.1787/5km4htwzbv30-en

Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay, and

student achievement. The Quarterly Journal of Economics, 125(1), 175–215.

Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. New York: John Wiley

& Sons, Inc.

Sanders, W. L., & Horn, S. P. (1994). The Tennessee value-added assessment system

(TVAAS): Mixed-model methodology in educational assessment. Journal of Personnel

Evaluation in Education, 8(3), 299–311. doi:10.1007/BF00973726

Saphier, J. (1993). How to make supervision and evaluation really work. Acton, MA:

Research for Better Teaching, Inc.

Sartain, L., Stoelinga, S. R., Brown, E. R., Luppescu, S., Matsko, K. K., Miller, F. K., &

Durwood, C. E. (2011). Rethinking Teacher Evaluation in Chicago: Lessons learned

from classroom observations, principal-teacher conferences, and district

implementation. Retrieved from Consortium on Chicago School Research website:

http://ccsr.uchicago.edu/sites/default/files/publications/Teacher%20Eval%20Report%20FINAL.pdf

Schütz, G., West, M. R., & Wößmann, L. (2007). Autonomy, choice, and the equity of

student achievement: International evidence from PISA 2003. Retrieved from OECD

website: http://dx.doi.org/10.1787/246374511832

Scriven, M. (1981). Summative teacher evaluation. In J. Millman (Ed.), Handbook of

teacher evaluation (pp. 244-271). Beverly Hills: Sage Publications.

Sharkey, N. S., & Murnane, R. J. (2005). Roles for the district central office. In K. P.

Boudett, E. A. City, & R. J. Murnane (Eds.), Data wise: A step-by-step guide to using

assessment results to improve teaching and learning (pp. 179-188). Cambridge, MA:

Harvard Education Press.

Springer, M. G., Pane, J. F., Le, V.-N., McCaffrey, D. F., Burns, S. F., Hamilton, L. S., &

Stecher, B. (2012). Team pay for performance: Experimental evidence from the

Round Rock Pilot Project on team incentives. Educational Evaluation and Policy

Analysis, 34(4), 367–390. doi:10.3102/0162373712439094

Stronge, J. H., & Tucker, P. D. (2000). Teacher evaluation and student achievement.

Washington, DC: National Education Association.

Suen, H. K., & Yu, L. (2006). Chronic consequences of high-stakes testing? Lessons from

the Chinese civil service exam. Comparative Education Review, 50(1), 46–65.

doi:10.1086/498328

Sui-Chu, E. H., & Willms, J. D. (1996). Effects of parental involvement on eighth-grade

achievement. Sociology of Education, 69(2), 126-141. doi:10.2307/2112802

Taylor, E. S., & Tyler, J. H. (2011). The effect of evaluation on performance: Evidence from

longitudinal student achievement data of mid-career teachers (Working Paper No.

16877). Retrieved from National Bureau of Economic Research website:

http://www.nber.org/papers/w16877

Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding

concepts and applications. Washington, DC: American Psychological Association.

Toch, T. (2008). Fixing teacher evaluation: Evaluations pay large dividends when they

improve teaching practices. Educational Leadership, 66(02), 32–37.

Tyler, J. H., Taylor, E. S., Kane, T. J., & Wooten, A. L. (2010). Using student

performance data to identify effective classroom practices. American Economic

Review, 100(2), 256–260. doi:10.1257/aer.100.2.256

UNDP (2009). Handbook on planning, monitoring and evaluating for development results.

Retrieved from the United Nations Development Program website:

http://web.undp.org/evaluation/handbook/documents/english/pme-handbook.pdf

UNESCO. (2007). Evaluación del desempeño y carrera profesional docente: Una

panorámica de América y Europa. Santiago: Oficina Regional de Educación para América

Latina y el Caribe, UNESCO.

Wayman, J. C., & Stringfield, S. (2006). Technology-supported involvement of entire

faculties in examination of student data for instructional improvement. American

Journal of Education, 112(4), 549–571.

Wenglinsky, H. (2002). How schools matter: The link between teacher classroom practices

and student academic performance. Education Policy Analysis Archives, 10(12), 1–30.

West, M. R., & Peterson, P. E. (2006). The efficacy of choice threats within school

accountability systems: Results from legislatively induced experiments. The

Economic Journal, 116(510), C46–C62.

White, B. (2004). The relationship between teacher evaluation scores and student

achievement: Evidence from Coventry, RI. Retrieved from Consortium for Policy

Research in Education website: cpre.wceruw.org/papers/CoventryAERA04.pdf

Wiggins, A., & Tymms, P. (2002). Dysfunctional effects of public performance indicator

systems: A comparison between English and Scottish primary schools. Public Money

and Management, 22(1), 43-48.

Wößmann, L. (2003). Schooling resources, educational institutions and student performance:

The international evidence. Oxford Bulletin of Economics and Statistics, 65(2), 117–170.

Wößmann, L., Lüdemann, E., Schütz, G., & West, M. R. (2007). School accountability,

autonomy, choice, and the level of student achievement: International evidence from

PISA 2003. doi: http://dx.doi.org/10.1787/19939019

Wright, S. P., Horn, S. P., & Sanders, W. L. (1997). Teacher and classroom context effects

on student achievement: Implications for teacher evaluation. Journal of Personnel

Evaluation in Education, 11(1), 57-67.

Zhang, L., & Lee, K. A. (2011). Decomposing achievement gaps among OECD countries.

Asia Pacific Education Review, 12(3), 463–474. doi:10.1007/s12564-011-9151-3

Appendix A: Teacher Evaluations in Public Schools (2002)

Country | Are all teachers evaluated periodically? | Scope of evaluation procedures described | Recipients and frequency | Evaluator | Criteria | Tools | Linkage to professional development | Response to ineffective teachers

Australia (note 1) | Generally, yes | State of Victoria Performance and development plan | All teachers, annually | Internal (principals) and senior teachers | State-wide performance standards appropriate to the teachers’ career stage | Demonstrated performance (e.g., student learning, data documentation agreed with principal) | Helps set priorities | Salary increment withheld; Improvement plan; Further evaluation

Austria | No, only for changes in employment status, for promotion, or as a result of complaint | Summative performance evaluations | Teachers for promotion, or conversion to permanent contract | Internal; External (Inspection) | Student performance; pedagogical knowledge of teacher; Permanent teaching performance; In-service training; Other skills | Classroom observation | No | Permanent contract not granted; Improvement plan; Further evaluation

Belgium (Flemish) (note 2) | Yes | Whole country | All teachers, with no fixed periodicity | Internal (principals) | M | M | No | Dismissal

Belgium (French) | Yes | Whole country | All teachers, with no fixed periodicity | Internal (principal); External | M | M | No | M

Canada (Quebec) (note 3) | No, only when teachers are the subject of a complaint or for change in employment status | Complaint procedure | Teachers who are the subject of a complaint | Internal (school administration) | M | M | Advice | Improvement plan

Chile | Yes, both individual and as part of school evaluation; Monetary rewards possible as a result of special evaluation procedure undertaken either on a voluntary or mandatory basis | National Teacher Excellence Award | 50 teachers, national annual competition | Peer assessment, school community, external | Community acknowledgement of performance throughout career | Teacher test; Documentation of performance throughout career | Yes | A

Chile | | National Performance Evaluation System | All teachers in a given school based on school performance, every 2 years | External | Mostly student performance but taking account of school’s socioeconomic cluster | Set of indicators agreed upon by Ministry | No | A

Chile | | Teaching Performance Evaluation System | All teachers, every 4 years | Self-assessment, peer assessment, principal and external | Subject and pedagogical knowledge, teaching performance and other skills (Good teaching framework) | Portfolio, interview, classroom videos | Yes | Improvement plan; Further evaluation; Dismissal

Chile (cont.) | | Pedagogical excellence reward | Teachers on a voluntary basis, annually if teachers wish | External | Subject and pedagogical knowledge, teaching performance and other skills | Written test, portfolio, video | Yes | A

Denmark (note 4) | No, only when teachers are subject of a complaint | Complaint procedure | Teachers who are the subject of a complaint | Internal (principals) | Teaching performance; Other skills | Classroom observation; Interview | Compulsory training | Improvement plan; Compulsory training; Further evaluation; Suspension; Dismissal

France (note 5) | Yes | Administrative grade in secondary schools | All teachers, annually | Internal (principal) | Authority, punctuality, among others | M | M | Deferral of promotion

France | | Pedagogical grade in secondary schools | All teachers, with no fixed periodicity | External | Subject and pedagogic knowledge; teaching performance | Classroom observation; Interview | M | Deferral of promotion

Germany (note 6) | Generally not, only for promotion or as a result of a complaint | Land of Baden-Württemberg | All teachers | Internal (principals) | M | M | M | M

Hungary (note 7) | At the discretion of the school | School evaluation | Teachers as part of school evaluation; periodically | External | M | M | M | M

Hungary | | Individual teacher evaluation | M | Internal (principal) | M | M | M | M

Ireland | All teachers are evaluated periodically but in the context of a whole school approach | School evaluation | Teachers as part of whole school evaluation | External | Student performance; Subject and pedagogical knowledge of teachers; Teaching performance | Classroom observation | Advice | In primary and vocational education sectors: Improvement plan; Further evaluation; Dismissal

Italy | No, unless teacher is the subject of complaint | Complaint procedure | Teachers who are the subject of a complaint | External | M | Classroom observation | M | M

Japan | Generally not. Since 2000 some prefectural boards of education introduced teacher evaluation | City of Tokyo | All teachers, periodically | Internal (principals); Self-evaluation | M | Documentation on teacher; Interview; Classroom observation | Advice | Deferral of promotion

Korea | Yes | Whole country | All teachers, periodically | Internal (principals); Self-evaluation | M | Classroom observation; Documentation on teachers | No | Deferral of promotion

Mexico | No, only through a voluntary application to Carrera Magisterial (CM) or Escalafón Vertical (EV), or as a result of a complaint. In practice, all the teachers are enrolled in EV and around 70% of them in CM | Carrera Magisterial | Teachers on a voluntary basis, periodically | Internal; External | Student performance; Subject and pedagogical knowledge of teacher; Teaching performance; In-service training; Other skills | Documentation on teacher; Student survey; Teacher test | No | Deferral of promotion

Mexico | | Escalafón Vertical | Teachers on a voluntary basis | External | In-service training; Other skills | Documentation on teacher | No | Deferral of promotion

Netherlands | Generally yes. No regulations exist at national level; school boards responsible for evaluation | Whole country | All teachers, periodically | Internal (principals) | Subject and pedagogical knowledge of teacher, teaching performance; Other skills | Classroom observation; Interview | Advice | M

Norway | No, only when teachers request it, for promotion or as a result of a complaint; either rarely occurs. The emphasis is on school evaluation | Whole country | Teachers for promotion; Teachers who are the subject of a complaint | Internal (principals) | M | M | M | M

Slovak Republic | Yes, teachers are evaluated by school inspection, if they are the subject of a complaint, and for defining the level of allowances received | School inspection | Teachers as part of school evaluation | External | Subject and pedagogical knowledge of teacher; Teaching performance | M | M | M

Slovak Republic | | Allowance | M | M | M | M | M | M

Slovak Republic | | Complaint procedure | Teachers who are the subject of a complaint | Internal (principals) | M | Classroom observation; Interview; Documentation on teacher; Student survey | M | Transfer; Salary reduction; Dismissal

Spain | No, evaluation occurs only when teachers want to become principals, apply for a study leave, and when they are the subject of a complaint | Application for study leave or complaint procedure | Teachers on a voluntary basis; Teachers who are the subject of a complaint | External | Student performance; Subject and pedagogical knowledge of teacher; Teaching performance | Classroom observation; Interview; Documentation on teacher; Student survey | No | M

Sweden | Yes, teachers are evaluated by principals and the discussion of performance includes decisions on rewards. This is in a context where the emphasis is on school evaluation | Whole country | Teachers as part of school evaluation | Internal (principals, peer review); External; Self-evaluation | Student performance; Subject and pedagogical knowledge of teacher; Teaching performance; In-service training; Other skills | Classroom observation; Interview; Documentation on teacher; Student survey | Advice | Improvement plan; Further evaluation; Deferral of promotion; Transfer

Switzerland | Generally, yes. The majority of cantons focus on school evaluation. A few cantons link teachers' assessment with salaries | Canton of St. Gallen | Teachers for promotion | External; Self-evaluation | Subject and pedagogical knowledge of teacher; In-service training; Other skills | Classroom observation; Documentation on teacher | Advice | Deferral of promotion

Switzerland | | Canton of Zürich | Teachers for promotion | External; Self-evaluation | M | Classroom observation; Interview; Documentation on teacher | Advice | Improvement plan; Deferral of promotion

United Kingdom (note 8) | Yes. Links to salaries possible as a result of special evaluation procedures undertaken on a voluntary basis | England (Performance management) | All teachers, periodically | Internal (principals) | Subject and pedagogical knowledge of teacher; Student performance; Other skills | Classroom observation | Advice; Compulsory training | M

United Kingdom | | England, Wales (Threshold assessment) | Teachers on a voluntary basis for promotion | External; Internal (principals) | Subject/pedagogical knowledge; Student performance; In-service training; Other skills | Documentation on teacher | Advice | M

United Kingdom | | England, Wales (Advanced Skills Teacher) | Teachers on a voluntary basis for promotion | External | Subject/pedagogical knowledge; Student performance; Others | Documentation on teacher; Interview; Class observation | M | M

United States (note 9) | Generally, yes. Several school districts have introduced schemes which link teachers' assessments to salaries | Whole country | All teachers | Internal (principals) | M | Classroom observation | Compulsory training | Compulsory training; Further evaluation

United States | | Cincinnati | All teachers | M | Subject/pedagogical knowledge of teacher; Other skills | M | M | Further evaluation; Salary loss

United States | | Cincinnati | All teachers as part of school evaluation | M | Student performance | M | M | M

United States | | Douglas County | All teachers | M | M | M | Compulsory training | Improvement plan

United States | | Douglas County | Teachers on a voluntary basis | M | Subject and pedagogical knowledge of teacher; In-service training; Student performance; Other skills | M | M | M

United States | | Douglas County | Teachers on a voluntary basis as part of school evaluation | M | Subject/pedagogical knowledge; In-service training; Student performance; Other skills | M | M | M

United States | | Kentucky | All teachers periodically, as part of school evaluation | M | Student performance; Other skills | M | M | M

United States | | Charlotte-Mecklenburg | All teachers periodically, as part of school evaluation | M | Student performance | M | M | M

The following countries have special arrangements for teacher evaluation:

Finland: No evaluations, only when teachers are the subject of a complaint. No regulation exists at national level. Evaluation is at school, regional or national levels and individual teachers are generally not evaluated. The local education provider has the responsibility for evaluation. Based on an official complaint, individual teachers may be assessed by the provincial government.

Israel: No regulation exists at national level. Once teachers obtain tenure, they are no longer evaluated. Inspectors make an individual assessment of a teacher at the request of the principal in case performance problems are identified.

Greece: No. Under a law enacted in 2002, all individual teachers should be periodically evaluated by external evaluators and principals. However, this scheme has not yet been implemented. Currently no systematic teacher evaluation exists.

Notes: This table excludes evaluations of school principals and teachers in their probationary period.

“A” Information not applicable because the category does not apply; “M” Information not available.

1. There are two evaluation schemes: summative evaluation and formative evaluation. More emphasis is given to summative performance

evaluation.

2. Job description describes the roles and tasks of teaching staff (currently for secondary teachers only). Teachers are evaluated against the job

description.

3. The complaint procedure is not regulated at Province level. Apart from this procedure, teachers are evaluated only when they go through the

probationary period or apply for tenure.

4. Evaluation of the individual performance of teachers rarely takes place, and it is mainly based on a complaint.

5. Promotion is based on a ranking of teachers for which the evaluation of performance is not the major factor. More dominant factors are years of

experience and the ranking achieved at the entrance examination.

6. Teachers are rarely evaluated after they obtain tenure except for promotion decisions and when serious performance problems arise. Moving up

to the next salary step depends essentially on years of experience.

7. There is no national scheme for the regular evaluation of individual teachers. Some forms of school-level evaluation in which teachers'

performance is evaluated have been introduced. Teachers may be provided with allowances for outstanding performance, although this

procedure is not regulated at national level.

8. An annual teacher evaluation has been introduced in England, Wales and Northern Ireland. In Scotland, annual appraisal is offered on a

voluntary basis. Some evaluation schemes linked with promotion or monetary rewards have been introduced.

9. Practices in each state differ. The table indicates general trends and some innovative practices.

Source: OECD (2005, pp.189-191), which is further derived from the Background Reports prepared by countries participating in the project and

other country-specific documents.

Appendix B: How School Systems use Student Assessments

Columns: use of assessment or achievement data for benchmarking and information purposes

Infrequent use: Provide comparative information to parents: 32%; Compare the school with other schools: 38%; Monitor progress over time: 57%; Post achievement data publicly: 20%; Have their progress tracked by administrative authorities: 46%

Frequent use: Provide comparative information to parents: 64%; Compare the school with other schools: 73%; Monitor progress over time: 89%; Post achievement data publicly: 47%; Have their progress tracked by administrative authorities: 79%

Rows: use of assessment or achievement data for decision making

Infrequent use: Make curricular decisions: 60%; Allocate resources: 21%; Monitor teacher practices: 50%

Frequent use: Make curricular decisions: 88%; Allocate resources: 40%; Monitor teacher practices: 65%

School systems in each cell

Infrequent use for decision making, infrequent use for benchmarking and information: Austria, Belgium, Finland, Germany, Greece, Ireland, Luxembourg, Netherlands, Switzerland, Liechtenstein

Infrequent use for decision making, frequent use for benchmarking and information: Hungary, Norway, Turkey, Montenegro, Tunisia, Slovenia

Frequent use for decision making, infrequent use for benchmarking and information: Denmark, Italy, Japan, Spain, Argentina, Macao-China, Chinese Taipei, Uruguay

Frequent use for decision making, frequent use for benchmarking and information: Australia, Canada, Chile, Czech Republic, Estonia, Iceland, Israel, Korea, Mexico, New Zealand, Poland, Portugal, Slovak Republic, Sweden, United Kingdom, United States, Albania, Azerbaijan, Brazil, Bulgaria, Colombia, Croatia, Dubai (UAE), Hong Kong-China, Indonesia, Jordan, Kazakhstan, Kyrgyzstan, Latvia, Lithuania, Panama, Peru, Qatar, Romania, Russian Federation, Shanghai-China, Singapore, Thailand, Trinidad and Tobago, Serbia

Source: OECD (2010a, p.78).

Appendix C: Criteria for Teacher Appraisal and Feedback (2007-08)

Country | Student test scores | Retention and pass rates | Other student learning outcomes | Student feedback on teaching | Feedback from parents | Relations with colleagues

Australia 51.4 51.8 62.1 58.4 54.7 69.7

Austria 45.2 19.7 51.5 70.9 73.4 73.7

Belgium (Fl.) 53.2 52.0 47.9 59.1 51.4 78.3

Brazil 78.0 78.4 84.1 88.4 76.7 87.9

Bulgaria 88.4 72.6 78.5 81.0 64.2 85.5

Denmark 28.6 25.3 44.5 60.7 56.4 70.0

Estonia 72.1 65.8 77.4 79.2 71.7 75.0

Hungary 55.2 56.8 71.3 67.2 72.6 76.4

Iceland 44.9 40.3 52.8 78.6 76.3 77.8

Ireland 72.0 70.9 67.7 59.4 66.8 74.0

Italy 62.5 59.8 82.5 85.9 89.2 89.6

Korea 66.3 32.4 59.2 62.2 56.1 64.4

Lithuania 62.8 50.9 74.0 82.3 80.1 78.8

Malaysia 95.7 57.0 91.0 94.1 83.9 94.3

Malta 56.2 55.4 64.3 71.3 70.2 77.6

Mexico 84.5 86.6 77.9 82.9 66.7 75.3

Norway 47.3 41.6 55.8 59.9 68.2 79.3

Poland 87.2 66.2 84.6 82.8 86.6 89.3

Portugal 64.4 75.2 71.0 82.7 73.3 80.5

Slovak Republic 76.0 48.8 68.0 81.7 70.4 74.2

Slovenia 61.4 45.6 61.6 60.3 59.8 73.1

Spain 69.5 73.9 66.5 54.9 59.7 60.8

Turkey 72.6 65.9 79.2 71.7 61.5 75.7

Appendix C: Criteria for Teacher Appraisal and Feedback (2007-08) (Continued)

Country | Direct appraisal of classes | Innovation in teaching | Relation with students | Professional development undertaken | Classroom management | Content knowledge

Australia 59.9 66.5 80.1 48.8 69.8 72.4

Austria 77.6 69.8 85.7 44.5 77.7 76.4

Belgium (Fl.) 77.5 67.2 82.5 63.9 74.4 73.3

Brazil 90.1 87.7 93.7 83.1 89.6 92.5

Bulgaria 88.9 80.4 90.1 85.5 92.1 91.4

Denmark 40.7 35.7 75.7 46.4 61.6 47.1

Estonia 78.2 77.0 90.4 79.4 86.1 86.0

Hungary 80.2 69.6 80.2 55.5 82.1 89.7

Iceland 44.1 57.0 84.0 50.0 66.6 66.4

Ireland 69.5 68.6 86.1 58.0 84.7 82.4

Italy 79.9 79.9 94.7 75.5 94.6 92.2

Korea 67.8 62.6 69.8 63.5 74.3 64.8

Lithuania 80.1 80.0 89.8 67.7 81.3 89.8

Malaysia 96.3 96.2 96.6 91.0 96.6 97.8

Malta 77.1 68.2 84.2 47.1 83.1 78.4

Mexico 86.6 80.9 84.9 76.4 79.2 88.1

Norway 48.4 40.4 86.2 50.8 73.5 72.1

Poland 94.3 87.1 94.8 87.0 91.3 94.6

Portugal 55.3 69.4 90.9 66.4 76.4 78.6

Slovak Republic 83.3 79.0 83.3 62.1 72.6 82.7

Slovenia 76.1 68.7 80.7 53.2 68.7 78.0

Spain 62.0 59.5 75.8 55.3 75.7 65.6

Turkey 75.3 75.3 79.1 71.1 82.0 79.0

Appendix C: Criteria for Teacher Appraisal and Feedback (2007-08) (Continued)

Country | Pedagogical knowledge | Teaching students with special learning needs | Student discipline and behavior | Teaching in multicultural settings | Extracurricular activities with students

Australia 66.7 41.2 63.1 29.1 51.7

Austria 71.8 53.5 77.3 33.7 65.0

Belgium (Fl.) 72.5 54.3 64.9 31.6 52.0

Brazil 91.1 68.0 88.0 76.5 81.2

Bulgaria 90.5 61.7 85.8 68.9 83.0

Denmark 41.1 39.5 56.3 22.9 42.5

Estonia 87.0 60.2 84.5 33.9 69.8

Hungary 89.0 65.5 81.7 52.0 73.4

Iceland 62.4 48.8 68.2 22.9 25.9

Ireland 80.1 56.4 79.9 40.1 63.5

Italy 90.3 81.5 92.5 70.6 77.9

Korea 68.1 45.8 68.7 31.8 37.1

Lithuania 88.0 61.4 80.5 48.9 73.5

Malaysia 97.5 49.2 94.8 81.9 81.4

Malta 73.4 44.9 79.5 32.6 61.3

Mexico 87.7 64.2 85.5 67.8 66.2

Norway 63.1 55.2 72.6 21.0 22.3

Poland 94.7 71.5 95.1 40.0 80.3

Portugal 78.9 58.2 80.2 47.9 72.9

Slovak Republic 83.9 62.2 80.6 44.0 65.6

Slovenia 79.3 52.1 65.2 27.1 58.6

Spain 63.4 66.2 79.1 56.0 59.8

Turkey 77.6 54.0 74.5 53.6 67.6

Note: Only includes those teachers who received appraisal or feedback. Percentage of teachers of

lower secondary education who reported that the above criteria were considered with high or moderate

importance in the appraisal and/or feedback they received.

Source: OECD (2009b, pp. 179-80).

Appendix D: Impact of Teacher Appraisal and Feedback upon Teaching (2007-08)

Country | Classroom management practices | Knowledge or understanding of the teacher’s main subject field(s) | Knowledge or understanding of instructional practices | A teacher development or training plan to improve their teaching

Australia 24.1 19.4 22.1 18.4

Austria 21.9 16.4 24.9 16.7

Belgium (Fl.) 20.5 16.7 20.1 16.4

Brazil 60.1 59.9 59.2 52.9

Bulgaria 68.4 58.8 62.2 56.5

Denmark 18.2 10.9 11.1 12.4

Estonia 30.3 32.7 35.7 28.9

Hungary 36.2 24.3 32.2 44.7

Iceland 24.0 20.3 23.0 36.9

Ireland 25.2 18.7 24.5 21.3

Italy 33.4 32.2 38.8 38.7

Korea 36.0 45.1 48.1 48.6

Lithuania 39.4 50.1 54.2 46.1

Malaysia 86.7 88.5 89.2 81.6

Malta 24.6 20.0 21.5 25.3

Mexico 74.8 69.1 71.3 74.1

Norway 28.5 23.0 21.1 24.0

Poland 45.5 31.3 38.2 47.6

Portugal 22.4 18.8 23.0 26.8

Slovak Republic 36.4 42.8 44.8 35.7

Slovenia 47.6 34.8 44.0 46.1

Spain 25.2 12.5 16.6 20.5

Turkey 35.2 33.3 36.3 39.4

Appendix D: Impact of Teacher Appraisal and Feedback upon Teaching (2007-08) (Continued)

Country | Teaching students with special learning needs | Student discipline and behavior | Teaching in multicultural settings | The emphasis placed on improving student test scores in teaching

Australia 14.2 21.0 8.1 24.7

Austria 18.6 20.4 8.3 19.5

Belgium (Fl.) 19.1 20.1 8.2 19.6

Brazil 26.8 53.7 44.0 65.6

Bulgaria 41.5 63.3 44.1 74.5

Denmark 13.9 19.5 6.3 19.3

Estonia 19.4 26.9 10.8 30.4

Hungary 32.2 32.4 19.8 30.4

Iceland 22.8 30.0 12.6 26.6

Ireland 19.3 23.4 12.0 26.7

Italy 37.2 36.9 29.5 44.0

Korea 33.5 47.0 21.4 39.7

Lithuania 32.2 43.7 23.0 46.7

Malaysia 45.7 83.9 73.9 91.5

Malta 17.7 25.7 9.6 31.3

Mexico 42.0 67.1 53.1 76.7

Norway 24.2 28.6 7.0 25.7

Poland 26.4 31.9 10.8 53.9

Portugal 21.4 26.9 14.7 35.5

Slovak Republic 31.3 26.9 18.9 41.1

Slovenia 38.3 45.8 15.2 52.1

Spain 22.9 27.2 17.0 24.6

Turkey 25.9 40.0 26.7 43.0

Note: Only includes those teachers who received appraisal or feedback. Percentage of teachers of

lower secondary education who reported that the appraisal and/or feedback they received directly

led to or involved moderate or large changes in the above.

Source: OECD (2009b, p. 187).

Appendix E: Outcomes of Teacher Appraisal and Feedback (2007-08)

Country | A change in salary | Financial reward or bonus | Career advancement | Public recognition

Australia 5.6 1.6 16.9 24.1

Austria 1.1 1.7 4.7 27.1

Belgium (Fl.) 0.4 0.1 3.7 20.7

Brazil 8.2 5.5 25.6 47.8

Bulgaria 26.2 24.2 11.6 64.9

Denmark 2.2 2.7 4.7 25.3

Estonia 14.3 19.8 10.5 39.6

Hungary 9.4 25.1 10.7 40.2

Iceland 7.5 9.3 8.6 18.3

Ireland 3.5 1.4 13.3 24.8

Italy 2.0 4.0 4.9 46.4

Korea 5.2 8.3 12.7 31.0

Lithuania 17.3 22.0 14.3 55.4

Malaysia 33.0 29.0 58.2 58.6

Malta 1.7 1.2 8.2 19.3

Mexico 10.6 7.3 28.6 33.4

Norway 7.0 3.0 6.9 25.6

Poland 14.5 26.5 39.2 55.7

Portugal 1.7 0.6 6.2 26.3

Slovak Republic 19.7 37.3 20.8 40.7

Slovenia 14.2 19.4 39.4 43.3

Spain 1.8 1.6 8.6 25.1

Turkey 2.2 3.6 13.5 42.6

Appendix E: Outcomes of Teacher Appraisal and Feedback (2007-08) (Continued)

Country | Professional development opportunity | Change in work responsibility | Role in school development initiative

Australia 16.7 17.4 24.1

Austria 8.0 14.7 17.2

Belgium (Fl.) 7.1 11.9 10.1

Brazil 27.8 47.7 41.6

Bulgaria 42.4 28.2 49.5

Denmark 25.6 19.0 16.3

Estonia 35.6 21.7 31.3

Hungary 22.8 12.3 28.7

Iceland 20.5 18.1 19.2

Ireland 13.4 16.0 23.2

Italy 19.2 27.1 38.3

Korea 17.1 24.1 24.9

Lithuania 42.4 39.9 42.8

Malaysia 50.8 76.4 64.1

Malta 7.8 15.1 16.7

Mexico 27.2 55.9 34.4

Norway 21.3 14.5 22.4

Poland 38.2 24.6 42.1

Portugal 11.3 25.3 25.3

Slovak Republic 28.7 30.0 35.9

Slovenia 36.2 24.5 28.7

Spain 13.2 16.9 20.7

Turkey 12.1 33.7 24.4

Note: Only includes those teachers who received appraisal or feedback. Percentage of teachers of

lower secondary education who reported that the appraisal and/or feedback they received led to a

moderate or large change in the above aspects of their work and careers.

Source: OECD (2009b, p. 181).

Appendix F: Variable Definitions and Measurements

Variable (original name in parentheses) | Definition | Measurement

Developmental: Monitoring in test language

1) Student achievement (stachvmnt) | Student achievement used to evaluate teachers in test language

2) Peer reviews (trprvw) | Teacher peer reviews used to evaluate teachers in test language

3) Principal and staff observations (prstffobs) | Principal and staff observations used to evaluate teachers in test language

4) External observations (extob) | External observations used to evaluate teachers in test language

Measurement, variables 1-4: Categorical. Principals were asked if any of the following approaches were used to monitor the practice of teachers in test language. Response measured as Yes=1, No=2. ‘Yes’ dummy coded as 1.

Developmental: Principals’ pedagogical role and use of student assessments for improving instruction

5) Principals’ observation of classes (obsclsspisa) | I observe instruction in classrooms

6) Principals’ suggesting teachers for improvement (sggsttrs) | I give teachers suggestions as to how they can improve their teaching

7) Principals informing teachers about possibilities for updating their knowledge and skills (infmtrknwupdte) | I inform teachers about possibilities for updating their knowledge and skills

8) Student assessments used for instructional improvement (instrctnlimp) | Assessments of students used to identify aspects of instruction or the curriculum that could be improved

Measurement, variables 5-7: Categorical. Principals responded to the item “Below you can find statements about your management of this school. Please indicate the frequency of the following activities and behaviors in your school during the last school year.” Responses recorded as 1=Never, 2=Seldom, 3=Quite often, 4=Very often. Responses 3 and 4 dummy coded as 1.

Measurement, variable 8: Principals responded to the item “In your school, are assessments of students in <national modal grade for 15-year-olds> used for any of the following purposes?” Yes=1, No=2. ‘Yes’ dummy coded as 1.

High-stakes

9) Public accountability (pbaccnt) | Achievement data are posted publicly (e.g., in the media)

10) Student assessments used for evaluating teachers (treval) | Achievement data are used in evaluation of teachers' performance

11) Student assessments tracked by administrative authority (admntrck) | Achievement data are tracked over time by an administrative authority

12) Student assessments used to judge teacher effectiveness (jdgtreffct) | Assessments of students used to make judgments about teachers’ effectiveness

Measurement, variables 9-11: Principals were asked if achievement data are used in any of the accountability procedures (achievement data include aggregated school or grade-level test scores or grades, or graduation rates). Response measured as Yes=1, No=2. ‘Yes’ dummy coded as 1.

Interactions

13) Obstalisinfmpar | Classroom observations given moderate to high importance in teacher evaluation x parents are informed about their children’s progress

14) obstalistrsalin | Classroom observations given moderate to high importance in teacher evaluation x principal is responsible for making salary changes

15) obspisaprivatei | Principal observes classes x independent private school

16) evledexp | Teacher evaluation x dollars spent on education

Measurement, variables 13-16: Interactions between variables within school level as well as between school and country levels.

Control Variables (Student Level)

17) Student sex (girl) | Student gender: “Are you female or male?” (female dummy coded as 1) | Categorical: Female=1, Male=2

18) Student age (stage) | Age of student: “On what date were you born?” | Continuous: AGE = (100 + Ty – Sy) + (Tm – Sm)/12, where Ty and Sy are the year of the test and the year of the tested student’s birth, and Tm and Sm are the month of the test and the month of the student’s birth, respectively. Results rounded to two decimal places.

19) Student grade (grade) | Student grade: “What <grade> are you in?” | Continuous: A relative grade index that indicates whether a student is at the modal grade (value of 0), above (+) or below (-) the modal grade in the country.

20) First generation immigrant (immig1) | First generation immigrant

21) Second generation immigrant (immig2) | Second generation immigrant

Measurement, variables 20-21: Categorical. Information on country of birth of student, mother and father and an index on immigrant background were obtained, and categories made as native (1), second generation (2, dummy coded as 1), and first generation (3, dummy coded as 1).

22) Home language (hlangothr) | Language spoken at home is other than test language | Categorical: Based on language spoken at home: 1 = home language is the same as test language, 2 = home language is other than test language (2 dummy coded as 1).


23) Socioeconomic status (escs)
Definition: PISA index of economic, social and cultural status, derived from the indices HISEI, PARED, and HOMEPOS. HOMEPOS comprises information on cultural possessions, books, educational resources, and wealth.
Measurement: Continuous: ESCS = (β1 HISEIʹ + β2 PAREDʹ + β3 HOMEPOSʹ)/Ɛf, where β1, β2 and β3 are the OECD factor loadings, HISEIʹ, PAREDʹ and HOMEPOSʹ are the “OECD-standardized” variables, and Ɛf is the eigenvalue of the first principal component.

Control Variables (School Level)

24) Principal’s sex (sc27q01)
Definition: Principal’s gender: “Are you female or male?”
Measurement: Categorical: Female = 1, Male = 2. Female dummy coded as 1.

25) School type (public)
Definition: School type: “Is your school a public or a private school?”
Measurement: Categorical: (1) public schools controlled and managed by a public education authority or agency, (2) government-dependent private schools (receive more than 50% of their core funding from government), (3) government-independent private schools (receive less than 50% of their core funding from government). Public dummy coded as 1.

26) School size (schsize)
Definition: School size: “As at <February 1, 2009>, what was the total school enrolment (number of students)?”
Measurement: Continuous: an index of total school enrollment, including boys and girls.

27) Teacher shortage (tcshort)
Definition: Teacher shortage: “Is your school’s capacity to provide instruction hindered by any of the following issues?”
Measurement: Continuous: an index of teacher shortage based on principals’ perceptions of the factors affecting instruction at school.

28) Proportion of qualified teachers (propqual)
Definition: Proportion of qualified teachers in school.
Measurement: Continuous: index of the proportion of qualified teachers (ISCED 5A): propqual = ISCED 5A teachers/total number of teachers.


29) Proportion of girls (pcgirl)
Definition: Proportion of girls in school.
Measurement: Continuous: number of girls/total enrollment.

30) Student teacher ratio (stratio)
Definition: Student-teacher ratio.
Measurement: Continuous: stratio = school size/total number of teachers.

31) Proportion of computers connected to the web (compweb)
Definition: Proportion of computers connected to the web that can be used by students in the modal grade for 15-year-olds.
Measurement: Continuous: compweb = number of computers for educational purposes connected to the web/number of computers for educational purposes available to students in the modal grade for 15-year-olds.

Control Variables (Country)

32) Professional outcomes
Definition: A factor obtained through principal component analysis and regression scores. It comprises country variables (teacher percentages): student test scores, retention and pass rates, other student learning outcomes, direct appraisal of classes, innovation in teaching, and professional development undertaken.
Measurement: Continuous: generated as regression scores after component analysis.

33) Others
Definition: The second factor on teacher evaluation criteria: feedback from parents and relations with colleagues.
Measurement: Continuous: generated as regression scores after component analysis.

34) Outcomes and impacts of teacher appraisals
Definition: A component obtained through principal component analysis and regression scores on country variables (teacher percentages) on outcomes and impacts of teacher evaluations: change in salary, public recognition, career advancement, emphasis placed on improving student test scores, professional development opportunity, role in school development initiatives, and change in work responsibilities.
Measurement: Continuous: generated as regression scores after component analysis.

35) Educational expenditure (edexp)
Definition: Dollars spent on education (obtained by multiplying GDP and expenditure on education).
Measurement: Continuous: gdppp*expenditure.

Source: Based on information from the school and student questionnaires, the school and student codebooks, and the technical report of the PISA 2009 survey.
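The measurement rules in Appendix F can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the code used to build the analysis file: the function names are hypothetical, the age formula is assumed to take two-digit years (which the 100 + Ty – Sy term suggests), and the factor loadings and eigenvalue passed to the ESCS function are placeholders rather than the OECD values.

```python
def pisa_age(test_year, test_month, birth_year, birth_month):
    """AGE = (100 + Ty - Sy) + (Tm - Sm)/12, rounded to two decimals.

    Assumption: Ty and Sy are two-digit years and the student was
    born before 2000, so four-digit inputs are reduced modulo 100.
    """
    ty = test_year % 100
    sy = birth_year % 100
    return round((100 + ty - sy) + (test_month - birth_month) / 12, 2)


def dummy_code(response, coded_as_one):
    """Return 1 when the response equals the category coded as 1, else 0."""
    return 1 if response == coded_as_one else 0


def safe_ratio(numerator, denominator):
    """Ratio indices such as pcgirl, stratio, propqual, and compweb.

    Returns None when the denominator is zero (hypothetical handling).
    """
    if denominator == 0:
        return None
    return numerator / denominator


def escs_composite(hisei_z, pared_z, homepos_z, loadings, eigenvalue):
    """ESCS = (b1*HISEI' + b2*PARED' + b3*HOMEPOS') / eigenvalue.

    The inputs are assumed to be already OECD-standardized; the
    loadings and eigenvalue are supplied by the caller.
    """
    b1, b2, b3 = loadings
    return (b1 * hisei_z + b2 * pared_z + b3 * homepos_z) / eigenvalue


# A student born in October 1993 and tested in April 2009.
print(pisa_age(2009, 4, 1993, 10))   # 15.5

# Questionnaire coding Female = 1, Male = 2; 'girl' is 1 for female.
girl = dummy_code(1, coded_as_one=1)

# Immigrant background: native = 1, second generation = 2, first generation = 3.
immig2 = dummy_code(3, coded_as_one=2)
immig1 = dummy_code(3, coded_as_one=3)

# Student-teacher ratio for a hypothetical school of 600 students and 40 teachers.
stratio = safe_ratio(600, 40)
```

Keeping each rule in its own small function makes it easy to check one variable at a time against the codebook.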


Appendix G: Principal Component Analysis of Criteria for Teacher Appraisal and Feedback

Table G1

Principal Components/Correlation for Teacher Appraisal and Feedback

Component Eigenvalue Difference Proportion Cumulative

Comp1 5.97 4.82 0.75 0.75

Comp2 1.15 0.76 0.14 0.89

Comp3 0.39 0.19 0.05 0.94

Comp4 0.20 0.05 0.02 0.96

Comp5 0.14 0.04 0.02 0.98

Comp6 0.10 0.08 0.01 0.99

Comp7 0.03 0.01 0.00 1.00

Comp8 0.02 . 0.00 1.00
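The Proportion and Cumulative columns of Table G1 follow directly from the eigenvalues, which sum to about 8, the number of appraisal criteria, as expected when components are extracted from a correlation matrix. The sketch below recomputes them in Python. The eigenvalue-greater-than-one retention rule shown here is an assumption for illustration; this appendix does not state which rule was applied, although it is consistent with the two components reported in Table G2. Because the published eigenvalues are rounded, recomputed figures can differ from the table in the second decimal place.

```python
from itertools import accumulate

# Eigenvalues as printed in Table G1 (rounded to two decimals).
EIGENVALUES_G1 = [5.97, 1.15, 0.39, 0.20, 0.14, 0.10, 0.03, 0.02]


def explained_variance(eigenvalues):
    """Return (proportions, cumulative proportions) of total variance."""
    total = sum(eigenvalues)
    proportions = [ev / total for ev in eigenvalues]
    cumulative = list(accumulate(proportions))
    return proportions, cumulative


def retained_components(eigenvalues, threshold=1.0):
    """1-based indices of components whose eigenvalue exceeds the threshold.

    Assumption: a Kaiser-style cutoff of 1.0, used only as an example.
    """
    return [i for i, ev in enumerate(eigenvalues, start=1) if ev > threshold]


proportions, cumulative = explained_variance(EIGENVALUES_G1)
for i, (ev, p, c) in enumerate(zip(EIGENVALUES_G1, proportions, cumulative), start=1):
    print(f"Comp{i}: eigenvalue={ev:.2f} proportion={p:.2f} cumulative={c:.2f}")

print("Retained:", retained_components(EIGENVALUES_G1))
```

The same two functions can be applied to the eigenvalues in Table H1, where only the first component exceeds 1.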

Table G2

Promax Rotated Component Loadings of Criteria for Teacher Appraisal and Feedback

Variable Comp1 Comp2 Unexplained

Student test scores 0.36 -0.49 0.09

Retention and pass rates 0.34 -0.43 0.22

Other student learning outcomes 0.38 . 0.11

Feedback from parents on teaching . 0.56 0.11

Relations with colleagues . 0.51 0.06

Direct appraisal of classes 0.38 . 0.13

Innovation in teaching 0.39 . 0.07

Professional development undertaken 0.39 . 0.09


Table G3

Scoring Coefficients for Components on Criteria for Teacher Appraisal and Feedback

Variable Comp1 Comp2

Student test scores 0.37 -0.49

Retention and pass rates 0.35 -0.44

Other student learning outcomes 0.38 0.03

Feedback from parents on teaching 0.25 0.56

Relations with colleagues 0.29 0.50

Direct appraisal of classes 0.38 -0.01

Innovation in teaching 0.39 0.05

Professional development undertaken 0.39 0.01
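As an illustration of how scoring coefficients such as those in Table G3 can be turned into the country-level component scores described in Appendix F, the Python sketch below forms each score as a weighted sum of standardized variables. This is a sketch under assumptions, not the procedure confirmed by the source: the function names are hypothetical, standardization is assumed to use the sample standard deviation, and any input percentages supplied by a reader would be their own data rather than values from this study.

```python
from statistics import mean, stdev

# Scoring coefficients (Comp1, Comp2) as printed in Table G3.
SCORING_COEFFICIENTS = {
    "Student test scores": (0.37, -0.49),
    "Retention and pass rates": (0.35, -0.44),
    "Other student learning outcomes": (0.38, 0.03),
    "Feedback from parents on teaching": (0.25, 0.56),
    "Relations with colleagues": (0.29, 0.50),
    "Direct appraisal of classes": (0.38, -0.01),
    "Innovation in teaching": (0.39, 0.05),
    "Professional development undertaken": (0.39, 0.01),
}


def standardize(values):
    """Convert raw values to z-scores (assumes the sample standard deviation)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def component_scores(z_row):
    """Weighted sums of one country's standardized criteria.

    z_row maps a criterion name from Table G3 to that country's z-score.
    Returns a (Comp1 score, Comp2 score) tuple.
    """
    comp1 = sum(SCORING_COEFFICIENTS[name][0] * z for name, z in z_row.items())
    comp2 = sum(SCORING_COEFFICIENTS[name][1] * z for name, z in z_row.items())
    return comp1, comp2


# A country at the cross-country mean on every criterion scores 0 on both components.
average_country = {name: 0.0 for name in SCORING_COEFFICIENTS}
print(component_scores(average_country))
```

A country one standard deviation above the mean on student test scores alone would move up on the first component and down on the second, mirroring the signs in Table G3.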

Appendix H: Principal Component Analysis of Outcomes and Impacts of Teacher Appraisal and Feedback

Table H1

Principal Components/Correlation of Outcomes and Impacts of Teacher Appraisal and Feedback

Factor Eigenvalue Difference Proportion Cumulative

Comp1 4.41 3.62 0.74 0.74

Comp2 0.80 0.36 0.13 0.87

Comp3 0.44 0.23 0.07 0.94

Comp4 0.21 0.11 0.04 0.98

Comp5 0.10 0.07 0.02 0.99

Comp6 0.03 . 0.01 1.00

Table H2

Promax Rotated Component Loadings of Outcomes and Impacts of Teacher Appraisal and Feedback

Variable Comp1 Uniqueness

Emphasis placed on improving student test scores 0.40 0.31

Change in salary 0.41 0.27

Career advancement 0.41 0.25

Public recognition 0.34 0.49

Professional development opportunity 0.45 0.10

Role in school development 0.43 0.17


Table H3

Scoring Coefficients for Component on Outcomes and Impacts of Teacher Appraisal and Feedback

Variable Comp1

Emphasis placed on improving student test scores 0.40

Change in salary 0.41

Career advancement 0.41

Public recognition 0.34

Professional development opportunity 0.45

Role in school development 0.43

CURRICULUM VITAE

Gulab Khan

Education

2013 Doctor of Philosophy, Educational Theory and Policy

College of Education, Pennsylvania State University, University Park, United States

2005 Master of Education, Educational Leadership and Management

Aga Khan University, Institute for Educational Development, Karachi, Pakistan

1999 Master of Science, Chemistry

Quaid-i-Azam University, Islamabad, Pakistan

1996 Bachelor of Science

University of the Punjab, Lahore, Pakistan

Experience

2010-2013 Head of Monitoring, Evaluation and Research (Currently on study leave)

Aga Khan Education Service, Pakistan

2008 Academic Coordinator

Aga Khan Education Service, Pakistan

2006-2010 Principal

Aga Khan Education Service, Pakistan

2005-2006 Vice/Acting Principal

Aga Khan Education Service, Pakistan

1999-2005 Lecturer Chemistry

Aga Khan Education Service, Pakistan

Publications/Research

Khan, G. (2010). Exploring principal-student relationships in a private secondary school in Pakistan. In Khaki, J. A., & Safdar, Q. (Eds.). Educational leadership in Pakistan: Ideals and Realities (pp. 129-150). Karachi: Oxford University Press.

Khan, G. (2008). Lost sailor gets ashore. In Bashiruddin, A., & Retallick, J. (Eds.) (2008). Becoming a teacher in the developing world. A monograph. AKU-IED Publications.

Zhang, L., Khan, G., & Tahirsylaj, A. (Work in progress). Student Performance, School Differentiation, and World Cultures: Evidence from PISA 2009.