
STRUCTURAL VALIDITY AND RELIABILITY OF

TWO OBSERVATION PROTOCOLS IN

COLLEGE MATHEMATICS

by

LAURA ERIN WATLEY

JIM GLEASON, COMMITTEE CHAIR

YUHUI CHEN

DAVID CRUZ-URIBE

KABE MOEN

JEREMY ZELKOWSKI

A DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

in the Department of Mathematics in the Graduate School of

The University of Alabama

TUSCALOOSA, ALABAMA

2017

Copyright Laura Erin Watley 2017

ALL RIGHTS RESERVED

ABSTRACT

Undergraduate mathematics education is being challenged to improve, with peer eval-

uation, student evaluations, and portfolio assessments as the primary methods of formative

and summative assessment used by instructors. Observation protocols like the Mathemat-

ics Classroom Observation Protocol for Practices (MCOP2) and the abbreviated Reformed

Teaching Observation Protocol (aRTOP) are another alternative. However, before these

observation protocols can be used in the classroom with confidence, a study needed to be

conducted to examine both the aRTOP and the MCOP2. This study was conducted at three

large doctorate-granting universities and eight master's and baccalaureate institutions. Both

the aRTOP and the MCOP2 were evaluated in 110 classroom observations during the Spring

2016, Fall 2016, and Spring 2017 semesters. The data analysis allowed conclusions regarding

the internal structure, internal reliability, and relationship between the constructs measured

by both observation protocols.

The factor loadings and fit indices produced from a Confirmatory Factor Analysis (CFA)

found a stronger internal structure of the MCOP2. Cronbach’s alpha was also calculated

to analyze the internal reliability for each subscale of both protocols. All alphas were in

the satisfactory range for the MCOP2 and below the satisfactory range for the aRTOP.

Linear regression analysis was also conducted to estimate the relationship between the

constructs of both protocols. We found a positive and strong correlation between each

pair of constructs with a higher correlation between subscales that do not contain Content

Propositional Knowledge. This leads us to believe that the Content Propositional Knowledge subscale

of the aRTOP measures a distinct construct, and measures it poorly, so it needs to

be assessed using another method. As noted above and detailed in the body of the work, we

find support for the Mathematics Classroom Observation Protocol for Practices (MCOP2)

as a useful assessment tool for undergraduate mathematics classrooms.


DEDICATION

This dissertation is dedicated to my parents and my husband. To my parents, Douglas

and Edith: Thank you for your unconditional love, guidance, and support. You have always

believed in me and encouraged me to strive for my dreams. I would not be who I am today

without you. To my husband, Kyle: Thank you for the unwavering love, support, and

encouragement. You have made my dreams yours and given me the strength to accomplish

them.


ACKNOWLEDGMENTS

The completion of this Dissertation would not have been possible without the support

and guidance of a few very special people in my life. I would first like to give thanks to our

Lord and Savior for leading me on this path. It is only through his grace and mercy that

any of this was possible.

Next I would like to thank Dr. Jim Gleason for his endless support and encouragement.

You have been a patient and caring mentor during this process. I cannot tell you how much I

value the time and effort you have put into me and my aspirations. I would also like to thank

the other members of my dissertation committee: Dr. Yuhui Chen, Dr. David Cruz-Uribe,

Dr. Kabe Moen, and Dr. Jeremy Zelkowski. I am forever grateful for the invaluable input

that has led to a strong dissertation.

To the Mathematics Department at The University of Alabama, you hold my gratitude

for dedicating your time to sharing your passion for mathematics with students like me. I

would like to thank Dr. Zhijian Wu and Dr. Vo T. Liem, the Department Chair and Graduate

Program Director when I entered, for accepting me into the program and encouraging me

at the beginning of this process. To the current Department Chair and Graduate Program

Director, Dr. David Cruz-Uribe and Dr. David Halpern, your encouragement and advisement

in these last years have been vital to my success.

To the MTLC instructors at The University of Alabama, it is because of you that I am

the teacher I am today. You have instilled in me a sense of what it is to love mathematics

and to share that love with others. I will never forget all you have taught me and shared

with me over the years.

To my fellow graduate students at The University of Alabama, I cannot imagine this

experience with anyone else. To Bryan Sandor and Anne Duffee, I am so glad we found each

other. You both have been there for me when the challenges of graduate school seemed too

great to overcome. The University of Alabama will always hold a special place in my heart.


To the seventy-two mathematics instructors who selflessly allowed me to observe your

classes for this study: you have done more than just open your classrooms to me; you have

opened my eyes to new ideas and expanded my love for teaching. To the institutions that

allowed me to observe, I will always cherish the time I spent on your campus.

To the mathematics department at Troy University, you have instilled in me the foun-

dation that has led to my dissertation. You not only shared your passion for mathematics,

but you opened my eyes to the limitless possibilities in mathematics. I will never forget your

kind words and support. Troy University will always hold a special place in my heart.

I want to also acknowledge my family members who constantly supported me and be-

lieved that I could achieve my goals. To my parents, Douglas and Edith Watley, thank you

for your relentless encouragement, unfailing support, and unconditional love. None of this

would have been possible without you. Finally, I want to thank my husband, Kyle Scarbrough,

and our furry friend, Wesley. You both have stood by me throughout this process. You have

been patient with me when I needed it, you celebrated with me when even the littlest things

went right, and you loved me through it all.


TABLE OF CONTENTS

ABSTRACT

DEDICATION

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER 1 - INTRODUCTION

CHAPTER 2 - LITERATURE REVIEW

    Content Knowledge for Teaching

    Student Evaluations

        Reliability and Validity

    Peer Evaluations

    Portfolios

    Observation Protocols

        Reformed Teaching Observation Protocol

        Mathematics Classroom Observation Protocol for Practices

CHAPTER 3 - METHODS

    Aim of Study

    Participants

    Instruments

    Procedures

CHAPTER 4 - RESULTS

    Internal Structure

    Internal Reliability

    Relationship between the Constructs


CHAPTER 5 - DISCUSSION

    Study Limitations

    Conclusions

    Future Direction

REFERENCES

APPENDICES

APPENDIX A. OVERVIEW OF OBSERVATION PROTOCOLS

APPENDIX B. INSTRUCTOR DEMOGRAPHICS

APPENDIX C. INSTRUMENTS USED

APPENDIX D. REGRESSION MODELS AND RESIDUAL PLOTS

APPENDIX E. IRB CERTIFICATION


List of Tables

1 Subscales as Predictors of the RTOP Total Score

2 Interpretation of the RTOP Factor Pattern

3 aRTOP Items and Design

4 Brief Description of MCOP2 Items

5 Number of Observations at Each Institution

6 Recommendations for Model Evaluation: Some Rules of Thumb

7 Simple Linear Regression Results

8 Pearson's Product-Moment Correlation

9 Demographic Characteristics of the Sample


List of Figures

1 Theoretical Model of aRTOP

2 Theoretical Model of MCOP2

3 Confirmatory Factor Analysis Results: aRTOP

4 Confirmatory Factor Analysis Results: MCOP2

5 Residual Plots of Regression Model 1

6 Regression Model 1: Student Engagement and Inquiry Orientation

7 Regression Model 2: Student Engagement and Inquiry Orientation

8 Regression Model 3: Teacher Facilitation and Inquiry Orientation

9 Regression Model 4: Teacher Facilitation and Content Propositional Knowledge

10 Regression Model 5: Inquiry Orientation and Content Propositional Knowledge

11 Regression Model 6: Student Engagement and Teacher Facilitation


CHAPTER 1

INTRODUCTION

The colleges and universities in the United States are being challenged to improve Sci-

ence, Technology, Engineering, and Mathematics (STEM) undergraduate education (Boyer

Commission on Educating Undergraduates in the Research University, 1998; National Re-

search Council, 1996, 1999, 2002, 2012; National Science Foundation, 1996, 1998), with

college and university STEM professors asked to lead this challenge. These same college and

university professors are experts in their area of study, have received multiple degrees, and

make contributions to their field resulting in awards and publications. However, they have

had little or no formal training in teaching and learning, and obtain most, if not all, of their

professional development in education during graduate school as teaching assistants. Once

they finish graduate school, the primary professional development comes from reflection on

the formative and summative assessments from peer observation, student evaluations, and

an assessment of their portfolio. Although each of these evaluation methods can provide use-

ful information, it is difficult to compare and analyze the information obtained from these

methods. In general, these methods rely on broad questions that produce subjective information

with low concurrence among raters.

A useful tool in the process of improving the quality of Science, Technology, Engineering,

and Mathematics (STEM) education is the development of aggregate methods to quantify the

state of teaching and learning in order to compare different teaching and learning strategies.

Observation protocols provide a quantifiable method useful for improving and strengthening

STEM undergraduate education. The two most common uses for observation protocols are to

support professional development and to evaluate teaching quality (Hora & Ferrare, 2013b).

Observation protocols provide a way to collect numerical data representing observed variables


describing the classroom environment and activities. These data can then be systematically

analyzed using statistical techniques to create meaningful ways to evaluate the scholarship

that professors use in their teaching.

The quantifiable understanding we gain from the use of observation protocols is in-

valuable. College and university professors can use this information to identify personal

strengths and weaknesses. They can easily compare and contrast the information obtained

from semester to semester to track growth in their teaching effectiveness. The use of

observation protocols opens the door for professors to assess their teaching effectiveness in

different types of classrooms. The information that can be gained from observation protocols

is limitless at both the individual level and collectively for the university.

Although there are a multitude of observation protocols in use (see Appendix A), the

Mathematics Classroom Observation Protocol for Practices (MCOP2) and an abbreviated

Reformed Teaching Observation Protocol (aRTOP) are the most applicable toward the aim

of this study. The Reformed Teaching Observation Protocol is one of the most widely

used observation protocols in the mathematics classroom, and the Mathematics Classroom

Observation Protocol for Practices (MCOP2) measures the practices of students and

teachers in the mathematics classroom and how they align with the process standards

of the National Council of Teachers of Mathematics (2000); the Standards for Mathematical

Practice from the Common Core State Standards in Mathematics (National Governors

Association Center for Best Practices, Council of Chief State School Officers, 2010);

recommendations from "Crossroads" and "Beyond Crossroads" of the American Mathematical

Association of Two-Year Colleges (1995, 2004); the Committee on the Undergraduate

Program in Mathematics Curriculum Guide from the Mathematical Association of America

(Barker et al., 2004); and the Conference Board of the Mathematical Sciences statement on

"Active Learning in Post-Secondary Mathematics Education" (2016). The MCOP2 is a 16-item

protocol that measures the two primary constructs of teacher facilitation and student

engagement.


The Reformed Teaching Observation Protocol (RTOP) was designed by the Evaluation

Facilitation Group of the Arizona Collaborative for Excellence in the Preparation of Teachers

to measure "reformed" teaching and is said to be standards-based, inquiry-oriented, and

student-centered (Piburn & Sawada, 2000). The RTOP is a 25-item classroom observation protocol

on a 5-point Likert scale that measures the three primary constructs of lesson design and

implementation, content, and classroom culture. Although the RTOP has been the most

widely used observation protocol for mathematics classrooms during the past 10 to 15 years,

the review of literature revealed serious issues with the proposed structure and reliability

that led us to select the ten items we call the abbreviated Reformed Teaching Observation

Protocol (aRTOP).

There are limitations that we must account for in this study. One limitation is the use

of convenience sampling to collect the data. Time and travel costs forced us to use this

sampling technique; however, our sample was chosen strategically, from a diverse range of

institutions based on enrollment demographics and types of degrees offered, so that it

reasonably represents the larger population of undergraduate institutions in the United

States and is unlikely to yield unusual classroom observations. The potential for observer

bias is another limitation to this project. Such biases could include gender, ethnicity, age,

teaching methodology, and course structure. Although it is impossible to remove the human

element from this study, we were aware of the potential for observer bias and made every

effort to avoid it. Being cognizant of potential biases and taking them into account is the

key strategy for avoiding researcher bias (Johnson & Christensen, 2014).

The goal of this project is to have a clear understanding of both the MCOP2 and the

aRTOP and the relationship between these two protocols as it relates to undergraduate

mathematics classrooms. Therefore, we pose the following research questions:

1. What are the internal structures of the Mathematics Classroom Observation Protocol

for Practices (MCOP2) and the abbreviated Reformed Teaching Observation Protocol

(aRTOP) for the population of undergraduate mathematics classrooms?


2. What are the internal reliabilities of the subscales of the Mathematics Classroom Ob-

servation Protocol for Practices (MCOP2) and the abbreviated Reformed Teaching Ob-

servation Protocol (aRTOP) with respect to undergraduate mathematics classrooms?

3. What are the relationships between the constructs measured by the Mathematics Class-

room Observation Protocol for Practices (MCOP2) and the abbreviated Reformed

Teaching Observation Protocol (aRTOP)?


CHAPTER 2

LITERATURE REVIEW

Increased accountability in higher education has fostered a need to evaluate and develop

the effectiveness of undergraduate teaching in mathematics. The call for accountability

creates a demand in postsecondary institutions to provide quantifiable evidence of the ef-

fectiveness of their academic programs (National Research Council, 2002). Unfortunately,

there is no widely accepted definition or agreed upon criteria of effective teaching at the

undergraduate level (Clayson, 2009). However, there is growing consensus that active learn-

ing is a critical component of the college mathematics classroom (Conference Board of the

Mathematical Sciences, 2016). Student evaluations, peer evaluations, observation protocols,

and portfolios are the most common methods used in the current environment to evaluate

teaching effectiveness. This chapter will give a brief summary of each of the above listed

methods of measuring teaching effectiveness and review the benefits and barriers to each

evaluation method.

Content Knowledge for Teaching

What makes someone an effective teacher? Is having a strong understanding of teaching

procedures enough? Or is strong subject matter knowledge the key to effective teaching?

Shulman saw a strong disregard of the content being taught in the policies of the 1980s.

Shulman (1986) did not want to belittle the importance of the pedagogical skills being high-

lighted in these policies, but rather bring attention to the importance of content knowledge

for teachers by creating a theoretical framework to model the categories of content knowledge


which he identifies as subject matter content knowledge, pedagogical content knowledge, and

curricular knowledge.

Content knowledge for teaching refers to the amount and organization of knowledge in

the minds of the teachers. The understanding of facts and concepts is only a part of subject

matter content knowledge. It requires a much deeper understanding of the structure of the

subject matter. It is not enough for a teacher to merely understand something, but they

must also know why it is so, when it can be applied, and when it is weakened or no longer

applies (Shulman, 1986).

Subject matter knowledge is a necessary, but not sufficient, condition for someone to be

an effective teacher. Shulman’s second category of content knowledge, pedagogical content

knowledge, is a combination of the teacher’s subject matter knowledge and the knowledge

utilized to teach that subject. A few of the examples Shulman provides of pedagogical

content knowledge are (a) the knowledge needed to represent and formulate the subject that

makes it comprehensible to others, (b) the knowledge of what makes a particular subject

difficult or easy to comprehend, and (c) the knowledge of conceptions and misconceptions that

students bring with them from previous learning (Shulman, 1986).

Curricular knowledge is the knowledge of what programs and materials are designed to

teach a specific subject to a given student level. It also includes the knowledge of the variety

of materials available to teach a specific subject. Most importantly, it is the knowledge

teachers use to select or reject a particular curriculum in a given circumstance. In addition

to the knowledge of curriculum materials, curricular knowledge includes lateral curriculum

knowledge (relationship of the content to other subjects) and vertical curriculum knowledge

(relationship of the content to previous and future learning of the same subject).

Shulman’s theoretical framework was designed to focus on the nature and type of knowl-

edge needed for teaching a subject. He did not provide us with a list of necessary knowl-

edge for any particular subject area, rather Shulman’s paper acted as a catalyst for other

researchers to expand on his ideas into their particular subjects. In 2008, Ball and her


colleagues examined and expanded Shulman’s ideas in the context of mathematics. Ball,

Thames, & Phelps (2008) developed in more detail Shulman’s idea of subject matter knowl-

edge and pedagogical content knowledge for teachers in the context of mathematics.

Ball, Thames, & Phelps (2008) divided content knowledge into two categories, subject

matter knowledge and pedagogical content knowledge. This theoretical model did not include

a third category like Shulman's model. Ball et al. (2008) decided curricular knowledge was

a domain of subject matter knowledge and pedagogical content knowledge. Ball, Thames,

& Phelps (2008) divided subject matter knowledge into three domains: common content

knowledge (CCK), specialized content knowledge (SCK), and horizon content knowledge

(HCK). Ball et al. (2008) also expanded Shulman's idea of pedagogical content knowledge

into three domains: knowledge of content and students (KCS), knowledge of content and

teaching (KCT), and knowledge of content and curriculum (KCC).

Common content knowledge (CCK) is the mathematical knowledge that teachers use,

but is not specialized to the work of teachers. Teachers must be able to work a problem

correctly, recognize an incorrect answer, and generally know the content. Ball et al. (2008)

believe that common content knowledge (CCK) is knowledge that teachers use but that is also used in

settings other than teaching. Although this knowledge is considered "common," that does not make

it any less important. Students’ knowledge of the subject will be negatively affected if a

teacher does not have common content knowledge.

Specialized content knowledge (SCK) is defined as the knowledge and skills unique to

teaching mathematics. Ball et al. (2008) state, "this work (SCK) involves an uncanny

kind of unpacking of mathematics that is not needed – or even desirable – in settings other

than teaching” (p. 400). This is the knowledge that is only needed when teaching mathe-

matics. For example, recognizing students’ common errors and understanding nonstandard

approaches by students are examples of specialized content knowledge (SCK) that only a teacher would

need. The distinction between common content knowledge and specialized content knowl-


edge, while clear in the elementary school context, becomes more difficult to measure at the

undergraduate level.

Horizon content knowledge (HCK) is the third domain of subject matter knowledge and

corresponds to portions of Shulman’s curricular knowledge. This HCK is the knowledge of

how to introduce a specific topic with the prior and future understandings of this topic in

mind. For example, knowing how inequalities will be taught in a later class would influence

how one would introduce the number line. Teachers should have the knowledge of how a

particular subject will be used in the future in order to introduce it in the best way. There

is still some concern over whether this should be solely in the category of subject matter

knowledge or if it should be included in other categories (Ball et al., 2008).

The idea of pedagogical content knowledge was also divided into three domains: knowl-

edge of content and students (KCS), knowledge of content and teaching (KCT), and knowl-

edge of content and curriculum (KCC). Ball, Thames, & Phelps (2008) tell us, “two domains

- knowledge of content and students (KCS) and knowledge of content and teaching (KCT)

coincide with the two central dimensions of pedagogical content knowledge identified by Shul-

man” (p.402). KCS is the combination of the knowledge teachers know about their students

and mathematics. Teachers must understand how their students will approach a particular

problem and the struggles they will encounter. Alternatively, KCT is the combination of the

knowledge teachers have about mathematics and teaching. For example, teachers have to

know the order to introduce topics and what to spend more time on.

Ball also included Shulman’s curriculum knowledge as a domain of pedagogical content

knowledge based on the work of Grossman, Wilson, & Shulman (1989). The definition of

knowledge of content and curriculum (KCC) is left somewhat unclear by Ball et al. (2008). Although Ball

et al. (2008) placed curriculum knowledge under pedagogical content knowledge, there is

still some concern whether it belongs only there or in several different categories.

Shulman (1986) poses the question of how expert students become novice instructors.

We must ask ourselves, how do teachers acquire knowledge of teaching? Most college and


university professors are experts in the content they are teaching, but most do not have any

formal background in education or pedagogical training. Most professors have participated

in little to no formal teacher preparation and typically have not taken any education courses

(Speer & Hald, 2008). The majority of training is based on the limited supervised training

that professors obtain during their teaching assistantships in graduate school.

The research of Speer & Hald (2008) assert that mathematics education research in K-12

has sought to document the extent teachers possess pedagogical content knowledge and the

effect it has on students learning and teaching practices. Similar research in higher education

is just now emerging and is relatively scarce. The research available on higher education

pedagogical content knowledge focuses on Graduate Teaching Assistants (GTAs) and their

training programs. The dissertation of Ellis (2014) gives us a wealth of information on GTA

professional development programs and GTA beliefs and practices. Another dissertation

focused on the differences in the beliefs and practice of international and U.S. domestic

mathematical teaching assistants (Kim, 2011). Kung & Speer (2007) focus their research on

the need for professional development activities for GTAs and the empirical research needed

to create these activities.

Being knowledgeable in mathematics is necessary, but alone is not a sufficient condition

for an instructor to create good learning opportunities for students (Speer & Hald, 2008). If

we could improve mathematics instructors' knowledge of student thinking, Kung & Speer

(2007) believe this would foster better learning opportunities for students. The hope is this

will, in turn, lead to improved student achievement in higher education.

Student Evaluations

With the increase in accountability within higher education, student evaluations are

becoming even more widely used as a measure of quality in university teaching. Clayson

(2009) brought to our attention that student evaluations of teaching are the subject of one of the most

well-researched, documented, and long-lasting debates in the academic community. In fact,


d’Apollonia and Abrami (1997) stated “most postsecondary institutions have adopted stu-

dent ratings of instruction as one (often the most influential) measure of instructional effec-

tiveness” (p. 1198). Chen and Hoshower (2003), as well as, Benton, Cashin, and Kansas

(2012) propose that student evaluations are commonly used to provide formative feedback to

faculty for improving teaching, course content and structure; a summary measure of teach-

ing effectiveness for promotion and tenure decisions; and information to students for the

selection of courses and teachers.

Student evaluations of instruction were first introduced to the United States in the

mid-1920s (Algozzine et al., 2004). Since then there have been waves of research, including

studies which have verified the validity and effectiveness of student ratings (Algozzine et al.,

2004; Benton & Cashin, 2012; Cashin, 1995; Centra, 1993, 2003, 2009; Centra & Gaubatz,

2000; Clayson, 2009; Davis, 2009; L. Ellis, Burke, Lomire, & McCormack, 2003; Marsh, 1984,

2001; Marsh, Hau, Balla, & Grayson, 1998; Marsh, Hau, & Wen, 2004; Marsh & Hocevar,

1991; Marsh & Roche, 2000; Marsh & Ware, 1982; Socha, 2013; Sojka, Gupta, & Deeter-

Schmelz, 2002; Shevlin, Banyard, Davies, & Griffiths, 2000; Ware Jr & Williams, 1975;

Wachtel, 1998). However, student evaluations have not always been met with complete

acceptance, and so we think it is important to now discuss some of the most common

misconceptions.

The literature on student evaluations varies widely. Benton (2012), Feldman (2007), and

Kulik (2001) catalog common misconceptions, some of which are still believed to be factual, that student

evaluations are (a) only a measure of showmanship, (b) indicators of concurrence only at a low

level, (c) unreliable and invalid, (d) time and day dependent, (e) student grade dependent,

(f) not useful in the improvement of teaching, and (g) affected by leniency in grading result-

ing in high evaluations. These myths seem to persist even though there is over fifty years of

credible research showing the reliability and validity of student evaluations. This research

has been ignored for reasons that include personal biases, suspicion, fear, ignorance, and

general hostility towards any evaluation process (Feldman, 2007; Benton & Cashin, 2012).


Since teaching comprises many characteristics, Spooren, Brockx, and Mortelmans

(2013) believe it is widely accepted that student evaluations are multidimensional.

Jackson et al. (1999) warn that there has been a dispute in the research as to the number

and nature of these dimensions. This causes student evaluation instruments

to vary greatly in the item content and the number of items.

In the 1990’s, researchers including Abrami & D’apollonia (1990) debated the use of

global constructs for the evaluation of teaching effectiveness. Eventually they came to a

compromise that the use of both specific dimensions and global measures could be used

for an overall rating. More recent research supports the multidimensionality of teaching,

by reporting that higher order factors can reflect general teaching effectiveness (Apodaca

& Grad, 2005; Burdsal & Harrison, 2008; Cheung, 2000). The research of Burdsal and

Harrison (2008) and Spooren et al. (2013) provides evidence that both a multidimensional

profile and an overall evaluation are valid indicators of students’ perception of teacher effec-

tiveness.

Reliability and Validity

Reliability refers to consistency, stability, and generalizability of data, and in the context

of student evaluations, most often refers to the consistency of the data (Cashin, 1995). The

consistency of student evaluations is highly influenced by the number of raters. In general,

the more raters, the more dependable the ratings. Also, multiple classes provide more

reliable information than a single class. Benton et al. (2012) suggest the use of more than

one class if there are fewer than 10 raters in order to improve reliability.
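
As a minimal illustration, not drawn from the cited studies, the Spearman-Brown prophecy
formula is the standard psychometric way to make this rater effect precise: if a single rating
has reliability r, then the average of k independent ratings has reliability

\[
  % Spearman-Brown prophecy formula (illustration only); r and k are generic symbols,
  % not values reported by Cashin (1995) or Benton et al. (2012).
  \rho_k \;=\; \frac{k\,r}{1 + (k-1)\,r}.
\]

For an assumed single-rater reliability of r = 0.25, averaging k = 10 ratings gives
\(\rho_{10} = 2.5/3.25 \approx 0.77\), while k = 30 gives \(\rho_{30} = 7.5/8.25 \approx 0.91\),
which is why more raters, and more classes, yield more dependable ratings.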

The validity of student evaluations has been extensively debated over the years, with

researchers often disagreeing as to the extent to which student evaluations measure the

construct of teaching effectiveness. A primary driver in this debate is the lack of agreement on

what defines effective teaching. One method to determine the validity of student evaluations


involves their relation to other forms of evaluation. The agreement or disagreement with these

other evaluation methods can give us greater insight into the validity of student ratings.

Logically, the best way to measure effective teaching would be to base it on the resulting

student learning and understanding. One would assume that a teacher who has high student

evaluations would also have students that are highly successful. Davis (2009) states, “Ratings

of overall teaching effectiveness are moderately correlated with independent measures of

student learning and achievement. Students of highly rated teachers achieve higher final

exam scores, can better apply course material, and are more inclined to pursue the subject

subsequently” (p. 534).

In a study at Minot State University, Ellis, Burke, Lomire, and McCormack (2003) found

that courses with the highest average grades were taught by teachers who received the highest

rating from their students. The study was composed of 165 undergraduate courses taught by

24 instructors. Ellis et al. reported a weak but significant positive correlation (r = 0.35, p < .01)

between average ratings of teachers and average grades received by students. They

warn this relation may be due to numerous factors, but it is most likely that giving higher

grades to students results in more favorable student evaluations. Clayson (2009) affirms that

“as statistical sophistication has increased over time, the reported learning/SET (student

evaluation of teaching) relationship has generally become more negative” (p. 26).

Although comparisons of student evaluations with colleague ratings, expert judges' ratings,

ratings by graduating seniors and alumni, and measures of student learning provide evidence of

the validity of student evaluations, many researchers are still concerned that students can be

easily swayed by superficialities (Socha, 2013). Concerns about students' ability to be

effective evaluators of teaching competency, and about its relationship to student learning, also plague

researchers. Algozzine et al. (2004) warn that student ratings should only be influenced by

characteristics that represent effective teaching and not by sources of bias. Marsh (1984)

defines a bias of student ratings as “substantially and causally related to the ratings and

relatively unrelated to other indicators of effective teaching” (p. 709).


One of the most controversial and most often discussed concerns is that high ratings can

be based solely on the faculty member's "entertaining" ability. The Dr. Fox Effect takes its

name from a study in which an actor delivered the lecture (Ware Jr & Williams, 1975). Although

“Dr. Fox” did not cover any material, he received a high rating because of his “entertaining”

value. Wachtel (1998) states, “This was thought to demonstrate that a highly expressive

and charismatic lecturer can seduce the audience into giving undeservedly high ratings” (p.

200). Since the original study, Marsh (1982) has cited several experts in the field who have raised

questions as to its validity.

In classrooms where there are incentives to understand the material, earlier studies found

that content covered has a much greater impact on student ratings than expressiveness. So-

jka, Gupta, & Deeter-Schmelz (2002) found that students and teachers have a different

perception of how a faculty member's "entertaining" ability affects student ratings. Faculty

believed that the ability to entertain has a great influence on ratings, while students strongly

disagreed. Shevlin, Banyard, Davies, and Griffiths (2000) state that expressiveness of teach-

ers is positively correlated with student evaluations regardless of the content taught. They

found that the charisma factor accounted for 69% of the variation in the rating of a teacher’s

ability as determined by student ratings (Shevlin et al., 2000).

The relationship between gender and student evaluations remains undetermined. One

study (L. Ellis et al., 2003) found that the gender of the instructor was not significantly

correlated with the student ratings. In another study (Centra, 2009), gender preferences

were found, mainly in the ratings of female instructors by female students. The research study

of Centra and Gaubatz (2000) agreed with this conclusion but warned that even though

these differences are statistically significant, they have little practical importance.

In comparison to other instructor variables, there is a relatively small amount of quantitative

data exploring the relationship between race and student evaluations. According to Merritt (2008), there is a lack of

empirical research examining this relationship. A study

conducted by Hamermesh & Parker (2005) of 436 classes reported minority faculty members


received lower teaching evaluations than majority instructors. Non-native English speakers

also received substantially lower ratings than their native speaking counterparts.

Logically, faculty ranking will have an impact on student evaluations. In a study con-

ducted by Centra (2009) with 1539 teaching assistants, the overall evaluation of the quality

of teaching in a course had a mean score of 3.83 on a 5-point scale, while their higher-ranking

colleagues, assistant professors and above, scored about a third of a standard deviation

higher on the overall evaluation. There is some question as to whether ranking or years of

experience are being represented in this study since the two correlate. However, Ellis et al.

(2003) found there was no significant correlation between years taught and the ratings of the

same instructor by students.

Like instructor variables, individual student variables can also influence evaluations of

teaching. Variables studied include age, gender, motivation, and personality of the student.

Also, individual academic characteristics of the student have been studied. Some of these

variables include scholastic level of the student, GPA, and reason for taking the course. Age

(Centra, 1993), gender (Feldman, 1977, 1993), and the level of students (McKeachie, 1979)

are not currently being researched, but have been in the past.

Student GPA and college required classes are two of the individual academic character-

istics that are currently being researched. In “Tools for Teaching”, Davis (2009) summarizes

the research on the relationship between student evaluations and student GPA. Citing sev-

eral authors, Davis (2009) concludes that there is little to no relationship noted for this

particular variable (Marsh & Roche, 2000; Abrami, 2001). Conversely, research has found

a slight bias against college-required courses. This is understandable given students may

be required to take a class in which they have little interest or background. Centra (2009)

suggests even though there is only a slight bias, institutions should take this into account

when reviewing student evaluation data.

The expected grade is probably the most researched student variable related to student

evaluation of instruction. Eiszler (2002) found that student evaluations are a small con-


tributor to grade inflation over time. Centra (2009) reports a correlation of .20 between

expected grades and teacher effectiveness. Ellis et al. (2003) state, "the magnitude

of the correlation has been in the range of .35 to .50, meaning that roughly 12% to 25%

of the variance in ratings might be accounted for by varying grading standards” (p. 39),

and they mention several researchers, including Mehdizadeh (1990) and Krautmann & Sander

(1999), who found a positive correlation between the expected (or received) course grade

and student evaluations.
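
The quoted percentages follow directly from squaring the correlation coefficient (the share of
variance in ratings associated with grading standards); as a quick check of the arithmetic,

\[
  r = 0.35 \;\Rightarrow\; r^{2} \approx 0.12, \qquad r = 0.50 \;\Rightarrow\; r^{2} = 0.25,
\]

so correlations between .35 and .50 correspond to roughly 12% to 25% of the variance.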

We also recognize that the actual courses have variables that the instructor cannot influ-

ence. For instance, class size, topic difficulty, and the level of the course are all characteristics

of the course beyond the control of the instructor. The time of day a class is taught is another

course variable that has been of interest to researchers in the past (Aleamoni, 1981; Feldman,

1978). The relationship between student evaluations of teaching and course characteristics

has been the subject of research over the years, and the results have been inconsistent.

Student evaluations did not significantly correlate with the level of the course according

to Ellis et al. (2003). However, lower level classes generally receive lower ratings than higher

level classes. This is especially true for graduate-level classes; however, this difference tends

to be small (Benton & Cashin, 2012). Benton et al. (2012) suggest the development of

local comparative data to help control for this difference. Class size can also have an effect

on student evaluations. Most researchers have found that larger classes cause instructors to

receive lower evaluations. Ellis et al. (2003) report that class size was significantly correlated with ratings,

while Hoyt and Lee (2002) found that it was not always statistically significant.

The academic discipline of the class being taught can affect the student ratings. In a

study by Centra (2009), a mean rating of 3.87 on a 5-point scale was found for courses in

the natural sciences, mathematics, engineering, and computer science, while the overall rating

for the humanities (English, history, language) was a mean of 4.04, a difference of about a third

of a standard deviation. Some have attributed this difference to the growth of

knowledge in the natural sciences causing teachers to cover increasing amounts of material.


The meta-analysis by Clayson (2009) supported these differences and stated that academic

disciplines are important variables to consider when reviewing student evaluation data.

Course load and difficulty are correlated with student evaluations, though not strongly. Sur-

prisingly, the correlation is positive, because students tend to give higher ratings to more

difficult courses that call for hard work (Marsh, 2001). Centra (2003) used a large database

of classes and, not surprisingly, found that classes that were either too elementary or too difficult were

rated poorly. The findings indicated that classes balanced in the middle were the highest

rated classes.

Consideration must be given to the manner (paper vs. electronic) in which the stu-

dent evaluations are collected. Ballantyne (2003), Bullock (2003), Spooren et al. (2013),

and Tucker, Jones, Straker, and Cole (2003) offer us the following reasons for the move

from paper to electronic student evaluations: timely and accurate feedback, no interruption

in class time, more accurate analysis of data, ease of access to students, greater student

anonymity, decreased faculty influence, more detailed written comments, and lower cost and

time demand for administrators.

One of the major concerns of online student evaluations is the response rate. Online

survey response is much lower than that of traditional paper surveys, with Dommeyer, Baum,

Hanna, and Chapman (2004) reporting an average response rate of 70% for in-class surveys

and 29% for online surveys. To help with low response rates, Dommeyer et al. (2004) suggest

using incentives to motivate students to complete the online survey. Leung and Kember (2005)

compared paper and electronic versions of the same survey and

found no significant differences between the data obtained from

paper and electronic evaluations. These results lead us to conclude that the differences in

manner (paper vs. electronic) did not affect the validity of student evaluations.

Since the very first reports on student evaluations by Remmers and Brandenburg (1928,

1930; 1927), there have been thousands of reports covering various topics on these evalua-

tions. Student evaluations can provide useful information about the instructor’s knowledge,


organization and preparation, and ability to communicate clearly. According to Chen and

Hoshower (2003), “while the literature supports that students can provide valuable informa-

tion on teaching effectiveness given that the evaluation is properly designed, there is a great

consensus in the literature that students cannot judge all aspects of faculty performance” (p.

73). Despite the controversies, student evaluations are still the most widely used evaluation

method. In general, researchers are in agreement that no single source of evaluation, includ-

ing student evaluations, can provide sufficient information in order to make valid judgments

on effective teaching.

Peer Evaluations

Compared to the extensive research on student evaluations of teaching, few studies exist

on peer evaluations, and those that do are limited in scope. The National Research Council (2002) found

that direct observation of teachers over an extended period of time by their peers can be

a highly effective means of evaluating an individual instructor. Even though professional

accountability in higher education has grown over the years, peer evaluations are not a

dominant practice in the assessment of teaching at most colleges and universities (Thomas,

Chie, Abraham, Raj, & Beh, 2014).

The scope of peer evaluations is not limited to what can be observed in a classroom, but

can include course outlines, syllabi, and teaching materials. Hatzipanagos and Lygo-Baker

(2006) suggest that peer reviews include observation of lectures and tutorials, monitoring

online teaching, examining curriculum design, and the use of student assessments. Peer

evaluations also create ways to improve adherence to the ethical standards set forth

by the university. Based on the above, we note that peer evaluations are more than just

classroom observations and can be instrumental in curriculum and professional development.

There are many benefits of peer review in developing faculty members. Peer reviews fur-

ther the development of teachers through the expert input from colleagues’ experience and

knowledge (Kohut, Burnap, & Yon, 2007). Peer evaluations are not just about identifying


places that need improvement, but also strengths. The benefits concluded from the literature

by Thomas et al. (2014) include the validation of teaching practices already being imple-

mented, inspiring different teaching perspectives, fostering learning about teaching methods,

and development of peer respect. Both the observer and the teacher being observed can use

this evaluation process to reflect on how to improve their teaching methods (Kohut et al.,

2007).

According to Bernstein, Jonson, and Smith (2000), only feedback gained from knowl-

edgeable peers leads to growth of teaching to its greatest potential. However, Thomas et

al. (2014) warn that peer evaluations are most beneficial towards quality teaching develop-

ment if the peer review program includes clear, straightforward, and transparent structure;

engagement in professional discussion and debate among participants; focus on the develop-

ment of teaching and learning to maintain the motivation and commitment toward the peer

review process; and willingness to consider the difficulties that may arise when engaging in

professional development activities.

Unfortunately, there are also many barriers to peer review of teaching unless the observa-

tions are part of a carefully conceived, systematic process (Wachtel, 1998). One of the major

barriers of peer observation is the low level of concurrence among observers, due to personal

biases about teaching behaviors and to inexperienced observers. Although faculty are experts in their

area of study, most do not have any formal training in education. Another barrier is that

peer evaluations generally are not a part of the culture of teaching and learning. Researchers

seem to agree that peer evaluation must be coupled with other evaluation methods in order

to provide accurate information.

Despite these reservations, peer evaluations are still an effective way to improve teaching.

Peer evaluation can provide the opportunity for faculty to learn how to be more effective

teachers, to get regular feedback on their classroom performance and to receive support from

colleagues. Educators advocate multiple sources for teaching improvement or for teaching

evaluation, and classroom observations provide a source of input that can be balanced against


some of the other more common forms of instructional feedback such as student evaluations

(Wachtel, 1998). Most importantly, peer evaluation can provide a third-party observation of what is occurring in a college classroom. This outside perspective can foster a renewed satisfaction

in teaching.

It is becoming obvious to increasing numbers of faculty that successful teachers are

not only experts in their fields of study but also knowledgeable about teaching strategies

and learning theories and styles, committed to the personal and intellectual development

of their students, cognizant of the complex contexts in which teaching and learning occur,

and concerned about colleagues’ as well as their own teaching (Keig & Waggoner, 1994).

The use of peer evaluations can provide a wealth of information that can lead to enhanced

teaching. Although there are numerous problems that raise concerns about the validity of peer evaluations, they can provide a vast amount of knowledge when coupled with other evaluation

methods.

Portfolios

Unlike other evaluation methods, which can only shed light on a small part of a teacher’s

effectiveness, portfolios have the ability to convey a broad range of a teacher’s skills, attitude,

philosophies, and achievement. Seldin and Miller (2009) define a portfolio as a reflective,

evidence-based collection of materials that document teaching, research, and service. A

professor’s portfolio usually includes an assertion about their teaching effectiveness along

with supporting documentation (Burns, 2000). This could include sample syllabi, student

work, student ratings, and comments from both students and colleagues.

There are many benefits of portfolios. Portfolios are not simply an exhaustive collection

of all the documents and materials a teacher has, but rather a balanced listing of professional

activities that provide evidence of teacher effectiveness (Seldin & Miller, 2009). They can

allow faculty to exhibit their teaching accomplishments to colleagues (Laverie, 2002). Burns

(2000) states that some institutions are beginning to require a portfolio as part of their


post tenure review. The key benefits of a portfolio, according to Seldin (2000), are that it

encourages faculty to reflect on and to improve their teaching.

Portfolios also have many negative qualities. Although there are numerous researchers

that praise the portfolio’s ability to improve teaching, Burns (2000) affirms that there are

no experiments that support this claim and goes on to even state, “The only experiment

that I could locate that compared teaching ratings before and after portfolio construction

concluded that these ratings did not improve significantly” (p. 45). When the impact of a

mandatory portfolio was studied by some researchers, the concern was that the creation of

the portfolio was the focus and not the improvement of teaching. Some of the other concerns

of faculty are: Is the time and energy that it takes to prepare a portfolio worth it? Does

the administration know how to use the information collected from the portfolio? For new

faculty, would a portfolio not be counterproductive?

With all these questions being posed, there is little research being conducted to answer them. Although a portfolio has the ability to be a very useful tool in the assessment of teaching effectiveness, without known reliability and validity, what does it really represent? Given the research that exists, we have to view portfolios with some reservation. Like all other evaluation methods, portfolios cannot stand alone, but they are one more tool that, if combined with other methods, can be useful in evaluating teacher effectiveness.

Observation Protocols

Classroom observations are direct observations of teaching practices, where the observer

either takes notes and/or codes teaching data live in the classroom or from a recorded video

lesson. The two most common uses for observation protocols are to support professional

development and to evaluate teaching quality (Hora & Ferrare, 2013b). We note that while

classroom observations are a very common practice in K-12 schools, observations are less

common in postsecondary settings, where further theoretical development and testing are needed.


Observation protocol development for K-12 is more advanced due in part to policies governing teaching evaluations (Hora, 2013). Postsecondary observation protocols are traditionally less developed in terms of psychometric testing and conceptual development (Hora & Ferrare, 2013b). Unfortunately, observation protocols in higher education trail far behind those

of K-12 (Pianta & Hamre, 2009). The most recently developed and currently utilized obser-

vation protocols in colleges and universities center on science, technology, engineering, and

mathematics (STEM) teaching (Hora & Ferrare, 2013b).

The development of aggregate methods of improvement in the quality of STEM edu-

cation is on the minds of institutions, disciplines, and national agencies (Seymour, 2002).

Smith, Jones, Gilbert, and Wieman (2013) cite several of these agencies that stress more

effective teaching in STEM courses, such as the President’s Council of Advisors on Science

and Technology Engage to Excel report (2012) and the National Research Council Discipline-

Based Education Research report (2012). The shift in teaching and learning of science and

mathematics towards student-centered instruction and active learning is increasing (Freeman

et al., 2014; Gasiewski, Eagan, Garcia, Hurtado, & Chang, 2012; Michael, 2006).

In The Greenwood Dictionary of Education, student-centered learning (SCL) is defined

as an “approach in which students influence the content, activities, materials, and pace of

learning” (Collins & O’Brien, 2003, pp. 338-339). If SCL is applied correctly, it can lead to growth in student enthusiasm for learning, retention of knowledge, understanding, and attitude

towards the subject being taught. Michael (2006) defines active learning as engaging the

students in activities that require some sort of reflection on the ideas. Students should be

actively gathering information, thinking, and problem solving during a class that uses active

learning. The meta-analysis by Freeman et al. (2014) of classrooms using active learning

reported that the average examination score improved by 6% over traditional lecturing. The authors also reported that students in traditional lecturing classes were 1.5 times more likely to fail

than those in active learning classes.


Sawada et al. (2002) warn that the development and use of an evaluation instrument that supports these efforts is problematic and controversial, and higher education institutions find it difficult to identify alignment of teaching to this construct. Walkington et al. (2012) believe that classroom observations are one of the best methods to combine with student

achievement to get a measure of teaching effectiveness. However, “generic observation in-

struments aimed at all disciplines and employed by observers without disciplinary knowledge

are not sufficient” (Walkington et al., 2012, p. 3). A protocol that is generic enough to be

useful in a mathematics and history class will lack complete understanding of the learning

and teaching process (Hora & Ferrare, 2013b). It is not reasonable to expect that a protocol

can be useful and generic enough to work for all different types of subject matter given the

obvious differences between disciplines.

There are two main types of observation protocols: unstructured (open-ended) and struc-

tured (Hora & Ferrare, 2013b). Unstructured protocols may not even indicate what the ob-

server should be looking for and in general do not have fixed responses. Although responses

to open-ended questions can be very useful to the observer and the instructor, the data is

very dependent on the observer and cannot easily be standardized (Smith et al., 2013). This

leads to difficulty in comparing the data across multiple classrooms.

On the other hand, observers respond to a structured protocol with a common set of

statements or codes (Smith et al., 2013). The data that is produced is easily standardized

and can be used to compare multiple classrooms. The drawback to most structured protocols

is the requirement of some sort of multi-day training in order to have inter-rater reliability

(Sawada et al., 2002). Observers must also pay close attention to the behavior of the teacher

and/or the students to assess the predetermined classroom dynamics.

It is impossible to include all the observation protocols that are used to evaluate under-

graduate courses, but Appendix A presents a brief summary of some of the existing protocols.

The two protocols used for this study are described in more detail below.


Reformed Teaching Observation Protocol

The Reformed Teaching Observation Protocol (RTOP) is probably the most widely

used STEM-specific observation protocol to date. This instrument was designed by the

Evaluation Facilitation Group of the Arizona Collaborative for Excellence in the Preparation

of Teachers (ACEPT) to measure “reformed” teaching. Sawada et al. (2002) tell us that

during the development of the RTOP the Evaluation Facilitation Group (EFG) affirmed

that “the instrument would have to be focused on both science and mathematics, standards

based, focused exclusively on reform rather than the generic characteristics of good teaching,

easy to administer, appropriate for classrooms K-20, valid, and reliable” (p. 246).

RTOP is a 25-item classroom observation protocol on a 5-point Likert scale that is said to be standards-based, inquiry-oriented, and student-centered. The items are divided into three subsets: Lesson Design and Implementation (5), Content (10), and Classroom Culture (10). The first subset, containing items 1-5, is designed to capture what the reference manual

calls the ACEPT model for reformed teaching. The second subset focuses on content and

is divided into two parts. These are Propositional Pedagogic Knowledge (items 6-10) and

Procedural Pedagogic Knowledge (items 11-15). The third subset is also divided into two

equal parts that analyze the classroom culture called Communication Interaction (items

16-20) and Student/Teacher Relationships (items 21-25).

After the initial development testing and redesign, a team of nine trained observers col-

lected 287 RTOP forms from the observation of over 141 mathematics and science classrooms.

The team consisted of seven graduate students and two faculty members. The classrooms

included ranged from middle school, high school, community colleges, and universities. Of

the 141 classrooms observed, only 38 (27%) were mathematics classrooms. Of the math-

ematics classrooms included, only 13 (34%) came from community college and university

observations. Since less than 10% of the sample focused on the undergraduate mathematics

classroom, and since these were exclusively mathematics courses designed for pre-service


elementary teachers, a more thorough analysis is necessary to determine the reliability and

structure of the instrument for general undergraduate mathematics classrooms.

Using the data collected by the nine trained observers, the inter-rater reliability was obtained by computing a best-fit linear regression of the observations of one observer on those of the other, with a correlation coefficient of 0.98 giving a shared variance between observers of 95%. Additionally, Cronbach’s alpha for the whole instrument was reported to be an astonishing 0.97, implying a high degree of uniformity across items, with the sub-scale alphas ranging from 0.80 to 0.93 (Piburn & Sawada, 2000; Sawada et al., 2002). This indicates that the RTOP has extremely strong internal consistency and can likely retain reasonable

reliability with significantly fewer items.
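As a minimal sketch of this kind of inter-rater agreement check (the score vectors below are hypothetical and only illustrate the computation; they are not the ACEPT data), one observer's totals can be regressed on the other's in R, with the shared variance read off as the squared correlation:

# Hypothetical RTOP totals from two observers scoring the same lessons
observer_a <- c(55, 72, 48, 90, 63, 81)
observer_b <- c(58, 70, 45, 93, 60, 85)
fit <- lm(observer_b ~ observer_a)   # best-fit linear regression of one observer on the other
summary(fit)$r.squared               # shared variance between observers (square of the correlation)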

The RTOP is divided into 5 sub-scales in order to test the hypothesis that “Inquiry-

Orientation” is a major part of the structure of RTOP (Piburn & Sawada, 2000). The

subscales and their R-squared values are in Table 1. Piburn & Sawada note that the high R-squared values offer very strong support for the construct validity. However, such high predictability of the total score by four of the sub-scales implies, at most, a two-factor structure.

Table 1

Subscales as Predictors of the RTOP Total Score

Subscale                                                      R-squared as a Predictor of Total
Subscale 1: Lesson Design and Implementation                  0.956
Subscale 2: Content Propositional Pedagogic Knowledge         0.769
Subscale 3: Content Procedural Pedagogic Knowledge            0.971
Subscale 4: Classroom Culture Communicative Interactions      0.967
Subscale 5: Classroom Culture Student/Teacher Relationships   0.941

(Piburn & Sawada, 2000, p. 12)

Piburn & Sawada (2000) also conducted an exploratory factor analysis of the 25 items on the RTOP protocol using a database containing 153 classroom observations and reported that an earlier reliability study implied the number of principal components to be very small. Two strong factors and one weak factor were found to be appropriate


and interpretable. Component 1 had an eigenvalue of 14.72, while components 2 and 3 had significantly lower eigenvalues of 2.08 and 1.18, respectively. These low eigenvalues indicate how weakly components 2 and 3 influence the measured construct. This is further illustrated by the factor pattern (Table 2) from the RTOP reference manual.

Table 2

Interpretation of the RTOP Factor Pattern

RTOP Items (factor loadings indicated by asterisks; see legend below)

1. The instructional strategies and activities respected students' prior knowledge and the preconceptions inherent therein. (**)
2. The lesson was designed to engage students as members of a learning community. (****)
3. In this lesson, student exploration preceded formal presentation. (****)
4. This lesson encouraged students to seek and value alternative modes of investigation or of problem solving. (****)
5. The focus and direction of the lesson was often determined by ideas originating with students. (***)
6. The lesson involved fundamental concepts of the subject. (****)
7. The lesson promoted strongly coherent conceptual understanding. (***)
8. The teacher had a solid grasp of the subject matter content inherent in the lesson. (**)
9. Elements of abstraction (i.e., symbolic representations, theory building) were encouraged when it was important to do so. (*)
10. Connections with other content disciplines and/or real world phenomena were explored and valued. (**)
11. Students used a variety of means (models, drawings, graphs, concrete materials, manipulatives, etc.) to represent phenomena. (**)
12. Students made predictions, estimations and/or hypotheses and devised means for testing them. (****)
13. Students were actively engaged in thought-provoking activity that often involved the critical assessment of procedures. (***)
14. Students were reflective about their learning. (***)
15. Intellectual rigor, constructive criticism, and the challenging of ideas were valued. (***)
16. Students were involved in the communication of their ideas to others using a variety of means and media. (***)
17. The teacher's questions triggered divergent modes of thinking. (**)
18. There was a high proportion of student talk and a significant amount of it occurred between and among students. (***)
19. Student questions and comments often determined the focus and direction of classroom discourse. (**)
20. There was a climate of respect for what others had to say. (*, **)
21. Active participation of students was encouraged and valued. (**, *)
22. Students were encouraged to generate conjectures, alternative solution strategies, and ways of interpreting evidence. (**)
23. In general the teacher was patient with students. (****)
24. The teacher acted as a resource person, working to support and enhance student investigations. (****)
25. The metaphor “teacher as listener” was very characteristic of this classroom. (***)

* (0.50-0.59), ** (0.60-0.69), *** (0.70-0.79), **** (0.80-0.99)
(Piburn & Sawada, 2000, p. 16)

Factor 1, named “inquiry orientation”, draws heavily on all five sub-scales with the exception of sub-scale 2, while factor 2, labeled “content propositional knowledge”, draws exclusively on sub-scale 2. Factor 3, which is labeled “student/teacher relationship”, accounts for less than five percent of the variance and only has three items that load on it. As

such, it is believed that a subset of the items from the RTOP could be used as an abbreviated

protocol measuring the same constructs as the original. Therefore, for the current study we

will use an abbreviated instrument (aRTOP) composed of items with large loadings onto the

two primary factors (See Table 3).

For the second factor of the aRTOP, focused on the content knowledge related to the

lesson, we include all 5 items from the original Subscale 2, as these items are likely to measure

something different from the remaining 20 items of the original RTOP. For the first factor

of this abbreviated instrument, focused on the inquiry orientation of the lesson, we chose

items that had significant loadings on the first factor, making sure to get items from each of

the related subscales. We also limited this factor to 5 items to match the size of the content

knowledge factor in order to keep this factor from dominating the total scale score.


Table 3

aRTOP Items and Design

Inquiry Orientation
1. The lesson was designed to engage students as members of a learning community.
2. Intellectual rigor, constructive criticism, and the challenging of ideas were valued.
3. This lesson encouraged students to seek and value alternative modes of investigation or of problem solving.
4. Students made predictions, estimations and/or hypotheses and devised means for testing them.
5. The teacher acted as a resource person, working to support and enhance student investigations.

Content Propositional Knowledge
6. The lesson involved fundamental concepts of the subject.
7. The lesson promoted strongly coherent conceptual understanding.
8. The teacher had a solid grasp of the subject matter content inherent in the lesson.
9. Elements of abstraction (i.e., symbolic representations, theory building) were encouraged when it was important to do so.
10. Connections with other content disciplines and/or real world phenomena were explored and valued.

Mathematics Classroom Observation Protocol for Practices

The science-specific language of the RTOP is a major disadvantage when used to observe mathematics classrooms. This, along with the need for an observation protocol that

is supported in recent standards, led to the design of the Mathematics Classroom Observa-

tion Protocol for Practices (MCOP2). The MCOP2 is designed to be implemented in K-16

mathematics classrooms to measure the practices of students and teachers in the mathe-

matics classroom and how they align with the Process Standards of the National Council

of Teachers of Mathematics (National Council of Teachers of Mathematics, 2000); the Stan-

dards for Mathematical Practice from the Common Core State Standards in Mathematics

(National Governors Association Center for Best Practices, Council of Chief State School

Officers, 2010); “Crossroads” and “Beyond Crossroads” from the American Mathematical

Association of Two-Year Colleges (American Mathematical Association of Two-Year Colleges

(AMATYC), 1995, 2004); the Committee on the Undergraduate Program in Mathematics


Curriculum Guide from the Mathematical Association of America (Barker et al., 2004);

and the Conference Board of the Mathematical Sciences statement on “Active Learning in

Post-Secondary Mathematics Education” (2016).

A pilot study was conducted by a graduate student in mathematics and a mathemat-

ics professor at a large southern university to determine if the data collected aligned with

the theoretical constructs and the verification of the expert survey. Based upon instructor

approval, 36 classrooms with 28 different instructors were observed throughout a semester.

The instructors varied widely from graduate teaching assistants to tenured full professors.

The classes they taught also varied from college algebra to upper division mathematics.

The MCOP2 that was used in the pilot study was initially designed to measure three primary components: student engagement; lesson design and implementation; and class culture and discourse. Seventeen of the original 18 items, with full descriptions, were used to measure these three components. Student Engagement contained Items 1-5, Lesson Content contained Items 6-11, and Classroom Culture and Discourse contained Items 12-17 (Gleason

& Cofer, 2014).

After all the data was collected, Gleason and Cofer conducted exploratory factor anal-

ysis (EFA) and classical test theory analysis with some unexpected results. The original

assumption of three components was reexamined after a low eigenvalue was found for the

third factor. Gleason and Cofer report that a factor matrix of a potential 3 Factor Model

indicated Student Engagement and Classroom Culture and Discourse were both loading on

the same factor. These two were combined to create Student Engagement and Classroom

Discourse. The 2 Factor Model explained over 50% of the total variance.

Cronbach’s alpha was also calculated for the entire protocol and both factors. The entire

protocol had a Cronbach’s alpha of 0.898. The sub-scales of “Lesson Content” and “Student

Engagement and Classroom Discourse” had Cronbach’s alpha reliabilities of 0.779 and 0.907,

respectively. Gleason and Cofer (2014) state, “the internal reliabilities are high enough for

both sub-scales and the entire instrument to be used to measure at the group level, either


multiple observations of a single classroom or single observations of multiple classrooms” (p.

99). The overall high alpha coefficient demonstrates that the MCOP2 is measuring a common underlying construct,

and the EFA clearly produces a 2 factor model of “Lesson Content” and “Student Engage-

ment and Classroom Discourse”. Overall this pilot study was very promising, but it was

truly in its beginning stage.

A test of the content was conducted with 164 identified experts in mathematics teacher

education. The first survey provided feedback on the initial 18 MCOP2 items and their

usefulness in measuring various practices of mathematics classrooms (Gleason, Livers, &

Zelkowski, 2017). Over 94% of the experts rated the items as either “essential” or “not

essential, but useful,” rather than “not useful.” After adjusting the MCOP2 items based on

the expert feedback, a second survey was conducted with 26 of the 164 experts that agreed

to provide additional information. This survey provided the experts with more details about

each item, the theoretical constructs, and the intended purpose of the MCOP2. Gleason,

Livers, and Zelkowski (2017) report that 16 of the original 18 items were retained with minimal

revisions, because they all loaded on at least one of the factors. With the information gained

from the experts, the structure of the MCOP2 instrument was revised.

Gleason, Livers, and Zelkowski (2017) also conducted the inter-rater reliability of the

instrument to look at the response processes. Five raters were chosen with a variety of

educational and professional backgrounds. Two of the raters have doctorates in mathematics

education, one rater has a doctorate in mathematics and is heavily involved in mathematics

education research, one rater is a mathematics specialist that works with secondary teachers

and has taught at both the secondary and introductory college level, and the fifth rater is a

graduate student in mathematics with minimal background in education other than teaching

some introductory college math classes.

Five different classroom videos were scored by the five raters. All were given the detailed

descriptors of the items with the rubric prior to the viewing of the videos. All videos were

watched independently by each rater, and no formal training was conducted. To make sure


Table 4

Brief Description of MCOP2 items

1. Students engaged in exploration/investigation/problem solving.
2. Students used a variety of means (models, drawings, graphs, concrete materials, manipulatives, etc.) to represent concepts.
3. Students were engaged in mathematical activities.
4. Students critically assessed mathematical strategies.
5. Students persevered in problem solving.
6. The lesson involved fundamental concepts of the subject to promote relational/conceptual understanding.
7. The lesson promoted modeling with mathematics.
8. The lesson provided opportunities to examine mathematical structure (symbolic notation, patterns, generalizations, conjectures, etc.).
9. The lesson included tasks that have multiple paths to a solution or multiple solutions.
10. The lesson promoted precision of mathematical language.
11. The teacher's talk encouraged student thinking.
12. There were a high proportion of students talking related to mathematics.
13. There was a climate of respect for what others had to say.
14. In general, the teacher provided wait-time.
15. Students were involved in the communication of their ideas to others (peer-to-peer).
16. The teacher uses student questions/comments to enhance mathematical understanding.

there was a good representation of different levels of students and instructors, one video was

chosen from each of K-2, 3-5, 6-8, 9-12, and undergraduate. Gleason, Livers, and Zelkowski

(2017) used the sub-scale score to calculate the intra-class correlation (ICC) and report that

the inter-rater reliability was within acceptable levels.

The final version of the MCOP2 includes only 16 items (Table 4) measuring the two

primary constructs of teacher facilitation, focusing on the interactions that are primarily

dependent upon the teacher, and student engagement, focusing on the interactions that are

primarily dependent upon the students. Before the MCOP2 can be used in the undergrad-

uate classroom with confidence, the MCOP2 needs to be evaluated in other mathematics

classrooms at multiple higher education institutions. The type of institution needs to be

diversified to include liberal arts schools and other research universities.


CHAPTER 3

METHODS

Aim of Study

The aim of this study was to investigate the structural validity and reliability of the

abbreviated Reformed Teaching Observation Protocol and the Mathematics Classroom Ob-

servation Protocol for Practices in the setting of undergraduate mathematics classrooms, with the goal of answering the following research questions:

1. What are the internal structures of the Mathematics Classroom Observation Protocol

for Practices (MCOP2) and the abbreviated Reformed Teaching Observation Protocol

(aRTOP) for the population of undergraduate mathematics classrooms?

2. What are the internal reliabilities of the subscales of the Mathematics Classroom Ob-

servation Protocol for Practices (MCOP2) and the abbreviated Reformed Teaching Ob-

servation Protocol (aRTOP) with respect to undergraduate mathematics classrooms?

3. What are the relationships between the constructs measured by the Mathematics Class-

room Observation Protocol for Practices (MCOP2) and the abbreviated Reformed

Teaching Observation Protocol (aRTOP)?

Sample Description

The procedure for selecting the population is a crucial step in any study. Since the

study used Structural Equation Modeling (SEM) in the analysis, we aimed for a sample size

of 100-150 classroom observations at a variety of institutions (Ding, Velicer, & Harlow, 1995)

with a final sample size of 110 classroom observations.


Although this was on the smaller end of the sample size of 100-150 originally intended,

the literature supports a smaller sample size when necessary. In a Monte Carlo study con-

ducted by Boomsma (1982), she found the widely cited recommendation for sample size to

be at least 100, but 200 was desirable. In the study conducted by Marsh, Hau, Balla, &

Grayson (1998), it was found that a sample size of 100 was sufficient when there were at least

four items per factor, and more was better. Ding, Velicer, & Harlow (1995) recommend a minimum of 3 indicators per factor and a minimum sample size of 100. Schumacker & Lomax (2016) suggest a sample size of 100 to 150 for small models with well-behaved data. Other studies suggest 5 or 10 observations per estimated parameter (Bentler & Chou, 1987) or 10 cases per variable (Nunnally, 1978, p. 355). These rules are convenient, but they do not take into account the specifics of the model and may lead to over- or underestimation of the minimum sample size. The flexibility of Structural Equation Modeling (SEM) makes it hard to generalize about the required sample size.

Although there are numerous studies of sample size, the study conducted by Wolf, Har-

rington, Clark, & Miller (2013) is most analogous to our study. They used Monte Carlo data simulation techniques to evaluate sample size requirements using the maximum likelihood (ML) estimator. The study compares Confirmatory Factor Analysis (CFA) models with different numbers of factors, indicators, and loadings to see what minimum sample size is required to “achieve minimal bias, adequate statistical power, and overall propriety of a

given model” (p. 920). The two-factor model with 6 to 8 indicators is most closely aligned

with our study, because the MCOP2 is a two-factor model with 9 indicators per factor and

the aRTOP is a two-factor model with 5 indicators per factor.

Although increasing the number of latent variables in the model resulted in an increased

minimum sample size, models required a smaller sample size when there were more indicators

per factor and stronger factor loadings. According to Wolf et al. (2013), a two-factor model with 6 indicators required sample sizes of 120 and 100 at factor loadings of .65 and .80, respectively. Similarly, a two-factor model with 8 indicators required sample sizes of 120


and 90 at factor loadings of .65 and .80, respectively. The number of factors, indicators, and loadings of these factors create great variability in the required SEM sample size, as we can see from the study by Wolf et al. (2013). They conclude that a “one size fits all”

approach has problems.

This study used the non-probability sampling method of convenience sampling in or-

der to reduce the relative travel cost and time required to achieve the sample size desired

(Johnson & Christensen, 2014). The chosen sample, to a large degree, represents the general

population of undergraduate mathematics classrooms, because the sample includes a large

number of classroom observations with a wide variety in class type, class size, institution,

demographics, etc.

The investigators observed 110 college mathematics classrooms at the undergraduate

level, with the consent of each instructor, representing a wide variety of college and university classrooms. The faculty members observed ranged in age from 22 and up, with a mixture of genders and ethnic backgrounds.

The American Mathematical Society’s Annual Survey of the Mathematical Sciences

provides a way to group colleges and universities into two distinct classifications based upon

the highest mathematics degree offered at the institution: doctorate-granting universities and

master’s and baccalaureate colleges and universities. For our study, we included three large

southern doctorate-granting universities with enrollments of approximately 18,000 to 35,000.

The percentage of full-time students ranges from 64 to 85, and the percentage of undergraduate students ranges from 62 to 85. These universities are comprised of approximately 49 to 60

percent female students. All three of these universities have students with a wide variety of

ethnic backgrounds.

We also include eight southern master’s and baccalaureate colleges and universities with

enrollments between 1,100 to 15,000 students. Of the eight southern master’s and baccalau-

reate colleges and universities, only two offer a master’s in mathematics. The percentage of full-time students ranges from 43 to 90, and the percentage of undergraduate students ranges from


45 to 100. These colleges and universities are comprised of approximately 50 to 72 percent

female students. All eight colleges and universities have students with a wide variety of

ethnic backgrounds.

We purposefully chose the institutions in this study to avoid atypical demographics.

Any institution with a high representation of one specific demographic was excluded. For

example, student populations composed exclusively or almost exclusively of women were

excluded from this study because they do not represent a typical college population. With

these selections of institutions, we were able to obtain 46 observations at doctorate-granting

universities, 21 observations at master’s universities, and 43 observations at baccalaureate

colleges and universities (See Table 5) in this study to overcome any potential bias due to the

convenience sampling.

Table 5

Number of Observations at each Institution

Institution                    Number of Observations
Doctorate university 1         36
Doctorate university 2         3
Doctorate university 3         7
Master's university 1          10
Master's university 2          11
Baccalaureate university 1     8
Baccalaureate university 2     2
Baccalaureate university 3     11
Baccalaureate university 4     14
Baccalaureate university 5     2
Baccalaureate university 6     6

The actual names of the institutions were not included to protect the privacy and confidentiality of the participants.

In this study, we included 89 lower level and 21 upper level undergraduate mathematics

classrooms. Although there are more lower level than upper level mathematics classrooms

in this study, we feel this is a good representation of the percentage of lower and upper level

mathematics classes offered in a typical college or university semester. The actual classrooms


chosen for observation were selected from faculty members at each institution who elected

to participate in the study.

The colleges and universities in this study were chosen to avoid the overrepresentation or underrepresentation of a specific group, recognizing that we have no control over the instructors who chose to participate. Some instructors did not respond or chose not to participate in this study. The main concern was whether this group of non-responders or non-participants would affect the validity of our study results (Hartman, Fuqua, & Jenkins, 1986). The self-selected nature of this sample will most likely include instructors who have an interest in teaching and learning issues (Hora & Ferrare, 2013a). For instance, teachers who have a student-centered classroom were more likely to respond than teachers who only lecture directly.

Seventy-two mathematics faculty members agreed to participate in this study. Since

some instructors teach two or more completely different courses, a total of 110 observations

were conducted in the Spring 2016, Fall 2016, and Spring 2017 semesters. Only 86 of the 110

observations have instructor demographics data, because 15 instructors did not complete

the demographics survey. Of the 110 classroom observations, 50 were taught by a female

instructor and 60 were taught by a male instructor. The instructors self-identified ages ranging from 18 to 65 and older: 2% were 18-24 years old, 48% were 25-34 years old, 17% were 35-44 years old, 12% were 45-54 years old, 16% were 55-64 years old, and 5% were 65 years and

over. 13% identified as Asian/Pacific Islander, 5% identified as Black or African American,

2% identified as Hispanic American, and 80% identified as White/Caucasian.

Of the 86 classrooms on which we have full demographic data about the instructor, 13

were taught by a Graduate Teaching Assistant, 24 were taught by an Adjunct/Instructor,

24 were taught by an Assistant Professor, 8 were taught by an Associate Professor, and 17

were taught by a Full Professor. The instructors were asked to self-identify their highest level of education: 2% had a Bachelor’s degree, 31% had a Master’s degree, 63% had a PhD,


and 3% had another advanced degree beyond a Master’s degree (e.g., Educational Specialist

(Ed.S.)).

They were also asked to identify how many years they had taught at the high school

level and college level. Over 75% reported only teaching at the high school level for less than

one year. The range of years spent teaching at the college level varied, with 1% teaching for

less than one year, 27% teaching for 1-5 years, 31% teaching for 6-10 years, 13% teaching

for 11-15 years, and 28% teaching over 15 years. A complete list of instructor demographics

is included in Appendix B.

The use of convenience sampling is one of the limitations of this study that must be accounted for. Convenience sampling can lead to the under-representation or over-representation of

a particular group of the sample. Another sampling issue that needs to be accounted for

is the presence of outliers, since convenience sampling is particularly influenced by outliers.

We reduced the likelihood of including classroom observations with unusual data by collecting a large sample from a diverse range of institutions, based on enrollment demographics and types of degrees offered, that reasonably represents the larger population of undergraduate institutions in the United States.

Another limitation of this study is observer bias. Unfortunately, researchers are suscep-

tible to obtaining the results they want to find. Observer biases can be positive or negative.

These biases can be a product of personal experience, environment and/or social and cultural

conditioning. Reflexivity, self-reflection by the researcher on their biases and predispositions,

is the key strategy for avoiding researcher bias (Johnson & Christensen, 2014). Although

it was not possible to remove this potential bias completely, the observer was aware of the

influence that these biases may have and made every effort to avoid it. In order

to help avoid observer bias, the observer read through each protocol item and rubric for each

observation. This helped the observer to make decisions based solely on the rubric outlined

by each protocol and to avoid bringing personal bias into the completion of each protocol.

Johnson & Christensen (2014) suggest, “Complete objectivity being impossible and pure


subjectivity undermining credibility, the researcher’s focus is on balance - understanding and

depicting the world authentically in all its complexity while being self-analytical, politically

aware, and reflexive in consciousness” (p. 420).

Instruments

From the review of the literature, we see there are many ways that we can evaluate

college mathematics instruction. It is impossible to include all the observation protocols

used to evaluate undergraduate classes and so two protocols were chosen to align with the

research questions, an abbreviated form of the Reformed Teaching Observation Protocol for

its widely known use and the Mathematics Classroom Observation Protocol for Practices for

its mathematics specific design.

The Reformed Teaching Observation Protocol (RTOP) is a 25-item protocol designed to be

used for both science and mathematics classroom observations. Piburn et al. (2000) divide

the RTOP into 5 sub-scales in order to test the hypothesis that “Inquiry-Orientation” is

a major part of the structure of the RTOP. One of the sub-scales, procedural pedagogic

knowledge, is a very high predictor of the total score with an R-squared of 0.971, and thus

97.1% of the variance accounted for by the predictor. This result, along with an exploratory

factor analysis finding two strong factors, solidified our idea that an abbreviated version of

the RTOP would produce a similar amount of information as the full instrument. This led

to the creation of the abbreviated Reformed Teaching Observation Protocol (aRTOP) to be

used in this study. (See Table 3 and Appendix C.2)

The theoretical structure of the aRTOP is depicted in Figure 1 and is based on the

results of the RTOP. Inquiry Orientation and Content Propositional Knowledge are the two

theoretical constructs that will be measured with the 10 observed variables. A double arrow

accounts for these two constructs being correlated. The model also contains a stochastic

error term accounting for the influence of unobserved factors.


Figure 1: Theoretical Model of aRTOP

The other observation protocol that we will focus on is the Mathematics Classroom

Observation Protocol for Practices (MCOP2). It is designed to be implemented in K-16

mathematics classrooms to measure mathematics classroom interactions. The MCOP2 mea-

sures the two primary constructs: teacher facilitation and student engagement. Sixteen

items with full descriptions are used to measure these two components. The validity and

reliability of the Mathematics Classroom Observation Protocol for Practices (MCOP2) has

been assessed in numerous ways. A survey of 164 experts in mathematics education was

conducted to test the content of the MCOP2. The results from this survey and a second

follow up survey were used to revise the original 18 MCOP2 items to 16 items. Inter-rater

reliability was also calculated with a panel of five raters of various backgrounds without any

formal training. This resulted in the intra-class correlation (ICC) of 0.669 for the Teacher

Facilitation Sub-scale and 0.616 for the Student Engagement Sub-scale (Gleason et al., 2017).

(See Table 4 and Appendix C.3)


The theoretical structure of the MCOP2 is depicted in Figure 2. There are two theoretical

constructs, student engagement and teacher facilitation, that will be measured with the 16 observed variables. The double arrows between the two theoretical constructs represent the correlation of these two factors. The model also includes residual error terms to account for

the unmeasured variation in the model.

Figure 2: Theoretical Model of MCOP2

Procedures

After receiving approval from the University of Alabama Institutional Review Board

(IRB), we began the recruitment process of the selected institutions of higher education

through their local Institutional Review Boards. Upon approval of the institution to par-


ticipate, an email was sent to all undergraduate mathematics instructors at that institution

informing them that participation in this study was strictly voluntary, but that we would like to observe a class they teach in order to understand the current status of mathematics instruction at the undergraduate level. The email also explained that there was no foreseen risk associated with this study and no individual benefit for the participants. For those who agreed to allow us to observe their classrooms, we confirmed a classroom observation at the teacher’s discretion.

The instructors of the mathematics courses were only asked to allow the investigators

to observe their classroom in order to complete the observation protocol forms. The time

commitment for the participants is to allow the researchers to observe one class period

(usually 50 or 75 minutes) during the Spring 2016, Fall 2016 and Spring 2017 semester, plus

about 5 minutes to read and complete the consent form and demographics form. There are

no other responsibilities for the participant.

For each classroom observation, the investigator arrived early and sat in a seat near the

back of the classroom. The goal as the observer was to blend in with the surroundings so the

students and instructor were not disturbed. The observer completed the Note Taker Form

(See Appendix C) during the lecture. At the conclusion, the observer used the information

collected on the Note Taker Form to complete both protocols. The observer alternated the

order in which the observation protocols were completed to avoid any bias that might be created.

Each classroom observation was given a number (1-200) that corresponds to the sequence in which it was completed. Classroom observations labeled with an odd number (1, 3, 5, ...) were categorized as A, indicating that the aRTOP protocol was completed first. Classroom observations labeled with an even number (2, 4, 6, ...) were categorized as B, indicating that the MCOP2 protocol was completed first. This process was repeated until both protocols

were collected for all classroom observations.
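A minimal sketch of this counterbalancing scheme in R (the object names are illustrative and not the actual study code) is:

observation_id <- 1:110                                                 # observations numbered in the order completed
first_protocol <- ifelse(observation_id %% 2 == 1, "aRTOP", "MCOP2")    # odd: aRTOP first; even: MCOP2 first
table(first_protocol)                                                   # half of the observations begin with each protocol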

Once all data was collected, we tested the theoretical structure for the MCOP2 and

the aRTOP using the statistical language and environment, R. All results are reported in

the aggregate to protect the confidentiality of the teachers. We use the linear regression


coefficients and the fit statistics we gain from the Confirmatory Factor Analysis (CFA) to

test the internal structure of the MCOP2 and the aRTOP with respect to undergraduate

mathematics classrooms. Cronbach’s alpha is used to test the internal reliability of the MCOP2

and the aRTOP with respect to undergraduate mathematics classrooms. We use regression

to assess the association between the constructs measured by the MCOP2 and the aRTOP.

Although Hu & Bentler’s (1999) “rule of thumb” cutoff criteria for fit indexes are widely used today, Marsh, Hau, & Wen (2004) warn of the overgeneralization of Hu and Bentler’s findings. Schermelleh-Engel et al. (2003) include a table of recommendations for model evaluation, but suggest that these cutoff criteria should not be taken too seriously. Table 6 provides an overview of some of the rules of thumb. Hu & Bentler (1998) suggest fit indices can be affected by model misspecification, small-sample bias, violations of normality and independence, and estimation methods. Based on the work by Marsh, Hau, & Wen (2004) and Schermelleh-Engel et al. (2003), it is safe to conclude that a model may fit the data but have one or more fit measures that suggest a bad fit.

As can be seen in Table 6, we have included five fit indices. Chi-squared divided by the degrees of freedom of the model (χ2/df), Root Mean Square Error of Approximation (RMSEA), and Standardized Root Mean Square Residual (SRMR) are classified as descriptive

measures of overall model fit. Comparative Fit Index (CFI) and Goodness of Fit Index

(GFI) are both descriptive measures based on model comparison. Schumacker & Lomax

(2016) suggest reporting χ2, RMSEA, and SRMR in general and then suggest adding extra

descriptive measures based on model comparison if necessary. We have chosen to report

each of these fit indices, because they are said to be less influenced by small sample size

(Schermelleh-Engel et al., 2003).

First let us look briefly at each of the descriptive measures of overall model fit. Chi-squared (χ2) indicates whether the observed and implied variance-covariance matrices are similar or different (Schumacker & Lomax, 2016). χ2 is relative to the degrees of freedom of the model, making it hard to compare across models. Chi-squared (χ2) is also sensitive to sample size;


therefore, we decided to include χ2/df. Dividing by the degrees of freedom allows us to compare and analyze the model fit more easily (Schermelleh-Engel et al., 2003).

Root Mean Square Error of Approximation (RMSEA) is the measure of approximate

fit and is associated with the differences due to approximation error (Schermelleh-Engel et al., 2003). RMSEA takes the model degrees of freedom and sample size into account in its calculation (Schumacker & Lomax, 2016). Therefore, it is relatively independent of sample

size.
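For reference, one common form of the RMSEA computation (stated here for context rather than quoted from the sources above) is RMSEA = sqrt( max(χ2 − df, 0) / (df (N − 1)) ), where N is the sample size; the index therefore measures misfit per degree of freedom while removing the direct dependence of χ2 on sample size.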

The Root Mean Square Residual (RMR) uses the square root of the mean-squared difference

between matrix elements of the sample covariance matrix and the model-implied covariance

matrix (Schumacker & Lomax, 2016). Unfortunately, RMR depends on the size of the

variances and covariances of the observed variables (Schermelleh-Engel et al., 2003). Both

Schumacker & Lomax (2016) and Schermelleh-Engel et al. (2003) suggest reporting the

Standardized Root Mean Square Residual (SRMR), which indicates good fit when it is less than .05.

Now let us look at both descriptive measures based on model comparison that are

included in this study. According to Schermelleh-Engel et al. (2003), the Comparative Fit

Index (CFI) is an adjusted version of the Relative Noncentrality Index (RNI). It was chosen

to be reported in our study because it is not as susceptible to underestimation of the model fit when a small sample size exists. The Goodness of Fit Index (GFI) is based on the ratio of the sum of the squared differences between the sample covariance matrix and the model-implied covariance matrix to the observed variance (Schumacker & Lomax, 2016). Unfortunately, the GFI is

not independent of sample size (Schermelleh-Engel et al., 2003). Researchers will need to

take into account that a low GFI may be due to a small sample and not a poor fit.

Table 6

Recommendations for Model Evaluation: Some Rules of Thumb

Fit Measure     Good Fit                 Acceptable Fit
χ2/df           0 ≤ χ2/df ≤ 2            2 < χ2/df ≤ 3
RMSEA           0 ≤ RMSEA ≤ .05          .05 ≤ RMSEA ≤ .08
SRMR            0 ≤ SRMR ≤ .05           .05 < SRMR ≤ .10
CFI             .97 ≤ CFI ≤ 1.00         .95 ≤ CFI < .97
GFI             .95 ≤ GFI ≤ 1.00         .90 ≤ GFI < .95

Schermelleh-Engel, Moosbrugger, & Muller (2003)

The acceptance level of Cronbach’s alpha depends upon whether the instrument is being used in the early stages of research, as a basic research tool, or as a scale for an individual in a clinical situation (Nunnally, 1978). Alpha values of .7 to .8 are regarded as satisfactory in our case, because we are not looking at the scale of the individual, but with such values, the instruments should be used for preliminary research to guide further understanding of the constructs (Bland & Altman, 1997; Nunnally, 1978). According to Streiner (2003), Nunnally was correct about acceptable alpha levels for research tools, but he warns that an alpha over .90 most likely indicates unnecessary redundancy.


CHAPTER 4

RESULTS

Internal Structure

A confirmatory factor analysis (CFA) was conducted on the data gathered to analyze

the internal structure of the Mathematics Classroom Observation Protocol for Practices

(MCOP2) and the abbreviated Reformed Teaching Observation Protocol (aRTOP) for the

population of undergraduate mathematics classrooms using R version 3.3.0 (2016) with the

lavaan package (Rosseel, 2012). CFA allows us to examine the relationship between the

observed variables and their underlying latent constructs for both the MCOP2 and the

aRTOP. The analysis of the fit indices for the CFA allows us to inspect the model fit for

both observation protocols.
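The following is a minimal sketch of how such a CFA can be specified with the lavaan package; the data frame name, the use of the default maximum likelihood estimator, and the exact variable names are assumptions for illustration and follow the x1-x10 and y1-y16 item labels used below.

library(lavaan)

# aRTOP: two correlated factors measured by items x1-x10
artop_model <- '
  inquiry_orientation =~ x1 + x2 + x3 + x4 + x5
  content_propositional_knowledge =~ x6 + x7 + x8 + x9 + x10
'

# MCOP2: two correlated factors measured by items y1-y16,
# with items y4 and y13 loading on both factors
mcop2_model <- '
  student_engagement =~ y1 + y2 + y3 + y4 + y5 + y12 + y13 + y14 + y15
  teacher_facilitation =~ y4 + y6 + y7 + y8 + y9 + y10 + y11 + y13 + y16
'

# "observations" is an assumed data frame with one row per classroom observation
artop_fit <- cfa(artop_model, data = observations)
mcop2_fit <- cfa(mcop2_model, data = observations)

summary(artop_fit, standardized = TRUE)                                   # standardized loadings, variances, covariance
fitMeasures(mcop2_fit, c("chisq", "df", "rmsea", "srmr", "cfi", "gfi"))   # fit indices of the kind reported below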

As mentioned earlier, the aRTOP has two theoretical constructs, Inquiry Orientation

and Content Propositional Knowledge, measured with the 10 observed variables. Items x1,

x2, x3, x4, and x5 load on Inquiry Orientation and x6, x7, x8, x9, and x10 load on Content

Propositional Knowledge. The aRTOP model with standardized factor loadings as well as

the standardized variance and covariance are included in Figure 3, with the factor loadings

relatively high for eight of the items. The goodness of fit indices for the aRTOP reveal a

poor fit (χ2/df = 3.48, RMSEA = .15, SRMR = .12, CFI = .82, and GFI = .83).

Although some indicator variables were low and modification indices existed, theory led

to the inclusion of these observed variables. For example, item x10, “connection with other

content disciplines and/or real world phenomena were explored and valued,” in the aRTOP

has a factor loading of .03, meaning the item shares almost none of its variance with Content Propositional Knowledge. Although this loading is very low and modification indices suggest removal of


this item, theory tells us that content being connected to the real world or other disciplines is important to Content Propositional Knowledge. Schermelleh-Engel, Moosbrugger, & Muller (2003) support this idea, stating, “one should never modify a model solely on the basis of modification indices, although the program might suggest to do so” (p. 61).

Figure 3: Confirmatory Factor Analysis Results: aRTOP

The MCOP2 has two constructs, Student Engagement and Teacher Facilitation, measured by 16 items. Items y1, y2, y3, y4, y5, y12, y13, y14, and y15 load on Student Engagement, while items y4, y6, y7, y8, y9, y10, y11, y13, and y16 load on Teacher Facilitation. The

MCOP2 model with standardized factor loadings as well as the standardized variance and

covariance are included in Figure 4, with the standardized loadings relatively high for most

items. The goodness of fit indices for the MCOP2 reveal an acceptable fit for three indices

(χ2/df = 1.19, SRMR = .08, and CFI = .90), and a poor fit for the other indices (RMSEA = .09 and GFI = .81). See Table 6 for recommendations for model evaluation.
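The MCOP2 model differs only in that two items load on both factors. A minimal lavaan sketch, again assuming a placeholder data frame (here called mcop2_data) with columns y1 through y16, shows how the cross-loading items y4 and y13 simply appear in both factor definitions:

    # Two-factor MCOP2 model with cross-loading items y4 and y13.
    mcop2_model <- '
      engagement   =~ y1 + y2 + y3 + y4 + y5 + y12 + y13 + y14 + y15
      facilitation =~ y4 + y6 + y7 + y8 + y9 + y10 + y11 + y13 + y16
    '

    fit_mcop2 <- cfa(mcop2_model, data = mcop2_data)
    fitMeasures(fit_mcop2, c("chisq", "df", "rmsea", "srmr", "cfi", "gfi"))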


Figure 4: Confirmatory Factor Analysis Results: MCOP2

Internal Reliability

Cronbach’s alpha (1951) was calculated to analyze the internal reliability of the Math-

ematics Classroom Observation Protocol for Practices (MCOP2) and the abbreviated Re-

formed Teaching Observation Protocol (aRTOP) with respect to undergraduate mathematics

classrooms using R version 3.3.0 (2016) with the Rcmdr package (Fox, 2005, 2017; Fox &

Bouchet-Valat, 2017).
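Although the alphas reported below were obtained through the Rcmdr interface, the coefficient can also be computed directly from Cronbach’s (1951) formula. The following is a minimal sketch, assuming the items of one subscale (for example, x1 through x5 of the aRTOP) are the columns of a data frame; the function name and data frame name are ours.

    # Cronbach's alpha: (k / (k - 1)) * (1 - sum(item variances) / variance(total score))
    cronbach_alpha <- function(items) {
      items <- na.omit(items)
      k <- ncol(items)
      item_vars <- apply(items, 2, var)
      total_var <- var(rowSums(items))
      (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    }

    cronbach_alpha(artop_data[, c("x1", "x2", "x3", "x4", "x5")])   # Inquiry Orientation subscale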

The alpha values for the subscales of the aRTOP were .753 for the Inquiry Orientation Subscale and .605 for the Content Propositional Knowledge Subscale. The Cronbach’s alpha


for the first subscale is near the satisfactory range for basic research given by Nunnally (1978,

p. 245-246), while the second subscale is in the range for preliminary research.

Similarly, the Cronbach’s alpha values for the subscales of the MCOP2 were .888 for

the Student Engagement Subscale and .812 for the Teacher Facilitation Subscale. Both of

these subscales are therefore in the satisfactory range for basic research (Nunnally, 1978, p.

245-246) and are near acceptable levels for individual measurement.

Relationship between the Constructs

Simple Linear Regression analysis was conducted to estimate the relationship between

the constructs measured by the Mathematics Classroom Observation Protocol for Practices

(MCOP2) and the abbreviated Reformed Teaching Observation Protocol (aRTOP). Before conducting the regressions, we first checked the linear regression assumptions. The

assumptions that must be satisfied are (a) linearity of the model is good, (b) distribution of

the error has constant variance (homoscedasticity), (c) the errors are normally distributed,

(d) independent variables are determined without error, and (e) errors are independent

(Mathews, 2005).

Weisberg (2005) suggests that plots of the residuals against other quantities are useful in finding failures of assumptions. The residual plots of Regression Model 1 (see Figure 5) are included below to aid the discussion; a complete set of residual plots for each of the models appears in Appendix D. The first plot, “Residuals versus Fitted,” and the second plot, “Normal Q-Q,” are the most useful in simple regression for determining whether these assumptions are met. In the “Residuals versus Fitted” plot there is no pattern and the red line is fairly flat, which implies that the assumptions of linearity and homoscedasticity are met. In the “Normal Q-Q” plot, the points lie on the diagonal line, pointing to normally distributed errors. The last two assumptions are satisfied by the data collection and study design.
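The residual plots discussed here are the standard diagnostics that base R produces for a fitted linear model. As a minimal sketch, assuming the subscale totals are stored in a data frame named scores with columns engagement and inquiry (placeholder names):

    # Regression Model 1: Student Engagement predicted by Inquiry Orientation.
    model1 <- lm(engagement ~ inquiry, data = scores)

    # "Residuals vs Fitted": a flat trend with no pattern supports linearity and homoscedasticity.
    plot(model1, which = 1)
    # "Normal Q-Q": points along the diagonal support normally distributed errors.
    plot(model1, which = 2)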

Figure 5: Residual Plots of Regression Model 1

A simple linear regression was calculated to predict Regression Model 1: Student Engagement based on Inquiry Orientation (See Figure 6 in Appendix D). A significant regression equation was found (F(1,108) = 271.8, p < .001), with an R2 of .716 and an adjusted R2 of .713. Roughly 72% of the variation in Student Engagement can be explained by Inquiry Orientation. The linear regression equation predicted

(Student Engagement) = 6.671 + 1.11(Inquiry Orientation).

Student Engagement increased 1.11 for each one point increase in Inquiry Orientation.

Cohen (1988) defines a strong correlation as a Pearson’s Product-Moment Correlation of |r| > .5. Based on the results of the study, Student Engagement is strongly and positively related to Inquiry Orientation with r = .846, p < .001.
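Continuing the sketch above, the quantities reported for Regression Model 1 come directly from the model summary and a Pearson correlation test:

    summary(model1)                               # coefficients, t values, R^2, adjusted R^2, F-statistic
    cor.test(scores$engagement, scores$inquiry)   # Pearson's product-moment correlation and p-value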

For Regression Model 2: Student Engagement based on Content Propositional Knowl-

edge (See Figure 7 in Appendix D) a simple linear regression was calculated. A significant

regression equation was found (F(1,108)= 44.8, p < .001), with an R2 of .293 and adjusted R2

of .287. Roughly 29% of the variation in Student Engagement can be explained by Content


Propositional Knowledge. The linear regression equation predicted

(Student Engagement) = 3.93 + 0.957(Content Propositional Knowledge).

Student Engagement increased 0.957 for each one point increase in Content Propositional

Knowledge. Based on the results of the study, Student Engagement is strongly and positively

related to Content Propositional Knowledge with r = .541, p < .001.

For Regression Model 3: Teacher Facilitation based on Inquiry Orientation (See Figure

8 in Appendix D) a simple linear regression was calculated. A significant regression equation

was found (F(1,108)= 213.6, p < .001), with an R2 of .664 and adjusted R2 of .661. Roughly

66% of the variation in Teacher Facilitation can be explained by Inquiry Orientation. The

linear regression equation predicted

(Teacher Facilitation) = 8.73 + .926(Inquiry Orientation).

Teacher Facilitation increased .926 for each one point increase in Inquiry Orientation. Based

on the results of the study, Teacher Facilitation is strongly and positively related to Inquiry

Orientation with r = .815, p < .001.

Also a simple linear regression was calculated to predict Regression Model 4: Teacher

Facilitation based on Content Propositional Knowledge (See Figure 9 in Appendix D). A

significant regression equation was found (F(1,108)= 142.4, p < .001), with an R2 of .569 and

adjusted R2 of .565. Roughly 57% of the variation in Teacher Facilitation can be explained

by Content Propositional Knowledge. The linear regression equation predicted

(Teacher Facilitation) = 1.86 + 1.15(Content Propositional Knowledge).


Teacher Facilitation increased 1.15 for each one point increase in Content Propositional

Knowledge. Based on the results of the study, Teacher Facilitation is strongly and positively

related to Content Propositional Knowledge with r = .75, p < .001.

To predict Regression Model 5: Inquiry Orientation based on Content Propositional

Knowledge (See Figure 10 in Appendix D) a simple linear regression was calculated. A

significant regression equation was found (F(1,108)= 48.8, p < .001), with an R2 of .311 and

adjusted R2 of .305. Roughly 31% of the variation in Inquiry Orientation can be explained

by Content Propositional Knowledge. The linear regression equation predicted

(Inquiry Orientation) = −1.05 + .751(Content Propositional Knowledge).

Inquiry Orientation increased .751 for each one point increase in Content Propositional

Knowledge. Based on the results of the study, Inquiry Orientation is strongly and positively

related to Content Propositional Knowledge with r = .558, p < .001.

A simple linear regression was calculated to predict Regression Model 6: Student Engage-

ment based on Teacher Facilitation (See Figure 11 in Appendix D). A significant regression

equation was found (F(1,108)= 217.3, p < .001), with an R2 of .668 and adjusted R2 of

.665. Roughly 67% of the variation in Student Engagement can be explained by Teacher

Facilitation. The linear regression equation predicted

(Student Engagement) = .462 + .945(Teacher Facilitation).

Student Engagement increased .945 for each one point increase in Teacher Facilitation. Based on the results of the study, Student Engagement is strongly and positively correlated with Teacher Facilitation with r = .817, p < .001.

In summary, we fit a linear regression model for each pair of constructs, with the residual plots indicating that the linear regression assumptions were met (for a summary of the results, see Table 7). A larger proportion of the variance was explained in Regression Model 1,


Regression Model 3, and Regression Model 6. Content Propositional Knowledge was the

common construct between the models that had lower variance explained. The F-statistic

supports these findings with values greater than 200 for Regression Model 1, Regression

Model 3, and Regression Model 6. In Table 8 we see the correlations of over .80 correspond

to Regression Model 1, Regression Model 3, and Regression Model 6.
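The correlations in Table 8 are the pairwise Pearson correlations among the four subscale scores; under the same assumed data frame as above, they can be reproduced in a single call:

    constructs <- scores[, c("inquiry", "content", "engagement", "facilitation")]
    round(cor(constructs), 3)   # pairwise Pearson correlations (Table 8)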

Table 7

Simple Linear Regression Results

Model                 Predictor      Regression     t value      R2      F-statistic
                                     Coefficient    (df = 108)
Regression Model 1    (Intercept)    6.67 ***       10.36        .716    271.8
                      inquiry        1.11 ***       16.49
Regression Model 2    (Intercept)    3.93 *         2.07         .293    44.8
                      content        .957 ***       6.69
Regression Model 3    (Intercept)    8.73 ***       14.41        .664    213.6
                      inquiry        .926 ***       14.61
Regression Model 4    (Intercept)    1.86           1.45         .569    142.4
                      content        1.15 ***       11.93
Regression Model 5    (Intercept)    -1.05          -0.73        .311    48.8
                      content        .751 ***       6.99
Regression Model 6    (Intercept)    .462           0.42         .668    217.3
                      facilitation   .945 ***       14.74

* p < .05, ** p < .01, *** p < .001
Regression Model 1: Student Engagement and Inquiry Orientation
Regression Model 2: Student Engagement and Content Propositional Knowledge
Regression Model 3: Teacher Facilitation and Inquiry Orientation
Regression Model 4: Teacher Facilitation and Content Propositional Knowledge
Regression Model 5: Inquiry Orientation and Content Propositional Knowledge
Regression Model 6: Student Engagement and Teacher Facilitation


Table 8

Pearson’s Product-Moment Correlation

              Inquiry     Content     Engagement   Facilitation
Inquiry       -
Content       .5578761    -
Engagement    .8459433    .5414396    -
Facilitation  .8149469    .7540780    .8173191     -

p < .001 for all correlations.


CHAPTER 5

DISCUSSION

The improvement of Science, Technology, Engineering, and Mathematics (STEM) un-

dergraduate education is on the minds of faculty and staff at colleges and universities around

the United States. Every day we, as educators, are challenged by our departments and uni-

versities to make advances in the classroom, but how do we know if the changes we make

positively impact our students? Peer evaluation, student evaluations, and portfolio assess-

ment are the primary methods of formative and summative assessment instructors have to

evaluate their classroom. Although each of these methods is useful, they can be riddled with

subjective information that can distort the picture of what is happening in an undergraduate

classroom.

Observation protocols like the Mathematics Classroom Observation Protocol for Prac-

tices (MCOP2) and the abbreviated Reformed Teaching Observation Protocol (aRTOP) are

a more objective way for an instructor to analyze their classroom. Before these observation

protocols could be used in the classroom with confidence, a study needed to be conducted to

examine both the aRTOP and the MCOP2. Although this study needs to be repeated and

extended to further validate the use of observation protocols in the classroom, the findings

have led to some conclusions on the internal structure, internal reliability, and the relation-

ship between the constructs measured by the observation protocols.

Study Limitations

While the current study provides useful information, there are several limitations that

must be mentioned. The use of convenience sampling is one limitation of this study. This


sampling technique was unavoidable because of time and financial constraints. One major

concern with the use of a convenience sample is the inclusion of outliers that may skew the

data. Our sample was chosen to avoid including classroom observations likely to give us

unusual data. Every effort was made to include colleges and universities that are from a

diverse range of institutions based on enrollment demographics and types of degrees offered

that reasonably represent the larger population of undergraduate institutions in the United

States.

Positive or negative observer bias is another limitation of this study. Reflexivity was

used by the observer as outlined by Johnson & Christensen (2014). The observer spent time

reflecting about her own biases and predispositions to include a strategy for avoidance. The

observer also read through each protocol item and rubric to help the observer make decisions

based on only what happened in the classroom. Although it is not possible to remove the

potential for bias completely, the observer made a conscious effort to evade its influence.

Another limitation to this study is the effect of sample size on fit indices. The studies

conducted by Hu and Bentler (1998, 1999) show how different fit indices are affected by sample size under both true-population and misspecified models. We chose to include only fit

indices that are less likely to be influenced by sample size. As an unavoidable limitation, we

were careful when using the fit indices to decide if a model was supported by the data.

Conclusion

Confirmatory Factor Analysis (CFA) was conducted on the data gathered to analyze

the internal structure of the Mathematics Classroom Observation Protocol for Practices

(MCOP2) and the abbreviated Reformed Teaching Observation Protocol (aRTOP) for the

population of undergraduate mathematics classrooms. Factor loadings for the aRTOP were

relatively high for eight of the items. Although two items of the aRTOP did not have

high factor loadings, we included these items in our final model because of the theoretical

support for what should be happening in an undergraduate mathematics classroom based


upon national recommendations. The aRTOP fit indices produced from the CFA reveal a

poor fit. The factor loadings for the MCOP2 were relatively high for most items and the

items that did not have high loadings were included, because the theoretical model is based

on mathematics educational standards. Two of the three descriptive measures of overall

model fit were in the satisfactory range and one of the two descriptive measures based on

model comparison was in the satisfactory range for the MCOP2. Our findings point to a

more consistent internal structure for the MCOP2 than the aRTOP.

Therefore, the Confirmatory Factor Analysis supports the previous Exploratory Factor

Analysis on the MCOP2 (Gleason & Cofer, 2014). We can clearly see that the MCOP2 is a

two factor model with almost all observed variables having high factor loadings. The three

acceptable fit indices show that the measures of Student Engagement and Teacher Facilitation

are consistent with our theoretical understanding of the model. Although we would have

liked higher factor loadings and fit indices, we can still confirm the theoretical model for the

MCOP2.

The CFA for the abbreviated Reformed Teaching Observation Protocol (aRTOP) did

not align with the original design of the Reformed Teaching Observation Protocol (RTOP)

(Piburn & Sawada, 2000). We could see from the original design that a two factor model

with a reduced number of items would produce the same results. The factor loadings of

the current study support a two factor model with most observed variables having high

factor loadings. The poor fit indices show that the measures of Inquiry Orientation and Content Propositional Knowledge are only somewhat consistent with our theoretical model of the aRTOP.

We would have liked higher factor loadings and fit indices for the aRTOP. With the current

design of the aRTOP and results from the CFA, we do not find support for the aRTOP as

an observation protocol for undergraduate mathematics.

To analyze the internal reliability, that is, the strength of internal consistency, Cronbach’s alpha (1951)

was calculated for each subscale of both the Mathematics Classroom Observation Proto-

col for Practices (MCOP2) and the abbreviated Reformed Teaching Observation Protocol


(aRTOP) with respect to undergraduate mathematics classrooms. Using Nunnally’s (1978)

acceptable range for Cronbach’s alpha, we were able to assess the alpha for each subscale.

When we examined the aRTOP, we found Inquiry Orientation to have near satisfactory in-

ternal reliability for basic research, and Content Propositional Knowledge was outside the

satisfactory range and only acceptable for preliminary research. We found both Student En-

gagement and Teacher Facilitation to have satisfactory internal reliability for basic research

and near the acceptable level for individual measure.

Therefore, for each subscale, the satisfactory internal reliability of the Mathematics Classroom Observation Protocol for Practices (MCOP2) demonstrates that the instrument is measuring a coherent construct and producing consistent scores. When we look at each factor indi-

vidually, the Student Engagement part of the MCOP2 instrument successfully gauges the

role of the student in an undergraduate mathematics classroom and their engagement in the

classroom environment. The high internal reliability of the Teacher Facilitation part of the

MCOP2 indicates the instrument is also successfully measuring the role of the instructor in

creating the structure and guidance in the classroom.

The abbreviated Reformed Teaching Observation Protocol (aRTOP) did not have as high an internal reliability. Given the below-satisfactory alphas for each subscale, we can only say the aRTOP is measuring something and is somewhat consistent in its scores. Since neither factor was in the satisfactory range for basic research, we do not find support for the aRTOP as an observation protocol for undergraduate mathematics. The data analysis indicates that the MCOP2 has higher internal reliability than the aRTOP.

Theoretically, Inquiry Orientation, Content Propositional Knowledge, Student Engagement, and Teacher Facilitation are related, but distinct, with respect to undergraduate mathematics classrooms. To validate this theory, a Simple Linear Regression analysis was

conducted to estimate the relationship between the constructs measured by the Mathemat-

ics Classroom Observation Protocol for Practices (MCOP2) and the abbreviated Reformed

Teaching Observation Protocol (aRTOP). The relationship between MCOP2 and aRTOP


was also found to be significant. The Pearson’s Product-Moment Correlations showed that each pair of constructs is strongly correlated.

Therefore, for the constructs that have the highest correlations, we can make some strong conclusions. Mathematically, we have found that student engagement is directly related

to the idea of an inquiry oriented classroom. Theoretically, when students are engaged,

an inquiry oriented classroom is possible. And conversely, an inquiry oriented classroom

means that the students are actively engaged in the learning community. Similarly, there is

a high correlation between teacher facilitation and an inquiry oriented classroom. Without the instructor's facilitation, a classroom could not be a community of learners, and the converse is

also true. Since both student engagement and teacher facilitation are highly correlated with

inquiry orientation, it is not hard to see why mathematically we found that student engage-

ment and teacher facilitation are also strongly correlated. Theoretically, the facilitation of

the teacher leads to an engaged body of students and the converse also follows.

We noticed the subscale, Content Propositional Knowledge, was the common construct

between the regression models that had lower variance explained. This leads us to believe

that Content Propositional Knowledge is measuring something completely different from the

other subscales. However, the data analysis suggests content propositional knowledge needs

to be assessed using another method besides the aRTOP due to the low internal reliability

of this construct on the aRTOP.

Despite some limitations, the current study produced some important findings. The internal structures of the aRTOP and MCOP2 were assessed using factor loadings and fit indices. The MCOP2 had relatively high factor loadings for most items. Three

of the fit indices for the MCOP2 were found to be in the acceptable range while none of the fit

indices for the aRTOP were acceptable. A decision was made not to modify the theoretical

model for either protocol because the deletion of items from each protocol would lead to

a decrease in the information gained from the undergraduate mathematics classroom. The

internal reliability of the aRTOP has been found to be below satisfactory and the internal


reliability of the MCOP2 has been found to be highly satisfactory. We found a positive

and strong correlation between each pair of constructs with a higher correlation between

subscales that do not contain Content Propositional Knowledge. We found that the MCOP2

had a stronger internal structure and internal reliability than the aRTOP. We also found

that the theoretical relationships we had assumed between the constructs were supported by the linear regressions we conducted.

Therefore, the absence of support for the structure of the aRTOP leaves us without confidence in what the protocol is measuring. We find higher confidence in the support

for the structure of the MCOP2. The internal reliability was also found to be higher for the

MCOP2, pointing to the protocol’s consistency. A high or low observation protocol score does

not just happen by chance with the MCOP2. The high correlation between the subscales that do not include Content Propositional Knowledge tells us that it is reasonable to infer that the two observation protocols are measuring the same classrooms in the same way, except for the Content Propositional Knowledge subscale of the aRTOP. Since the MCOP2 has a stronger internal structure and internal reliability, we see no need to use both protocols to measure the same thing when the MCOP2 is more successful at assessing the undergraduate mathematics classroom. The Content Propositional Knowledge subscale is measuring something completely different from the other subscales, but not very successfully, and it needs to be assessed using another method besides the aRTOP. With confidence in what we are measuring with the MCOP2, consistency in the MCOP2, and correlation among the subscales, we find support for the Mathematics Classroom Observation Protocol for Practices (MCOP2) as a useful assessment tool for undergraduate mathematics classrooms and conclude that the abbreviated Reformed Teaching Observation Protocol (aRTOP) is not necessary.


Future Direction

Future research should seek to extend the current study to a broader sampling commu-

nity. Although the current sample size was adequate, a larger sample with more colleges

and universities included from a broader geographic region could lead to a deeper under-

standing of Mathematics Classroom Observation Protocol for Practices (MCOP2) and the

abbreviated Reformed Teaching Observation Protocol (aRTOP). Increasing the sample size

will allow the researcher to answer more comparative questions about the populations and

institutions included. For example, it would be interesting to compare how different types of

institutions perform with both observation protocols. With a larger sample size, you could

also compare how different job titles, highest level of education, genders, age, and years

of teaching relate to the constructs. Although we focused on undergraduate mathematics

education in this study, with a larger sample size you could look at how these observation

protocols perform at additional education levels as both protocols are designed to be used for

K-16. The applications of an extension of this study are numerous and would help contribute

to a better understanding of the undergraduate mathematics classroom.


References

Abrami, P. C. (2001). Improving judgments about teaching effectiveness using teacher rating

forms. New Directions for Institutional Research, 2001 (109), 59-87.

Abrami, P. C., & d’Apollonia, S. (1990). The dimensionality of ratings and their use in

personnel decisions. New Directions for Teaching and Learning , 1990 (43), 97-111.

Aleamoni, L. M. (1981). Student ratings of instruction. In J. Millman (Ed.), Handbook of

Teacher Evaluation (p. 110-145). Beverly Hills, CA: Sage.

Algozzine, B., Gretes, J., Flowers, C., Howley, L., Beattie, J., Spooner, F., . . . Bray, M.

(2004). Student evaluation of college teaching: A practice in search of principles.

College Teaching , 52 (4), 134-141.

Allen, J., Gregory, A., Mikami, A., Lun, J., Hamre, B., & Pianta, R. (2013). Observations

of effective teacher-student interactions in secondary school classrooms: Predicting

student achievement with the classroom assessment scoring system-secondary. School

Psychology Review , 42 (1), 76-98.

American Mathematical Association of Two-Year Colleges (AMATYC). (1995). Cross-

roads in mathematics: Standards for introductory college mathematics before calculus.

(D. Cohen, Ed.). Memphis, TN: American Mathematical Association of Two Year

Colleges.

American Mathematical Association of Two-Year Colleges (AMATYC). (2004). Beyond

Crossroads: Implementing mathematics standards in the first two years of college

(R. Blair, Ed.). Memphis, TN: American Mathematical Association of Two Year

Colleges.


Apodaca, P., & Grad, H. (2005). The dimensionality of student ratings of teaching: In-

tegration of uni-and multidimensional models. Studies in Higher Education, 30 (6),

723-748.

Ball, D. L., Thames, M. H., & Phelps, G. (2008). Content knowledge for teaching: What

makes it special? Journal of Teacher Education, 59 (5), 389-407.

Ballantyne, C. (2003). Online evaluations of teaching: An examination of current practice

and considerations for the future. New Directions for Teaching and Learning , 2003 (96),

103-112.

Barker, W., Bressoud, D., Epp, S., Ganter, S., Haver, B., & Pollatsek, H. (2004). Under-

graduate programs and courses in the mathematical sciences: CUPM curriculum guide,

2004. Washington, D.C.: Mathematical Association of America.

Bentler, P. M., & Chou, C.-P. (1987). Practical issues in structural modeling. Sociological

Methods & Research, 16 (1), 78–117.

Benton, S. L., & Cashin, W. E. (2012). Student ratings of teaching: A summary of research

and literature (IDEA Paper No. 50). The Idea Center. Retrieved 12/07/2015, from

http://ideaedu.org/wp-content/uploads/2014/11/idea-paper 50.pdf

Bernstein, D. J., Jonson, J., & Smith, K. (2000). An examination of the implementation of

peer review of teaching. New Directions for Teaching and Learning , 2000 (83), 73-86.

Bland, J. M., & Altman, D. G. (1997). Statistics notes: Cronbach’s alpha. BMJ , 314 (7080),

572.

Boomsma, A. (1982). The robustness of LISREL against small sample sizes in factor analysis

models. In K. G. Jöreskog & H. Wold (Eds.), Systems under indirect observation:

Causality, structure, prediction (Vol. 1, pp. 149–173). North-Holland.

Bowes, A. S., & Banilower, E. R. (2004). LSC classroom observation study: An analysis of

data collected between 1997 and 2003. Chapel Hill, NC: Horizon Research, Inc.

Boyer Commission on Educating Undergraduates in the Research University. (1998). Rein-

venting undergraduate education: A blueprint for America’s research universities.


(Tech. Rep.). Stony Brook, NY: State University of New York at Stony Brook for

the Carnegie Foundation for the Advancement of Learning.

Bullock, C. D. (2003). Online collection of midterm student feedback. New Directions for

Teaching and Learning , 2003 (96), 95-102.

Burdsal, C. A., & Harrison, P. D. (2008). Further evidence supporting the validity of both a

multidimensional profile and an overall evaluation of teaching effectiveness. Assessment

& Evaluation in Higher Education, 33 (5), 567-576.

Burns, C. W. (2000). Teaching portfolios: Another perspective. Academe, 86 (1), 44-47.

Cashin, W. E. (1995). Student ratings of teaching: The research revisited (IDEA Paper

No. 32). The Idea Center. Retrieved 12/07/2015, from http://www.clemson.edu/

oirweb1/CourseEvalHelp/StudentRatingsResearch1995.pdf

Centra, J. A. (1993). Reflective faculty evaluation: Enhancing teaching and determining

faculty effectiveness. San Francisco: Jossey-Bass.

Centra, J. A. (2003). Will teachers receive higher student evaluations by giving higher grades

and less course work? Research in Higher Education, 44 (5), 495-518.

Centra, J. A. (2009). Differences in responses to the student instructional report: Is it bias?

Princeton, NJ: Educational Testing Service.

Centra, J. A., & Gaubatz, N. B. (2000). Is there gender bias in student evaluations of

teaching? The Journal of Higher Education, 71 (1), 17-33.

Chen, Y., & Hoshower, L. B. (2003). Student evaluation of teaching effectiveness: An

assessment of student perception and motivation. Assessment & Evaluation in Higher

Education, 28 (1), 71-88.

Cheung, D. (2000). Evidence of a single second-order factor in student ratings of teaching

effectiveness. Structural Equation Modeling , 7 (3), 442-460.

Clayson, D. E. (2009). Student evaluations of teaching: Are they related to what students

learn? A meta-analysis and review of the literature. Journal of Marketing Education,

31 (1), 16-30.


Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale,

N.J. : L. Erlbaum Associates.

Collins, J. W., & O’Brien, N. P. (2003). The Greenwood dictionary of education. Westport,

Connecticut: Greenwood Press.

Conference Board of the Mathematical Sciences. (2016). Active learning in post-secondary

mathematics education. Washington DC: Author.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika,

16 (3), 297–334.

d’Apollonia, S., & Abrami, P. C. (1997). Navigating student ratings of instruction. American

Psychologist , 52 (11), 1198-1208.

Davis, B. G. (2009). Tools for teaching (2nd ed.). San Francisco, CA: Jossy-Bass.

Dayton Regional STEM Center. (2011). Reformed Teaching Observation Protocol

(RTOP) with accompanying Dayton Regional STEM Center rubric. Retrieved

12/7/2015, from http://daytonregionalstemcenter.org/wp-content/uploads/

2012/09/rtop\ with\ rubric\ smp-1.pdf

Ding, L., Velicer, W. F., & Harlow, L. L. (1995). Effects of estimation methods, number

of indicators per factor, and improper solutions on structural equation modeling fit

indices. Structural Equation Modeling: A Multidisciplinary Journal , 2 (2), 119–143.

Dommeyer, C. J., Baum, P., Hanna, R. W., & Chapman, K. S. (2004). Gathering faculty

teaching evaluations by in-class and online surveys: Their effects on response rates and

evaluations. Assessment & Evaluation in Higher Education, 29 (5), 611-623.

Eiszler, C. F. (2002). College students’ evaluations of teaching and grade inflation. Research

in Higher Education, 43 (4), 483-501.

Ellis, J. F. (2014). Preparing future college instructors: The role of Graduate Student Teach-

ing Assistants (GTAs) in successful college calculus programs (Unpublished doctoral

dissertation). University of California, San Diego.


Ellis, L., Burke, D. M., Lomire, P., & McCormack, D. R. (2003). Student grades and average

ratings of instructional quality: The need for adjustment. The Journal of Educational

Research, 97 (1), 35-40.

Feldman, K. A. (1977). Consistency and variability among college students in rating their

teachers and courses: A review and analysis. Research in Higher Education, 6 (3),

223-274.

Feldman, K. A. (1978). Course characteristics and college students’ ratings of their teachers:

What we know and what we don’t. Research in Higher Education, 9 (3), 199-242.

Feldman, K. A. (1993). College students’ views of male and female college teachers: Part II

-Evidence from students’ evaluations of their classroom teachers. Research in Higher

Education, 34 (2), 151-211.

Feldman, K. A. (2007). Identifying exemplary teachers and teaching: Evidence from stu-

dent ratings. In R. P. Perry & J. C. Smart (Eds.), The Scholarship of Teaching and

Learning in Higher Education: An Evidence-Based Perspective (p. 93-143). Springer

Netherlands.

Flick, L. B., Sadri, P., Morrell, P. D., Wainwright, C., & Schepige, A. (2009). A cross

discipline study of reformed teaching by university science and mathematics faculty.

School Science and Mathematics , 109 (4), 197-211.

Fox, J. (2005). The R Commander: A basic statistics graphical user interface to R. Journal

of Statistical Software, 14 (9), 1–42. Retrieved from http://www.jstatsoft.org/v14/

i09

Fox, J. (2017). Using the R Commander: A point-and-click interface for R. Boca Raton

FL: Chapman and Hall/CRC Press. Retrieved from http://socserv.mcmaster.ca/

jfox/Books/RCommander/

Fox, J., & Bouchet-Valat, M. (2017). Rcmdr: R Commander [Computer software man-

ual]. Retrieved from http://socserv.socsci.mcmaster.ca/jfox/Misc/Rcmdr/ (R

package version 2.3-2)


Freeman, S., Eddy, S. L., McDonough, M., Smith, M. K., Okoroafor, N., Jordt, H., & Wen-

deroth, M. P. (2014). Active learning increases student performance in science, engi-

neering, and mathematics. Proceedings of the National Academy of Sciences , 111 (23),

8410–8415.

Gasiewski, J. A., Eagan, M. K., Garcia, G. A., Hurtado, S., & Chang, M. J. (2012).

From gatekeeping to engagement: A multicontextual, mixed method study of student

academic engagement in introductory STEM courses. Research in Higher Education,

53 (2), 229–261.

Gleason, J., & Cofer, L. D. (2014). Mathematics classroom observation protocol for practices

results in undergraduate mathematics classrooms. In T. Fukawa-Connelly, G. Karakok,

K. Keene, & M. Zandieh (Eds.), Proceedings of the 17th Annual Conference on Research

on Undergraduate Mathematics Education, 2014, Denver, CO (p. 93-103).

Gleason, J., Livers, S., & Zelkowski, J. (2015). Mathematics Classroom Observation

Protocol for Practices: Descriptors manual. Retrieved 12/7/2015, from http://

jgleason.people.ua.edu/mcop2.html

Gleason, J., Livers, S., & Zelkowski, J. (2017). Mathematics Classroom Observation Protocol

for Practices (MCOP2): A validation study. Investigations in Mathematics Learning ,

9 .

Grossman, P. L., Wilson, S. M., & Shulman, L. S. (1989). Teachers of substance: Sub-

ject matter knowledge for teaching. In M. Reynolds (Ed.), The Knowledge Base for

Beginning Teachers (p. 23-36). New York: Pergamon.

Hamermesh, D. S., & Parker, A. (2005). Beauty in the classroom: Professors’ pulchritude

and putative pedagogical productivity. Economics of Education Review , 24 (4), 369–

376.

Hartman, B. W., Fuqua, D. R., & Jenkins, S. J. (1986). The problems of and remedies

for nonresponse bias in educational surveys. The Journal of Experimental Education,

54 (2), 85–90.


Hatzipanagos, S., & Lygo-Baker, S. (2006). Teaching observations: Promoting development

through critical reflection. Journal of Further and Higher Education, 30 (4), 421-431.

Hill, H. C., Blunk, M. L., Charalambous, C. Y., Lewis, J. M., Phelps, G. C., Sleep, L., &

Ball, D. L. (2008). Mathematical knowledge for teaching and the mathematical quality

of instruction: An exploratory study. Cognition and Instruction, 26 (4), 430-511.

Hora, M. T. (2013). Exploring the use of the Teaching Dimensions Observation Protocol to

develop fine-grained measures of interactive teaching in undergraduate science class-

rooms (WCER Working Paper No. 2013-6). Retrieved 12/10/2015, from http://www

.wcer.wisc.edu/publications/workingpapers/Working Paper No 2013 06.pdf

Hora, M. T., & Ferrare, J. J. (2013a). Instructional systems of practice: A multidimensional

analysis of math and science undergraduate course planning and classroom teaching.

Journal of the Learning Sciences , 22 (2), 212–257.

Hora, M. T., & Ferrare, J. J. (2013b). A review of classroom observation techniques

in postsecondary settings (WCER Working Paper No. 2013-01). Wisconsin Center

for Education Research. Retrieved 12/7/2015, from http://www.wcer.wisc.edu/

publications/workingpapers/Working Paper No 2013 01.pdf

Hora, M. T., Oleson, A., & Ferrare, J. J. (2013). Teaching Dimensions Observation Pro-

tocol (TDOP) user’s manual. Madison, WI. Retrieved 12/7/2015, from http://

tdop.wceruw.org/

Hoyt, D. P., & Lee, E.-J. (2002). Basic data for the revised IDEA system (IDEA Tech-

nical Report No. 12). Retrieved 12/7/2015, from http://ideaedu.org/wp-content/

uploads/2014/11/techreport-12.pdf

Hu, L.-t., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity

to underparameterized model misspecification. Psychological Methods , 3 (4), 424.

Hu, L.-t., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure

analysis: Conventional criteria versus new alternatives. Structural Equation Modeling:

A Multidisciplinary Journal , 6 (1), 1–55.


Jackson, D. L., Teal, C. R., Raines, S. J., Nansel, T. R., Force, R. C., & Burdsal, C. A.

(1999). The dimensions of students’ perceptions of teaching effectiveness. Educational

and Psychological Measurement , 59 (4), 580-596.

Johnson, B., & Christensen, L. (2014). Educational research: Quantitative, qualitative, and

mixed approaches (5th ed.). Thousand Oaks, CA: Sage.

Keig, L., & Waggoner, M. D. (1994). Collaborative peer review: The role of faculty in

improving college teaching. Washington, D.C.: The George Washington University

School of Education and Human Development. (ASHE-ERIC Higher Education Report

No. 2.)

Kim, M. (2011). Differences in beliefs and teaching practices between international and US

domestic mathematics teaching assistants (Unpublished doctoral dissertation). The

University of Oklahoma.

Kohut, G. F., Burnap, C., & Yon, M. G. (2007). Peer observation of teaching: Perceptions

of the observer and the observed. College Teaching , 55 (1), 19-25.

Krautmann, A. C., & Sander, W. (1999). Grades and student evaluations of teachers.

Economics of Education Review , 18 (1), 59-63.

Kulik, J. A. (2001). Student ratings: Validity, utility, and controversy. New Directions for

Institutional Research, 2001 (109), 9-25.

Kung, D., & Speer, N. (2007). Mathematics teaching assistants learning to teach: Recast-

ing early teaching experiences as rich learning opportunities. In M. Oehrtman (Ed.),

Proceedings of the 10th annual Conference on Research in Undergraduate Mathematics

Education.

Laverie, D. A. (2002). Improving teaching through improving evaluation: A guide to course

portfolios. Journal of Marketing Education, 24 (2), 104-113.

Leung, D. Y., & Kember, D. (2005). Comparability of data gathered from evaluation

questionnaires on paper and through the internet. Research in Higher Education,

46 (5), 571-591.


Marsh, H. W. (1984). Students’ evaluations of university teaching: Dimensionality, relia-

bility, validity, potential baises, and utility. Journal of Educational Psychology , 76 (5),

707-754.

Marsh, H. W. (2001). Distinguishing between good (useful) and bad workloads on students’

evaluations of teaching. American Educational Research Journal , 38 (1), 183-212.

Marsh, H. W., Hau, K.-T., Balla, J. R., & Grayson, D. (1998). Is more ever too much? the

number of indicators per factor in confirmatory factor analysis. Multivariate Behavioral

Research, 33 (2), 181–220.

Marsh, H. W., Hau, K.-T., & Wen, Z. (2004). In search of golden rules: Comment

on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers

in overgeneralizing Hu and Bentler’s (1999) findings. Structural Equation Modeling ,

11 (3), 320–341.

Marsh, H. W., & Hocevar, D. (1991). Students’ evaluations of teaching effectiveness: The

stability of mean ratings of the same teachers over a 13-year period. Teaching and

Teacher Education, 7 (4), 303-314.

Marsh, H. W., & Roche, L. A. (2000). Effects of grading leniency and low workload on

students’ evaluations of teaching: Popular myth, bias, validity, or innocent bystanders?

Journal of Educational Psychology , 92 (1), 202.

Marsh, H. W., & Ware, J. E. (1982). Effects of expressiveness, content coverage, and

incentive on multidimensional student rating scales: New interpretations of the Dr.

Fox effect. Journal of Educational Psychology , 74 (1), 126.

Marshall, J. C., Smart, J., & Horton, R. M. (2010). The design and validation of EQUIP:

An instrument to assess inquiry-based instruction. International Journal of Science

and Mathematics Education, 8 (2), 299-321.

Mathews, P. G. (2005). Design of experiments with MINITAB. Milwaukee, WI: ASQ Quality

Press.

McKeachie, W. J. (1979). Student ratings of faculty: A reprise. Academe, 65 (6), 384-397.


Mehdizadeh, M. (1990). Loglinear models and student course evaluations. The Journal of

Economic Education, 21 (1), 7-21.

Merritt, D. J. (2008). Bias, the brain, and student evaluations of teaching. St. John’s Law

Review , 82 (1), 235–287.

Michael, J. (2006). Where’s the evidence that active learning works? Advances in Physiology

Education, 30 (4), 159–167.

Morrell, P. D., Wainwright, C., & Flick, L. (2004). Reform teaching strategies used by

student teachers. School Science and Mathematics , 104 (5), 199-213.

National Council of Teachers of Mathematics. (2000). Principles and standards for school

mathematics. Reston, VA: National Council of Teachers of Mathematics.

National Governors Association Center for Best Practices, Council of Chief State School

Officers. (2010). Common Core State Standards Mathematics. Washington D.C.: Na-

tional Governors Association Center for Best Practices, Council of Chief State School

Officers. Retrieved 12/7/2015, from http://www.corestandards.org/Math

National Research Council. (1996). From analysis to action: Undergraduate education

in science, mathematics, engineering, and technology. Washington, D.C.: National

Academies Press.

National Research Council. (1999). Transforming undergraduate education in science, mathe-

matics, engineering, and technology. Washington, D.C.: The National Academy Press.

National Research Council. (2002). Evaluating and improving undergraduate teaching in

science, technology, engineering, and mathematics (M. A. Fox & N. Hackerman, Eds.).

Washington, D.C.: National Academies Press.

National Research Council. (2012). Discipline-based education research: Understand-

ing and improving learning in undergraduate science and engineering (S. R. Singer,

N. R. Nielsen, H. A. Schweingruber, et al., Eds.). Washington, D.C.: National

Academies Press.


National Science Foundation. (1996). Shaping the future: New expectations for undergrad-

uate education in science, mathematics, engineering, and technology. Arlington, VA:

Author. (NSF 96-139)

National Science Foundation. (1998). Information technology: Its impact on undergradu-

ate education in science, mathematics, engineering, and technology. Arlington, VA:

Author. (NSF 98-82)

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Pianta, R. C., & Hamre, B. K. (2009). Conceptualization, measurement, and improvement

of classroom processes: Standardized observation can leverage capacity. Educational

Researcher , 38 (2), 109–119.

Piburn, M., & Sawada, D. (2000). Reformed Teaching Observation Protocol (RTOP): Ref-

erence manual. Tempe, Arizona. Retrieved 12/7/2015, from http://files.eric.ed

.gov/fulltext/ED447205.pdf

President’s Council of Advisors on Science and Technology. (2012). Engage to excel: Pro-

ducing one million additional college graduates with degrees in science, technology,

engineering, and mathematics. report to the President. Washington, D.C.: Executive

Office of the President.

R Core Team. (2016). R: A language and environment for statistical computing [Computer

software manual]. Vienna, Austria. Retrieved from https://www.R-project.org/

Remmers, H. H. (1928). The relationship between students’ marks and student attitude

toward instructors. School & Society , 28 , 759-760.

Remmers, H. H. (1930). To what extent do grades influence student ratings of instructors?

The Journal of Educational Research, 21 , 314-316.

Remmers, H. H., & Brandenburg, G. C. (1927). Experimental data on the Purdue ratings

scale for instructors. Educational Administration and Supervision, 13 , 519-527.


Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of

Statistical Software, 48 (2), 1–36. Retrieved from http://www.jstatsoft.org/v48/

i02/

Sawada, D., Piburn, M. D., Judson, E., Turley, J., Falconer, K., Benford, R., & Bloom,

I. (2002). Measuring reform practices in science and mathematics classrooms: The

Reformed Teaching Observation Protocol. School Science and Mathematics , 102 (6),

245-253.

Schermelleh-Engel, K., Moosbrugger, H., & Muller, H. (2003). Evaluating the fit of struc-

tural equation models: Tests of significance and descriptive goodness-of-fit measures.

Methods of Psychological Research Online, 8 (2), 23–74.

Schumacker, R., & Lomax, R. (2016). A beginner’s guide to structural equation modeling

(4th ed.). Taylor & Francis.

Seldin, P. (2000). Teaching portfolios: A positive appraisal. Academe, 86 (1).

Seldin, P., & Miller, J. E. (2009). The academic portfolio: A practical guide to documenting

teaching, research, and service (Vol. 132). John Wiley & Sons.

Seymour, E. (2002). Tracking the processes of change in US undergraduate education in

science, mathematics, engineering, and technology. Science Education, 86 (1), 79-105.

Shevlin, M., Banyard, P., Davies, M., & Griffiths, M. (2000). The validity of student

evaluation of teaching in higher education: Love me, love my lectures? Assessment &

Evaluation in Higher Education, 25 (4), 397-405.

Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational

Researcher , 4-14.

Smith, M. K., Jones, F. H., Gilbert, S. L., & Wieman, C. E. (2013). The classroom obser-

vation protocol for undergraduate STEM (COPUS): A new instrument to characterize

university STEM classroom practices. CBE-Life Sciences Education, 12 (4), 618-627.

Socha, A. (2013). A hierarchical approach to students’ assessments of instruction. Assess-

ment & Evaluation in Higher Education, 38 (1), 94-113.


Sojka, J., Gupta, A. K., & Deeter-Schmelz, D. R. (2002). Student and faculty perceptions

of student evaluations of teaching: A study of similarities and differences. College

Teaching , 50 (2), 44-49.

Speer, N., & Hald, O. (2008). How do mathematicians learn to teach? Implications from

research on teachers and teaching for graduate student professional development. In

M. P. Carlson & C. Rasmussen (Eds.), Making the connection: Research and practice in

undergraduate mathematics education (p. 305-218). Washington, D.C.: Mathematical

Association of America.

Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation

of teaching: The state of the art. Review of Educational Research, 83 (4), 598-642.

Retrieved from http://dx.doi.org/10.3102/0034654313496870

Streiner, D. L. (2003). Starting at the beginning: An introduction to coefficient alpha and

internal consistency. Journal of Personality Assessment , 80 (1), 99–103.

Thomas, S., Chie, Q. T., Abraham, M., Raj, S. J., & Beh, L.-S. (2014). A qualitative

review of literature on peer review of teaching in higher education: An application of

the SWOT framework. Review of Educational Research, 84 (1), 112-159.

Tucker, B., Jones, S., Straker, L., & Cole, J. (2003). Course evaluation on the web: Facili-

tating student and teacher reflection to improve learning. New Directions for Teaching

and Learning , 2003 (96), 81-93.

Wachtel, H. K. (1998). Student evaluation of college teaching effectiveness: A brief review.

Assessment & Evaluation in Higher Education, 23 (2), 191-212.

Walkington, C., Arora, P., Ihorn, S., Gordon, J., Walker, M., Abraham, L., & Marder,

M. (2012). Development of the UTeach observation protocol: A classroom observation

instrument to evaluate mathematics and science teachers from the UTeach preparation

program (Tech. Rep.). Retrieved 12/7/2015, from http://uteach.utexas.edu

Ware Jr, J. E., & Williams, R. G. (1975). The Dr. Fox effect: A study of lecturer effectiveness

and ratings of instruction. Academic Medicine, 50 (2), 149-56.


Weisberg, S. (2005). Applied linear regression (3rd ed.). John Wiley & Sons.

Wieman, C., & Gilbert, S. (2014). The teaching practices inventory: A new tool for charac-

terizing college and university teaching in mathematics and science. CBE-Life Sciences

Education, 13 (3), 552-569.

Wolf, E. J., Harrington, K. M., Clark, S. L., & Miller, M. W. (2013). Sample size requirements

for structural equation models an evaluation of power, bias, and solution propriety.

Educational and Psychological Measurement , 73 (6), 913–934.


APPENDIX A

OVERVIEW OF OBSERVATION PROTOCOLS

Mathematics Classroom Observation Protocol for Practices (MCOP2)

Subject: Mathematics
Sample Size: 127 Classroom Observations
Validated Grades: K-16
Brief Description: MCOP2 contains 16 items intended to measure two primary constructs, student engagement and teacher facilitation. Each item contains a full description of the item with specific requirements for each rating level.
Documented Drawbacks: Does not produce a fine-grained analysis. The MCOP2 was not designed to evaluate a teacher on a single observation due to the nature and complexity of teaching.
(Gleason & Cofer, 2014; Gleason et al., 2017)

Reformed Teaching Observation Protocol (RTOP)

Subject: Mathematics and Science
Sample Size: 87 observations of 141 classrooms
Validated Grades: Secondary and Postsecondary (2yr and 4yr)
Brief Description: RTOP is a 25 item classroom observation protocol that is standards based, inquiry oriented, and student centered. It requires a trained observer to rate on a Likert scale.
Documented Drawbacks: “Though a Likert scale may be helpful to a researcher in quantifying an observation, it is difficult for teachers to know what they need to do to improve from a 4 to 5.” (Marshall, Smart, & Horton, 2010) “Exploratory factor analysis showed that some but not all of the individual items within a given construct loaded together.” (Piburn & Sawada, 2000; Sawada et al., 2002) “RTOP places little emphasis on the accuracy and depth of the content being conveyed during a lesson.” (Walkington et al., 2012) “The observers must complete a multiday training program to achieve acceptable interrater reliability.” (Smith et al., 2013)
(Piburn & Sawada, 2000; Sawada et al., 2002)


Oregon Teacher Observation Protocol (OTOP)

Subject: Mathematics and Science
Sample Size: 123 observations of 41 classes and 50 classroom observations
Validated Grades: Postsecondary (public and private) and Secondary
Brief Description: OTOP is a 10-item protocol designed to generate a profile of what is happening across instructional settings rather than assigning a score to a particular lesson. Items are treated as nominal data.
Documented Drawbacks: “Despite its supposed reliability in Faculty Fellows mathematics classes, the OTOP’s scientific nature and lack of recent mathematical standards make it undesirable for use in college mathematics courses.” (Gleason & Cofer, 2014)
(Flick, Sadri, Morrell, Wainwright, & Schepige, 2009; Morrell, Wainwright, & Flick, 2004)

UTeach Observation Protocol (UTOP)

Subject: Mathematics and Science
Sample Size: 83 observations of 36 teachers
Validated Grades: Secondary
Brief Description: “The UTOP includes 32 classroom observation indicators organized into four sections: Classroom Environment, Lesson Structure, Implementation, and Math/Science Content. The indicators are rated by observers on a 7-point scale: 1 to 5 Likert with Don’t Know (DK) and Not Applicable (NA) options (for some items).” (Walkington et al., 2012)
Documented Drawbacks: “Besides the science-specific language, another drawback to the UTOP is it is solely based off of NCTM standards from 1991.” (Gleason & Cofer, 2014)
(Walkington et al., 2012)


Classroom Observation Protocol (COP)

Subject: Mathematics and Science
Sample Size: 1,610 lesson observations
Validated Grades: K-12
Brief Description: The COP contains several sections where observers describe and classify the major activities, materials, and purposes of a math or science lesson, and then it provides four sections where observers rate various aspects of classroom instruction using a Likert (1-5) scale.
Documented Drawbacks: “Due to the large number of evaluators, inter-rater reliability was an issue for classroom observation data. Also, this study was cross-sectional in nature so there are limitations in the design of this study.” (Bowes & Banilower, 2004)
(Bowes & Banilower, 2004)

Classroom Observation Protocol for Undergraduate STEM (COPUS)

Subject: Mathematics and Science (listed as STEM, but no mention of engineering or technology classroom testing)
Sample Size: 30 classroom observations
Validated Grades: Postsecondary
Brief Description: “COPUS documents classroom behaviors in 2-min intervals throughout the duration of the class session. It does not require observers to make judgments of teaching quality, and it produces clear graphical results. COPUS is limited to 25 codes in two categories (“What the students are doing” and “What the instructor is doing”) and can be reliably used by university faculty with only 1.5 hours of training.” (Smith et al., 2013)
Documented Drawbacks: “COPUS observations provided a measurement for only a single class period. From multiple COPUS observations of a single course, we know that it is not unusual to have substantial variations from one class to another.” (Wieman & Gilbert, 2014)
(Smith et al., 2013)


Teaching Dimensions Observation Protocol (TDOP)

Subject: Mathematics and Science
Sample Size: Inter-rater reliability results from TDOP training in the spring of 2012 do not include a sample size.
Validated Grades: Postsecondary (non-laboratory courses)
Brief Description: Six dimensions of practice comprise the TDOP: Teaching methods, Pedagogical strategies, Cognitive demand, Student-teacher interactions, Student engagement, and Instructional technology. Observers document the classroom behaviors with 46 codes in 2-min intervals throughout the class session.
Documented Drawbacks: “Requires substantial training, as one might expect for a protocol that was designed to be a complex research instrument.” (Smith et al., 2013) “TDOP does not aim to measure latent variables such as instructional quality, and it is not tied to external criterion such as reform-based teaching standards.” (Hora, Oleson, & Ferrare, 2013)
(Hora et al., 2013)

Classroom Assessment Scoring System - Secondary (CLASS-S)

Subject: General
Sample Size: 1,482 lesson observations (video)
Validated Grades: 6-11
Brief Description: CLASS is a tool for observing and assessing the effectiveness of interactions among teachers and students in classrooms. It measures the emotional, organizational, and instructional supports provided by teachers that contribute to children’s social, developmental, and academic achievement.
Documented Drawbacks: “Does not take into account teaching behaviors specific to the disciplines of mathematics and science, such as placing content in the “big picture” of the domain, supporting sense-making about concepts through real world connections, and appropriately and powerfully making use of tools of abstraction.” (Walkington et al., 2012)
(Allen et al., 2013)


Mathematical Quality of Instruction (MQI)

Subject: Mathematics
Sample Size: 10 teacher observations
Validated Grades: 2-6
Brief Description: MQI is designed to provide scores for teachers on important dimensions of classroom mathematics instruction. These dimensions include the richness of the mathematics, student participation in mathematical reasoning and meaning-making, and the clarity and correctness of the mathematics covered in class.
Documented Drawbacks: “Although there is a significant, strong, and positive association between levels of MKT (mathematical knowledge for teaching) and the mathematical quality of instruction, we also find that there are a number of important factors that mediate this relationship, either supporting or hindering teachers’ use of knowledge in practice.” (Hill et al., 2008)
(Hill et al., 2008)


APPENDIX B

DEMOGRAPHIC CHARACTERISTICS OF THE SAMPLE

A total of 110 observations of 72 instructors were conducted. Only 86 of the 110 observations have instructor demographic data because 15 instructors did not complete the demographics survey.
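A minimal sketch (Python with pandas) of how frequencies and percentages like those in Table 9 might be tabulated from the survey responses is given below; the column name and the category labels are illustrative assumptions, not the actual survey coding.

import pandas as pd

# Hypothetical survey responses; in the study each row would correspond to
# one of the 86 completed demographics surveys.
responses = pd.DataFrame({
    "job_title": ["Adjunct/Instructor", "Assistant Professor", "Full Professor",
                  "Graduate Teaching Assistant", "Assistant Professor"],
})

counts = responses["job_title"].value_counts()                  # frequencies
percents = (100 * counts / counts.sum()).round().astype(int)    # rounded percentages
print(pd.DataFrame({"Frequency": counts, "%": percents}))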


Table 9

Demographics Characteristics of the Sample

Values reported as Frequency (%).

Gender
  Male: 60 (55)
  Female: 50 (45)

Age Range
  18-24 years old: 2 (2)
  25-34 years old: 41 (48)
  35-44 years old: 15 (17)
  45-54 years old: 10 (12)
  55-64 years old: 14 (16)
  65 years and over: 4 (5)

Race/Ethnicity
  American Indian or Alaskan Native: 0 (0)
  Asian / Pacific Islander: 11 (13)
  Black or African American: 4 (5)
  Hispanic American: 2 (2)
  White / Caucasian: 69 (80)
  Multiple ethnicity / Other (please specify): 0 (0)

Level of Education
  Bachelor's degree: 2 (2)
  Master's degree: 27 (31)
  PhD: 54 (63)
  Other advanced degree beyond a Master's degree: 3 (3)

Job Title
  Graduate Teaching Assistant: 13 (15)
  Adjunct/Instructor: 24 (28)
  Assistant Professor: 24 (28)
  Associate Professor: 8 (9)
  Full Professor: 17 (20)

Number of Years Teaching at High School Level
  Less than one year: 65 (76)
  1-5 years: 10 (12)
  6-10 years: 5 (6)
  11-15 years: 1 (1)
  Over 15 years: 5 (6)

Number of Years Teaching at College Level
  Less than one year: 1 (1)
  1-5 years: 23 (27)
  6-10 years: 27 (31)
  11-15 years: 11 (13)
  Over 15 years: 24 (28)


APPENDIX C

INSTRUMENTS USED

Background Information

1. Institution:

2. Description of course (Calculus I, College Algebra, Analysis, etc.):

3. Gender of instructor:

4. Date of observation:

5. Time of observation:


Abbreviated Reformed Teaching Observation Protocol [1]

Inquiry Orientation

1. The lesson was designed to engage students as members of a learning community.

Score  Description
4  Lesson is designed to include both extensive teacher-student and student-student interactions.
3  Lesson is designed for continual interaction between teacher and students.
2  Classroom interactions are only teacher-student or student-student.
1  Lesson has limited opportunities to engage students (e.g., rhetorical questions or shout out opportunities).
0  This lesson is completely teacher-centered, lecture only.

2. Intellectual rigor, constructive criticism, and the challenging of ideas were valued.

Score  Description
4  Students debate ideas through a negotiation of meaning that results in strong use of evidence/arguments to support claims.
3  Students engaged in a teacher-guided but student-driven discussion (“debate”) involving one or more of the following: a variety of ideas, alternative interpretations, or alternative lines of reasoning.
2  Students participate in a teacher-directed whole-class discussion (debate) involving one or more of the following: a variety of ideas, alternative interpretations, or alternative lines of reasoning.
1  At least once the students respond (perhaps by “shout out”) to teacher’s queries regarding alternate ideas, alternative reasoning, or alternative interpretations.
0  Students were not asked to demonstrate rigor, offer criticisms, or challenge ideas.

3. This lesson encouraged students to seek and value alternative modes of investigation or of problem solving.

Score  Description
4  Lesson was designed for students to engage in alternative modes and a clear discussion of these alternatives occurs.
3  Lesson was designed for students to engage in alternative modes of investigation, but without subsequent discussion.
2  Lesson was designed for students to ask divergent questions, but not investigate.
1  Lesson was designed for instructor to ask divergent questions (teacher directed).
0  No alternative modes were explored during the lesson.

[1] Adapted from (Dayton Regional STEM Center, 2011; Walkington et al., 2012)


4. Students made predictions, estimations and/or hypotheses and devised means for testing them.

Score  Description
4  The students explicitly make, write down or depict, and explain their prediction, estimation and/or hypothesis. Students devise a means for testing their prediction, estimation and/or hypothesis.
3  Students discuss predictions. Means for testing is highly suggested.
2  Teacher may ask students to predict and wait for input (class as a whole or as pairs, etc.). No means for testing.
1  Teacher may ask class to predict as a whole, but doesn’t wait for a response. No means for testing.
0  No opportunities for any predictions (students explaining what happened does not mean predicting).

5. The teacher acted as a resource person, working to support and enhance student investigations.

Score  Description
4  Students are actively engaged in the learning process; students determine what and how, and the teacher is available to help. The teacher uses student investigation or questions to direct the inquiry process.
3  Students have freedom, but within confines of teacher-directed boundaries. Student led. Teacher answers questions instead of directing inquiry.
2  Primarily directed by teacher with occasional opportunities for students to guide the direction.
1  Very teacher directed, limited student investigation, very rote.
0  No investigations (activity that engages students to apply content through problem solving). Lecture based.


Content Propositional Knowledge

6. The lesson involved fundamental concepts of the subject.

Score  Description
4  The content covered and/or tasks, examples or activities chosen by the teacher were clearly and explicitly related to significant concepts to gain a deeper understanding and make worthwhile connections to the mathematical or scientific ideas.
3  The content covered and/or tasks, examples or activities chosen by the teacher were clearly related to the significant content of the course, and the tasks, examples or activities that were used allowed for development of worthwhile connections to the mathematical or scientific ideas.
2  The content covered was significant and relevant to the content of the course, but the presentation, tasks, examples or activities chosen were prescriptive, superficial or contrived and did not allow the students to make meaningful connections to mathematical or scientific ideas.
1  The content covered and/or tasks, examples or activities chosen by the teacher were distantly or only sometimes related to the content of the course. This item should also be rated a 1 if the content chosen was developmentally inappropriate: either too low-level or too advanced for the students.
0  The content covered and/or tasks, examples or activities chosen by the teacher were unrelated to the content of the course.

7. The lesson promoted strongly coherent conceptual understanding.

Score  Description
4  Lesson is presented in a clear and logical manner, relation of content to concepts is clear throughout and it flows from beginning to end.
3  Lesson is predominantly presented in a clear and logical fashion, but relation of content to concepts is not always obvious.
2  Lesson may be clear and/or logical but relation of content to concepts is very inconsistent (or vice versa).
1  Lesson is disjointed and not consistently focused on the concepts.
0  Not presented in any logical manner, lacks clarity and no connections between material.


8. The teacher had a solid grasp of the subject matter content inherent in the lesson.

Score  Description
4  The teacher clearly understood the content and how to successfully communicate the content to the class. The teacher was able to present interesting and relevant examples, explain concepts in multiple ways, facilitate discussions, connect it to the big ideas of the discipline, use advanced questioning strategies to guide student learning, and identify and use common misconceptions or alternative ideas as learning tools.
3  The teacher clearly understood the content and how to successfully communicate the content to the class. The teacher used multiple examples and strategies to engage students with the content.
2  There were no issues with the teacher’s understanding of the content and its accuracy, but the teacher was not always fluid or did not try to present the content in multiple ways. When students appeared confused, the teacher was unable to re-teach the content in a completely clear, understandable, and/or transparent way such that most students understood.
1  There were several smaller issues with the teacher’s understanding and/or communication of the content that sometimes had a negative impact on student learning.
0  There was a significant issue with the teacher’s understanding and/or communication of the content that negatively impacted student learning during the class.

9. Elements of abstraction (i.e., symbolic representations, theory building) were encouraged when it was important to do so.

Score  Description
4  Abstraction is being used for a relevant and useful purpose. A variety of representations were used to build the lesson and to support/develop the content. The abstractions are presented in a way such that they are understandable and accessible to the class.
3  Teacher uses a variety of abstractions throughout the lesson, and occasionally explains them in a manner that supports/develops the content. Perhaps there was a small missed opportunity with respect to facilitating students’ understanding of abstraction.
2  The teacher’s use of abstraction was adequate. Teacher uses a variety of abstractions throughout the lesson, but does not explain them in a manner that supports/develops the content.
1  The teacher neglects important explanation and discussion of abstraction that is being used during the class, and this missed opportunity has a negative impact on student learning.
0  There was a major issue with the teacher’s use of abstraction or no abstraction was presented. This had a negative impact on student learning during the class.


10. Connections with other content disciplines and/or real world phenomena were explored and valued.

Score  Description
4  Throughout the class, the content was taught in the context of its use in other disciplines, other areas of mathematics/science, or in the real world, and the teacher clearly had deep knowledge about how the content is used in those areas.
3  The teacher included one or more connections between the content and another discipline/real world, and the teacher engaged the students in an extended discussion or activity relating to these connections.
2  The teacher connected the content being learned to another discipline/real world, and the teacher explicitly brought this connection to students’ attention.
1  A minor connection was made to another area of mathematics/science, to another discipline, or to real-world contexts, but generally abstract or not helpful for content comprehension. (For example, word problems that can be solved without the context of the problem.)
0  No connections were made to other areas of mathematics/science or to other disciplines, or connections were made that were inappropriate or incorrect.


Mathematics Classroom Observation Protocol for Practices Descriptors [2]

1. Students engaged in exploration/investigation/problem solving.

Score  Description
3  Students regularly engaged in exploration, investigation, or problem solving. Over the course of the lesson, the majority of the students engaged in exploration/investigation/problem solving.
2  Students sometimes engaged in exploration, investigation, or problem solving. Several students engaged in problem solving, but not the majority of the class.
1  Students seldom engaged in exploration, investigation, or problem solving. This tended to be limited to one or a few students engaged in problem solving while other students watched but did not actively participate.
0  Students did not engage in exploration, investigation, or problem solving. There were either no instances of investigation or problem solving, or the instances were carried out by the teacher without active participation by any students.

2. Students used a variety of means (models, drawings, graphs, concrete materials, manipulatives, etc.) to represent concepts.

Score  Description
3  The students manipulated or generated two or more representations to represent the same concept, and the connections across the various representations, relationships of the representations to the underlying concept, and applicability or the efficiency of the representations were explicitly discussed by the teacher or students, as appropriate.
2  The students manipulated or generated two or more representations to represent the same concept, but the connections across the various representations, relationships of the representations to the underlying concept, and applicability or the efficiency of the representations were not explicitly discussed by the teacher or students.
1  The students manipulated or generated one representation of a concept.
0  There were either no representations included in the lesson, or representations were included but were exclusively manipulated and used by the teacher. If the students only watched the teacher manipulate the representation and did not interact with a representation themselves, it should be scored a 0.

[2] Reprinted by permission from (Gleason, Livers, & Zelkowski, 2015)


3. Students were engaged in mathematical activities.

Score  Description
3  Most of the students spend two-thirds or more of the lesson engaged in mathematical activity at the appropriate level for the class. It does not matter if it is one prolonged activity or several shorter activities. (Note that listening and taking notes does not qualify as a mathematical activity unless the students are filling in the notes and interacting with the lesson mathematically.)
2  Most of the students spend more than one-quarter but less than two-thirds of the lesson engaged in appropriate level mathematical activity. It does not matter if it is one prolonged activity or several shorter activities.
1  Most of the students spend less than one-quarter of the lesson engaged in appropriate level mathematical activity. There is at least one instance of students’ mathematical engagement.
0  Most of the students are not engaged in appropriate level mathematical activity. This could be because they are never asked to engage in any activity and spend the lesson listening to the teacher and/or copying notes, or it could be because the activity they are engaged in is not mathematical, such as a coloring activity.

4. Students critically assessed mathematical strategies.

Score  Description
3  More than half of the students critically assessed mathematical strategies. This could have happened in a variety of scenarios, including in the context of partner work, small group work, or a student making a comment during direct instruction or individually to the teacher.
2  At least two but less than half of the students critically assessed mathematical strategies. This could have happened in a variety of scenarios, including in the context of partner work, small group work, or a student making a comment during direct instruction or individually to the teacher.
1  An individual student critically assessed mathematical strategies. This could have happened in a variety of scenarios, including in the context of partner work, small group work, or a student making a comment during direct instruction or individually to the teacher. The critical assessment was limited to one student.
0  Students did not critically assess mathematical strategies. This could happen for one of three reasons: 1) No strategies were used during the lesson; 2) Strategies were used but were not discussed critically. For example, the strategy may have been discussed in terms of how it was used on the specific problem, but its use was not discussed more generally; 3) Strategies were discussed critically by the teacher but this amounted to the teacher telling the students about the strategy(ies), and students did not actively participate.


5. Students persevered in problem solving.

Score  Description
3  Students exhibited a strong amount of perseverance in problem solving. The majority of students looked for entry points and solution paths, monitored and evaluated progress, and changed course if necessary. When confronted with an obstacle (such as how to begin or what to do next), the majority of students continued to use resources (physical tools as well as mental reasoning) to continue to work on the problem.
2  Students exhibited some perseverance in problem solving. Half of students looked for entry points and solution paths, monitored and evaluated progress, and changed course if necessary. When confronted with an obstacle (such as how to begin or what to do next), half of students continued to use resources (physical tools as well as mental reasoning) to continue to work on the problem.
1  Students exhibited minimal perseverance in problem solving. At least one student but less than half of students looked for entry points and solution paths, monitored and evaluated progress, and changed course if necessary. When confronted with an obstacle (such as how to begin or what to do next), at least one student but less than half of students continued to use resources (physical tools as well as mental reasoning) to continue to work on the problem. There must be a road block to score above a 0.
0  Students did not persevere in problem solving. This could be because there was no student problem solving in the lesson, or because when presented with a problem solving situation no students persevered. That is to say, all students either could not figure out how to get started on a problem, or when they confronted an obstacle in their strategy they stopped working.


6. The lesson involved fundamental concepts of the subject to promote relational/conceptual understanding.

Score  Description
3  The lesson includes fundamental concepts or critical areas of the course, as described by the appropriate standards, and the teacher/lesson uses these concepts to build relational/conceptual understanding of the students with a focus on the “why” behind any procedures included.
2  The lesson includes fundamental concepts or critical areas of the course, as described by the appropriate standards, but the teacher/lesson misses several opportunities to use these concepts to build relational/conceptual understanding of the students with a focus on the “why” behind any procedures included.
1  The lesson mentions some fundamental concepts of mathematics, but does not use these concepts to develop the relational/conceptual understanding of the students. For example, in a lesson on the slope of the line, the teacher mentions that it is related to ratios, but does not help the students to understand how it is related and how that can help them to better understand the concept of slope.
0  The lesson consists of several mathematical problems with no guidance to make connections with any of the fundamental mathematical concepts. This usually occurs with a teacher focusing on the procedure of solving certain types of problems without the students understanding the “why” behind the procedures.

7. The lesson promoted modeling with mathematics.

Score  Description
3  Modeling (using a mathematical model to describe a real-world situation) is an integral component of the lesson with students engaged in the modeling cycle (as described in the Common Core State Standards).
2  Modeling is a major component, but the modeling has been turned into a procedure (i.e., a group of word problems that all follow the same form and the teacher has guided the students to find the key pieces of information and how to plug them into a procedure); or modeling is not a major component, but the students engage in a modeling activity that fits within the corresponding standard of mathematical practice.
1  The teacher describes some type of mathematical model to describe real-world situations, but the students do not engage in activities related to using mathematical models.
0  The lesson does not include any modeling with mathematics.


8. The lesson provided opportunities to examine mathematical structure. (Symbolic notation, patterns, generalizations, conjectures, etc.)

Score  Description
3  The students have a sufficient amount of time and opportunity to look for and make use of mathematical structure or patterns.
2  Students are given some time to examine mathematical structure, but are not allowed adequate time or are given too much scaffolding so that they cannot fully understand the generalization.
1  Students are shown generalizations involving mathematical structure, but have little opportunity to discover these generalizations themselves or adequate time to understand the generalization.
0  Students are given no opportunities to explore or understand the mathematical structure of a situation.

9. The lesson included tasks that have multiple paths to a solution or multiple solutions.

Score  Description
3  A lesson which includes several tasks throughout, or a single task that takes up a large portion of the lesson, with multiple solutions and/or multiple paths to a solution and which increases the cognitive level of the task for different students.
2  Multiple solutions and/or multiple paths to a solution are a significant part of the lesson, but are not the primary focus, or are not explicitly encouraged; or more than one task has multiple solutions and/or multiple paths to a solution that are explicitly encouraged.
1  Multiple solutions and/or multiple paths minimally occur, and are not explicitly encouraged; or a single task has multiple solutions and/or multiple paths to a solution that are explicitly encouraged.
0  A lesson which focuses on a single procedure to solve certain types of problems and/or strongly discourages students from trying different techniques.

10. The lesson promoted precision of mathematical language.

Score  Description
3  The teacher “attends to precision” in regards to communication during the lesson. The students also “attend to precision” in communication, or the teacher guides students to modify or adapt non-precise communication to improve precision.
2  The teacher “attends to precision” in all communication during the lesson, but the students are not always required to also do so.
1  The teacher makes a few incorrect statements or is sloppy about mathematical language, but generally uses correct mathematical terms.
0  The teacher makes repeated incorrect statements or uses incorrect names for mathematical objects instead of their accepted mathematical names.


11. The teacher’s talk encouraged student thinking.

Score  Description
3  The teacher’s talk focused on high levels of mathematical thinking. The teacher may ask lower level questions within the lesson, but this is not the focus of the practice. There are three possibilities for high levels of thinking: analysis, synthesis, and evaluation. Analysis: examines/interprets the pattern, order or relationship of the mathematics; parts of the form of thinking. Synthesis: requires original, creative thinking. Evaluation: makes a judgment of good or bad, right or wrong, according to the standards he/she values.
2  The teacher’s talk focused on mid-levels of mathematical thinking. Interpretation: discovers relationships among facts, generalizations, definitions, values and skills. Application: requires identification and selection and use of appropriate generalizations and skills.
1  Teacher talk consists of “lower order” knowledge based questions and responses focusing on recall of facts. Memory: recalls or memorizes information. Translation: changes information into a different symbolic form or situation.
0  Any questions/responses of the teacher related to mathematical ideas were rhetorical in that there was no expectation of a response from the students.

12. There were a high proportion of students talking related to mathematics.

Score  Description
3  More than three quarters of the students were talking related to the mathematics of the lesson at some point during the lesson.
2  More than half, but less than three quarters of the students were talking related to the mathematics of the lesson at some point during the lesson.
1  Less than half of the students were talking related to the mathematics of the lesson.
0  No students talked related to the mathematics of the lesson.

13. There was a climate of respect for what others had to say.

Score  Description
3  Many students are sharing, questioning, and commenting during the lesson, including their struggles. Students are also listening (active), clarifying, and recognizing the ideas of others.
2  The environment is such that some students are sharing, questioning, and commenting during the lesson, including their struggles. Most students listen.
1  Only a few share as called on by the teacher. The climate supports those who understand or who behave appropriately. Or some students are sharing, questioning, or commenting during the lesson, but most students are actively listening to the communication.
0  No students shared ideas.


14. In general, the teacher provided wait-time.

Score  Description
3  The teacher frequently provided an ample amount of “think time” for the depth and complexity of a task or question posed by either the teacher or a student.
2  The teacher sometimes provided an ample amount of “think time” for the depth and complexity of a task or question posed by either the teacher or a student.
1  The teacher rarely provided an ample amount of “think time” for the depth and complexity of a task or question posed by either the teacher or a student.
0  The teacher never provided an ample amount of “think time” for the depth and complexity of a task or question posed by either the teacher or a student.

15. Students were involved in the communication of their ideas to others (peer-to-peer).

Score  Description
3  Considerable time (more than half) was spent with peer to peer dialog (pairs, groups, whole class) related to the communication of ideas, strategies and solutions.
2  Some class time (less than half, but more than just a few minutes) was devoted to peer to peer (pairs, groups, whole class) conversations related to the mathematics.
1  Minimal class time was spent on peer to peer (pairs, groups, whole class) conversations. A few instances developed where this occurred during the lesson but only lasted less than 5 minutes.
0  No peer to peer (pairs, groups, whole class) conversations occurred during the lesson.

16. The teacher uses student questions/comments to enhance conceptual mathematical understanding.

Score  Description
3  The teacher frequently uses student questions/comments to coach students, to facilitate conceptual understanding, and boost the conversation. The teacher sequences the student responses that will be displayed in an intentional order, and/or connects different students’ responses to key mathematical ideas.
2  The teacher sometimes uses student questions/comments to enhance conceptual understanding.
1  The teacher rarely uses student questions/comments to enhance conceptual mathematical understanding. The focus is more on procedural knowledge of the task versus conceptual knowledge of the content.
0  The teacher never uses student questions/comments to enhance conceptual mathematical understanding.


Note Taking Form

Observation number

Random Order

Date and Time:

Class name/description:

Number of Students:

1. Are students engaged? How many students are actively participating in the lesson?

(a) Exploring and problem solving

(b) Using a variety of means (abstractions)

(c) Assessing mathematical strategy

(d) Overcoming road blocks

2. What is the interaction between student and teacher? Between student peers?

(a) Talking related to mathematics (How many?)

(b) Respecting others’ ideas (How many sharing and/or listening?)

3. How is the content presented?

(a) Lesson structure (Direct lecture, discussion/debate, student led)

(b) Alternative methods (Multiple paths to a solution or multiple solutions)

(c) Abstractions connected

(d) Wait time provided to reason, make sense, and articulate

4. What is the content covered or task, examples, and activities?

(a) Fundamental (What to do and Why?)


(b) Added value and relevance

(c) Examined math structure (generalizations examined)

(d) Connected and flowed smoothly

(e) Connected with other areas of mathematics, other disciplines, or real world

5. Did the instructor have a solid grasp of the material?

(a) Used precision of mathematical language

(b) Enhanced content with student comments

(c) Talk encouraged student thinking (level)


APPENDIX D

REGRESSION MODELS AND RESIDUAL PLOTS
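Figures 6 through 11 below show the fitted regression models and residual plots for the pairs of subscale scores. For readers who wish to produce similar diagnostics, a minimal sketch is included here (Python with statsmodels and matplotlib); the simulated data, variable names, and coefficients are illustrative assumptions, not the study's actual data or code.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Hypothetical subscale scores for 110 observations (illustration only).
rng = np.random.default_rng(0)
inquiry_orientation = rng.uniform(0, 20, size=110)
student_engagement = 0.8 * inquiry_orientation + rng.normal(0, 2, size=110)

# Simple linear regression of one subscale on the other.
X = sm.add_constant(inquiry_orientation)      # design matrix with intercept
model = sm.OLS(student_engagement, X).fit()   # ordinary least squares fit
print(model.summary())

# Residuals versus fitted values, as in the residual plots of this appendix.
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residual plot: Student Engagement vs. Inquiry Orientation")
plt.show()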

Figure 6: Regression Model 1: Student Engagement and Inquiry Orientation

Figure 7: Regression Model 2: Student Engagement and Content Propositional Knowledge

Figure 8: Regression Model 3: Teacher Facilitation and Inquiry Orientation

Figure 9: Regression Model 4: Teacher Facilitation and Content Propositional Knowledge

Figure 10: Regression Model 5: Inquiry Orientation and Content Propositional Knowledge

Figure 11: Regression Model 6: Student Engagement and Teacher Facilitation

APPENDIX E

IRB CERTIFICATIONS

See following pages for copies of IRB Certifications.


Office for Research

Institutional Review Board for the Protection of Human Subjects

THE UNIVERSITY OF

ALABAMA RESEARCH

358 Rose Administration Building Box 870127

Tuscaloosa, Alabama 35487-0127  (205) 348-8461

Fax (205) 348-7189  Toll Free (877) 820-3066

February 8, 2016

Laura Watley, M.A., Department of Mathematics, College of Arts & Sciences, The University of Alabama, Box 870350

Re: IRB # EX-16-CM-015 "Structural Validity and Reliability of Two Observation Protocols in College Mathematics"

Dear Ms. Watley:

The University of Alabama Institutional Review Board has granted approval for your proposed research.

Your protocol has been given exempt approval according to 45 CFR part 46.101(b)(1) as outlined below:

(1) Research conducted in established or commonly accepted educational settings, involving normal educational practices, such as (i) research on regular and special education instructional strategies, or (ii) research on the effectiveness of or the comparison among instructional techniques, curricula, or classroom management methods.

Your application will expire on February 7, 2017. If your research will continue beyond this date, complete the relevant portions of the Continuing Review and Closure Form. If you wish to modify the application, complete the Modification of an Approved Protocol Form. When the study closes, complete the appropriate portions of FORM: Continuing Review and Closure.

Should you need to submit any further correspondence regarding this proposal, please include the assigned IRB application number.

Good luck with your research.

Sincerely,


THE UNIVERSITY OF

ALABAMA

January 11, 2017

Laura Watley, M.A., Department of Mathematics, College of Arts & Sciences, The University of Alabama, Box 870350

Office of the Vice President for

Research & Economic Development Office for Research Compliance

Re: IRB # EX-16-CM-015-R1 "Structural Validity and Reliability of Two Observation Protocols in College Mathematics"

Dear Ms. Watley:

The University of Alabama Institutional Review Board has granted approval for your renewal application. Your renewal application has been given exempt approval according to 45 CFR part 46.101(b)(1) as outlined below:

(1) Research conducted in established or commonly accepted educational settings, involving normal educational practices, such as (i) research on regular and special education instructional strategies, or (ii) research on the effectiveness of or the comparison among instructional techniques, curricula, or classroom management methods.

Your application will expire on January 10, 2018. If your research will continue beyond this date, complete the relevant portions of the Continuing Review and Closure Form. If you wish to modify the application, complete the Modification of an Approved Protocol Form. When the study closes, complete the appropriate portions of FORM: Continuing Review and Closure.

Should you need to submit any further correspondence regarding this proposal, please include the assigned IRB application number.

Good luck with your research.

Sincerely,

358 Rose Administration Building | Box 870127 | Tuscaloosa, AL 35487-0127

205-348-8461 | Fax 205-348-7189 | Toll Free 1-877-820-3066
