Handbook For Professional Development In Assessment
Literacy: A Resource For States and School Districts
NCSA Conference
June 20, 2013
Project of the Title I Comprehensive
Assessment Systems CCSSO SCASS
Wayne Neuburger, PhD, Advisor
REVISION OF THE HANDBOOK FOR PROFESSIONAL DEVELOPMENT IN ASSESSMENT LITERACY
T1-CAS Mission
• The Title I Comprehensive Assessment Systems SCASS supports states in using assessment and accountability systems to improve education in schools that use ESEA funds.
• As a national consortium of assessment and Title I professionals, T1-CAS addresses issues in standards, assessment, and accountability systems, and the effects of these systems on the education of Title I students.
Presenters
• Moderator: Wayne Neuburger, Consultant
• Project Director: Jan Sheinker, Consultant
• Author: Doris Redfield, Consultant
• Reviewer: Beth Cipoletti, West Virginia Dept of Ed
• Discussant: Elizabeth Davis, Alaska Dept of Ed
Jan Sheinker, Ed.D., and Doris Redfield, Ph.D.
Developed under the direction of Phoebe C. Winter, Project Director,
and the Professional Development for Assessment Literacy Study Group,
Comprehensive Assessment Systems for IASA Title I State Collaborative on Assessment and Student Standards,
Council of Chief State School Officers
Purpose (2001 and 2013)
• The Handbook is intended to provide a resource to states and districts as they deploy state and district assessment systems aligned with standards for the purposes of improving student learning through accountability and school improvement.
• We hope that this document also serves as a resource for informing the many constituents of education about the purposes of assessment and the importance of an aligned assessment system to the overall educational system.
SCASS GROUPS PROVIDING INPUT
• Formative Assessment for Students and Teachers (FAST)
• Assessing Special Education Students (ASES)
• English Language Learners (ELL)
• Accountability Systems and Reporting (ASR)
• Technical Issues in Large-Scale Assessment (TILSA)
Need for Revision
• Significant changes have occurred in assessment systems since 2001.
• Accountability systems have changed since 2001.
• Assessment technical issues have progressed to address new assessment and accountability systems.
• Distribution systems are more sophisticated.
Organization – Actually FOUR Documents
Each document is described by its format, its users, and its uses:
• For states, districts, schools, and other interested audiences: to preserve the original document.
• Word and PPT (static & animated), for state and district PD presenters: for customization by individual states, districts, & schools as PD scripts and handouts for PD presenters.
• For individual states, districts, & schools: for customization by individual states, districts, & schools as district and school newsletter inserts to inform parents and community.
Table of Contents
Guide to Using the Handbook
Chapter One: Why Build an Assessment System?
Chapter Two: What Is Technical Quality?
Chapter Three: How Are the Purposes of Assessment Related to Technical Quality?
Chapter Four: How Are the Uses of Assessment Related to Technical Quality?
Chapter Five: How Do Schools/States Report Results in Proper Context?
Chapter Six: How Should Results Be Used to Make Decisions?
Appendices
Glossary
References
Guide to Using the Handbook
Using the Handbook
Audiences for the Handbook
State Department Personnel
Schoolwide Planning Participants
Local School Boards and Administrators
Legislative Committees
Regional Professional Development Participants
Members of Professional Associations
Pre-Service and Graduate Students in Education
Community Members
Customizing
Conclusion
Chapter One: Why Build an Assessment System?
• Standards Aligned Assessment Systems
– How have aligned assessment systems evolved?
– How have Common Core State Standards influenced the development of assessment systems?
• Relationships of Tests to Assessment Systems to Accountability Systems
– What are comprehensive assessment systems?
– What are the purposes and relationships among formative assessment strategies and classroom, school, district, and state tests?
– How do assessment systems relate to accountability systems?
– Why use an assessment system instead of individual tests to set goals and make decisions?
• Developing a Comprehensive System
– What is the role of formative assessment strategies in a comprehensive system?
– What is the role of classroom tests in a comprehensive system?
– What is the role of interim and benchmark tests in a comprehensive system?
– What is the role of district tests in a comprehensive system?
– What is the role of English language proficiency tests in a comprehensive system?
– What is the role of state tests in a comprehensive system?
• Assessment Systems and the Classroom
– How do assessment and accountability systems relate to the way we do business in classrooms?
– Who does what in standards-based schools?
Chapter Two: What Is Technical Quality?
• Aspects of Technical Quality
• Validity
– Why is validity important?
– How can validity be increased?
• Reliability
• Fairness, Bias and Accessibility
• Comparability
• Procedures for test administration, scoring, data analysis, and reporting
• Evaluation of the technical quality of accommodations
Chapter Three: How Are the Purposes of Assessment Related to Technical Quality?
• Alignment of Assessments with Standards
– What is alignment with the purpose of the assessment?
– Why are specific characteristics important to alignment?
– Why are vertical and horizontal alignment important?
• Accountability Purposes
– Why is accountability needed?
– Who is accountable for student learning?
– For what should schools be held accountable?
– What is accountability for growth?
• Assessment Purposes
– Why do assessment systems include different types or combinations of tests?
– What are norm-referenced tests?
– What are standards-based tests?
– What are augmented assessment systems?
– What are computer adaptive tests?
– What are APIP-enabled, technology-enhanced assessment systems?
Chapter Four: How Are the Uses of Assessment Results Related to Technical Quality?
• Using Results Appropriately to Avoid Misuses
– What are appropriate uses and potential misuses of standards-based tests?
– What are appropriate uses and potential misuses of norm-referenced tests?
– What are appropriate uses and potential misuses of interim/benchmark tests?
– What are appropriate uses and potential misuses of classroom tests?
– What are appropriate uses and potential misuses of formative assessment strategies?
• Usability of Results
– What should be considered in determining the usability of results?
– What factors affect the credibility of results?
– What factors affect the accuracy of score interpretation?
– What results are needed to adjust instruction?
– What factors affect the usefulness of results for teacher and leader evaluation?
Chapter Five: How Do Schools/States Report Results in Proper Context?
• Reporting Results
– How are indicators selected for reporting?
– What indicators provide direct measures of student achievement of standards?
– What other indicators provide direct measures of student knowledge?
– What student learning indicators provide indirect measures of student performance?
• Reporting Related Indicators
– What indicators provide measures of opportunity to learn?
– What context variables affect student learning?
• Cautions for Reporting Results
• Sampling and Sample Size
– How does sampling affect the technical quality of a test?
– How do population and sample size affect the technical quality of a test?
Chapter Six: How Should Results Be Used to Make Decisions?
• Using Results to Make Decisions
• Using Results to Make Decisions About School Improvement
– How can the results be used to profile student performance?
– How is the profile used to set school improvement goals?
– What is considered in developing a plan to achieve the goals?
– How are school improvement results monitored and documented?
• Using Results to Make Decisions About Policy Changes and Evaluating Program Effectiveness
– How can the system be monitored and evaluated?
– How can results be used to make policy decisions concerning resources?
– How can the delivery system be altered to improve results?
• Using Results to Make Decisions About Rewards and Sanctions
Technical Quality
(Chapter 2: Handbook for Professional Development in Assessment Literacy)
Doris Redfield, Ph.D.
304-344-3083
Contents
• Aspects of Technical Quality
• Validity
– Importance
– Ways to increase
• Reliability
• Fairness, Bias, & Accessibility
• Comparability
• Procedures: Test Administration, Scoring, Data Analysis, Reporting
• Evaluation of Accommodations Use
What Is Technical Quality (TQ)?
• The integrity of each assessment, instrument, process, & procedure for contributing to fair and defensible decisions.
Aspects of TQ
• Validity (accuracy – the test measures what it purports to measure)
• Reliability (consistency – whatever is measured is consistently measured across circumstances)
• Other
– Fairness & accessibility
– Comparability
– Procedures for test administration, scoring, data analysis & reporting
– Interpretation & use of results
TECHNICAL QUALITY IN ASSESSMENT SYSTEMS
• Rigorous standards and standards-aligned tests
• Valid, reliable, comparable, fair, and accessible tests that include ALL students
• Technically sound administration and scoring
• Accurate reporting and interpretation of results
Additions and/or Increased Emphasis
• Methodologies
– Comparability
– Generalizability
– Standard Setting
• Attention to
– Peer Review Guidance
– Joint Standards for Educational & Psychological Testing
– Evaluation of the TQ of Accommodations
VALIDITY
• Are we measuring what we say we are measuring?
• Can we make valid interpretations/inferences?
Validity
• Four broad categories (Standards for Educational & Psychological Testing):
1. Test content (content validity)
2. Relationship to other variables or criteria
3. Student response processes
4. Internal structure of the assessment
Validity: Peer Review Guidance Questions (Section 4.2)
Has the State
• Specified the purposes of the assessments . . . ?
• Ascertained that the assessments measure the knowledge and skills described in its academic content standards . . . ?
• Ascertained that its assessment items are tapping the intended cognitive processes & are at the appropriate grade level?
• Ascertained that scoring and reporting structures are consistent with the sub-domain structures of its academic content standards?
• Ascertained that test & item scores are related to outside variables as intended?
• Ascertained that the decisions based on assessment results are consistent with the purposes for which the assessments were designed?
• Ascertained whether the assessment produces intended and unintended consequences?
KEY: Provide clear & explicit documentation of evidence
INCREASING VALIDITY
• Multiple measures
• Face validity
• Content validity: breadth; depth; cognitive complexity; range of knowledge & skills, including thinking skills, represented by the content standards
• Construct validity
– Evidence of convergent criterion validity or the relationship of particular items to the other items on the same test/assessment (internal consistency reliability); see the sketch following this slide
– Evidence of divergent criterion validity
– Expert review
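The construct validity bullets above mention internal consistency and convergent criterion validity. As a minimal illustrative sketch, and not material from the Handbook, the Python fragment below computes Cronbach's alpha for a set of items and a simple correlation with an outside criterion; all item scores, criterion values, and names are hypothetical.

```python
# Hypothetical sketch: two kinds of evidence named on this slide (made-up data).
# Rows are students, columns are items on the same test.
import statistics

item_scores = [
    [1, 1, 0, 1],   # student A
    [0, 1, 0, 0],   # student B
    [1, 1, 1, 1],   # student C
    [0, 0, 0, 1],   # student D
]

def cronbach_alpha(matrix):
    """Internal consistency: do items on the same test hang together?"""
    k = len(matrix[0])                                   # number of items
    item_vars = [statistics.pvariance(col) for col in zip(*matrix)]
    total_var = statistics.pvariance([sum(row) for row in matrix])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def correlation(x, y):
    """Convergent criterion validity: relationship to an outside measure."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

totals = [sum(row) for row in item_scores]
criterion = [78, 55, 92, 60]      # hypothetical scores on a related outside measure

print(f"Cronbach's alpha: {cronbach_alpha(item_scores):.2f}")
print(f"Convergent validity coefficient: {correlation(totals, criterion):.2f}")
```

In practice such coefficients come from the test contractor's technical documentation; the point here is only how the two kinds of evidence differ.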
Reliability
• Test reliability: reliability coefficients; standard errors of measurement (SEMs); confidence bands (illustrated in the sketch following this slide)
• Rater reliability: inter- & intra-rater
• Standard setting and reliability of cut scores
– At least 3 levels required
– Descriptors for each level required
– Methods: Angoff (& modifications); Bookmark; Body of Work
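To make the first bullet concrete, here is a minimal sketch, assuming a hypothetical reliability coefficient and scale score statistics (none of these numbers come from the Handbook), showing how a standard error of measurement and a confidence band are typically derived.

```python
# Hypothetical sketch: classical relationship among reliability, SEM, and a
# confidence band around one student's observed score.
import math

reliability = 0.91        # reported reliability coefficient (hypothetical)
score_sd = 12.0           # standard deviation of the scale scores (hypothetical)
observed_score = 430      # one student's scale score (hypothetical)

sem = score_sd * math.sqrt(1 - reliability)              # classical SEM formula
low, high = observed_score - 1.96 * sem, observed_score + 1.96 * sem

print(f"SEM = {sem:.1f} scale score points")
print(f"Approximate 95% confidence band: {low:.0f} to {high:.0f}")
```

A reader of a score report can treat the band, rather than the single number, as the student's likely score range.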
STANDARD SETTING METHODS
• Angoff (worked sketch after this list)
– Panel of content experts
– Sum of item averages
– Raw cut score
• Bookmark
– Statistical item difficulty
– Ordered item booklets reviewed by panel of raters
– Panel “bookmarks” cut points
– IRT-based cut scores
• Body of Work
– Student response booklets reviewed by panel of content experts
– Identify knowledge and skills
– Assign achievement level
– Final cut score recommendations
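As a hypothetical worked example, not drawn from the Handbook, the sketch below shows the arithmetic the Angoff row summarizes: each panelist estimates, item by item, the probability that a minimally proficient student answers correctly; the estimates are averaged per item, and the item averages are summed to a recommended raw cut score. Panelist names and ratings are invented.

```python
# Hypothetical Angoff-style arithmetic ("sum of item averages -> raw cut score").
from statistics import mean

panelist_ratings = {                      # item-level probability estimates
    "panelist_1": [0.60, 0.45, 0.80, 0.70, 0.55],
    "panelist_2": [0.65, 0.50, 0.75, 0.60, 0.50],
    "panelist_3": [0.55, 0.40, 0.85, 0.65, 0.60],
}

# Average the estimates for each item across panelists...
item_averages = [mean(col) for col in zip(*panelist_ratings.values())]

# ...then sum the item averages to get the recommended raw cut score.
raw_cut_score = sum(item_averages)

print(f"Item averages: {[round(v, 2) for v in item_averages]}")
print(f"Recommended raw cut score: {raw_cut_score:.1f} of {len(item_averages)} points")
```

Operational standard settings add panelist training, multiple rounds, and impact data; this shows only the core computation.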
Reliability Cont'd . . .
• Generalizability: extent to which reliability findings can be replicated across situations
• Purpose specific
• Peer Review Guidance Questions (Section 4.2) – Has the State
– Determined the reliability of the scores it reports . . . ?
– Quantified & documented the conditional SEM and student classification consistency at each cut score in the State's academic achievement standards?
– Reported generalizability evidence for all relevant sources . . . ?
KEY: Provide clear & explicit documentation of evidence
Fairness, Bias, & Accessibility
Peer Review Guidance Questions (Section 4.3) – Has the State
• Ensured that the assessments provide an appropriate variety of accommodations . . . ?
• Ensured that the assessments provide an appropriate variety of linguistic accommodations . . . ?
• Taken steps to ensure fairness in the development of the assessments?
• Used accommodations &/or alternate assessments to yield meaningful scores?
Most Likely Sources of Unfairness (Standards for Educational & Psychological Testing)
• Items or tasks do not provide an equal opportunity for all students to demonstrate their knowledge & skills fully.
• The assessments are not administered in ways that ensure fairness.
• The results are not reported in ways that ensure fairness.
• The results are not interpreted or used in ways that lead to equitable treatment of those affected by the results.
Key Sources of Challenge to Fairness, Bias, & Accessibility
• Language loading
• Background knowledge requirements
• Cultural bias
• Other factors specific to certain students
FAIRNESS AND BIAS
• Fairness of assessment: for purpose; for group; for use
• Bias of items/results: for male/female; for English language learners; for racial or ethnic minorities; for students with disabilities
Fairness and Access
• Access through accommodations
– For students with disabilities
– For English language learners
– That yield meaningful scores
• Fairness in
– Item/task design
– Test administration
– Reporting results
– Interpretation of results
Fairness, Access & Bias in Items & Tasks
• Language loading: use of vocabulary not required by the content that impedes access to assessment items and tasks for students who lack proficiency in language, whether because they have another primary language, a learning disability, or delayed language development.
• Background knowledge requirements: design or content of assessment items or tasks that impedes access based on assumptions about what knowledge students bring to the assessment situation apart from instruction.
• Other: incorporation of other factors in assessment item and task design that increase the challenge of performing correctly in ways unrelated to the content being assessed.
• Cultural bias: inclusion of situations and contexts in assessment items and tasks that may be misinterpreted by students from different cultural backgrounds in ways that interfere with performance.
Comparability
• An aspect of reliability that allows for comparisons from year to year, student to student, school to school, form to form, etc.
• Methodologies (a minimal numeric sketch follows this slide)
– Linking: the relationship between test scores on two different tests that are not necessarily built to have the same content or level of difficulty
– Equating: a type of linking that provides the strongest possible linking relationship, rendering test scores across different forms of the same test interchangeable
– Scaling: a process for transforming raw scores to scores on another scale (e.g., SAT scores); Item Response Theory (IRT) is one example of a type of scaling
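As a minimal numeric sketch of the linking idea above, and not a procedure prescribed by the Handbook, the fragment below applies a simple mean-sigma linear transformation to place a Form X score on the Form Y scale. Operational equating and IRT scaling are considerably more involved, and all statistics here are hypothetical.

```python
# Hypothetical mean-sigma linking: express a Form X score on the Form Y scale.
form_x_mean, form_x_sd = 30.0, 6.0    # summary statistics for Form X (hypothetical)
form_y_mean, form_y_sd = 28.0, 5.0    # summary statistics for Form Y (hypothetical)

def to_form_y_scale(x_score: float) -> float:
    """Map a Form X raw score to its Form Y equivalent via a linear transformation."""
    return form_y_mean + (form_y_sd / form_x_sd) * (x_score - form_x_mean)

print(f"A Form X score of 36 corresponds to about {to_form_y_scale(36):.1f} on Form Y")
```

Whether such a transformation counts as full equating depends on how similar the forms are in content and difficulty and on the data collection design.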
Comparability: Peer Review Guidance Questions (Section 4.4)
• Has the State taken steps to ensure consistency of test forms over time?
• If the State administers both an online and a paper-and-pencil test, has the State documented the comparability of the electronic and paper forms?
Procedures: Test Administration, Scoring, Data Analysis, & Reporting
• Peer Review Guidance Questions (Section 4.5)
– Has the State established clear criteria for the administration, scoring, analysis, and reporting components of the assessment system, including all alternate assessments?
– Does the State have a system for monitoring and improving the on-going quality of its assessment system?
CONSISTENT TEST PROCEDURES
• Administer: uniform presentation; structured accommodations
• Score: consistently applied scoring rules
• Analyze: uniform data analysis
• Report: consistently applied reporting rules and templates
Administration, Scoring, Analysis & Reporting
• Procedures specified in policies and guidelines
• Implementation monitored and findings documented
• Security prescribed and enforced
• Improvements planned and implemented
Evaluating the TQ of Accommodations
• States must provide for the use of appropriate accommodations AND must have conducted studies confirming that scores from accommodated assessments can be meaningfully combined with the scores from non-accommodated administrations of the assessment.
TEST ACCOMMODATIONS are
• Changes in setting, schedule, timing, presentation, response, etc.
• Intended to equalize opportunity to perform
• Consistent with instructional experiences
• Consistently and securely administered
• Valid and reliable
Peer Review Guidance Questions (Section 4.6)
How has the State
• Ensured that appropriate accommodations are available to students with disabilities & students covered by Section 504 AND that these accommodations are used in a manner consistent with the student's IEP or 504 plan?
• Determined that scores for students with disabilities based on accommodated test administrations will allow for valid inferences about the student's knowledge & skills AND can be combined meaningfully with scores from non-accommodated administrations?
• Ensured that appropriate accommodations are available to limited English proficient (LEP) students . . . ?
• Determined that scores for LEP students based on accommodated administrations will allow for valid inferences . . . ?
Using the Handbook to Guide Your Assessment
Beth Cipoletti, Ed.D.
Assistant Director
Office of Assessment and Accountability
Anticipated
• Easy trip
– Have directions
– Have GPS
– Experience with route
– Know best places to stop
– Be in front of the storm
• Reading, PA by 6:30 p.m.
Reality
• Office meeting ran longer than expected
– Departure from Charleston later than planned
• Storm started earlier than forecast
– Snow in the mountains
– Rain to the east
• Limited visibility
• Poor road conditions
Changing Landscape
• New world of assessment and accountability systems
• Growth to standards adds a new dimension
• Valid systems are more complex
• Different tools to measure school and teacher effectiveness
Anticipated Work
• Stakeholder information
• Systems monitoring
• Validation of technical adequacy of assessments
• Peer Review
• Changes in federal requirements
Reality
• Work does not start on time
• Changes happen faster than anticipated
• Inaccurate vision of future
• More time is needed to finish than expected
Handbook (GPS)
• Resource on assessment and accountability
• Straightforward and plainspoken language
• Overview of relevant topics
• References and links to documents for more in-depth study
Using the Handbook
• Determine course of action
• Respond to specific questions
• Include sections and parts in responses to inquiries or newsletters
– May qualify, modify, or make state specific
• Incorporate PowerPoint slides into presentations
Next Steps
• Finalize revisions with T1-CAS
• CCSSO review and publication
• Handbook made available to T1-CAS members
• CCSSO makes the Handbook an official publication for distribution
• Possible webinar to acquaint states with the Handbook
For information on joining T1-CAS, please contact
• Joe Crawford
Program Associate, CCSSO
202-312-6436 (Office)
330-687-1185 (Mobile)
For information on activities of T1-CAS, please contact
• Wayne Neuburger
T1-CAS Program Director
503-390-8045
503-580-5779 (Mobile)