
Diagnostic Metrics, Part 1

Week 2 Video 2

Different Methods, Different Measures

Today we’ll focus on metrics for classifiers

Later this week we’ll discuss metrics for regressors

And metrics for other methods will be discussed later in the course

Metrics for Classifiers

Accuracy

Accuracy

One of the easiest measures of model goodness is accuracy

Also called agreement, when measuring inter-rater reliability

Accuracy = (# of agreements) / (total number of codes/assessments)
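As a minimal sketch (the function name and arguments are mine, not from the lecture), accuracy is just the fraction of matching labels:

def accuracy(labels, predictions):
    # Fraction of cases where the detector agrees with the ground-truth label
    agreements = sum(1 for y, p in zip(labels, predictions) if y == p)
    return agreements / len(labels)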

Accuracy

There is general agreement across fields that accuracy is not a good metric

Accuracy

Let’s say that my new Kindergarten Failure Detector achieves 92% accuracy

Good, right?

Non-even assignment to categories

Accuracy does poorly when there is non-even assignment to categories

Which is almost always the case

Imagine an extreme case: 92% of students pass Kindergarten

My detector always says PASS

Accuracy of 92%, but essentially no information
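Continuing the sketch above with a hypothetical class of 100 students, a detector that always predicts PASS reaches 92% accuracy while carrying essentially no information:

labels = ["PASS"] * 92 + ["FAIL"] * 8   # 92% of students pass
predictions = ["PASS"] * 100            # detector always says PASS
print(accuracy(labels, predictions))    # 0.92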

Kappa

Kappa = (Agreement – Expected Agreement) / (1 – Expected Agreement)
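A minimal sketch of this formula (the names are mine):

def kappa(agreement, expected_agreement):
    # How far observed agreement exceeds chance, relative to the room left above chance
    return (agreement - expected_agreement) / (1 - expected_agreement)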

Kappa

Computing Kappa (Simple 2x2 example)

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60

Computing Kappa (Simple 2x2 example)

• What is the percent agreement?

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60

Computing Kappa (Simple 2x2 example)

• What is the percent agreement?
• 80%

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60
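(That is, the diagonal of the table over the total: (20 + 60) / (20 + 5 + 15 + 60) = 80 / 100 = 0.80.)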

Computing Kappa (Simple 2x2 example)

• What is Data’s expected frequency for on-task?

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60

Computing Kappa (Simple 2x2 example)

• What is Data’s expected frequency for on-task?
• 75%

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60
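(That is, the Data On-Task row total over the grand total: (15 + 60) / 100 = 0.75.)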

Computing Kappa (Simple 2x2 example)

• What is Detector’s expected frequency for on-task?

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60

Computing Kappa (Simple 2x2 example)

• What is Detector’s expected frequency for on-task?
• 65%

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60
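(That is, the Detector On-Task column total over the grand total: (5 + 60) / 100 = 0.65.)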

Computing Kappa (Simple 2x2 example)

• What is the expected on-task agreement?

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60

Computing Kappa (Simple 2x2 example)

• What is the expected on-task agreement?
• 0.65 * 0.75 = 0.4875

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60

Computing Kappa (Simple 2x2 example)

• What is the expected on-task agreement?
• 0.65 * 0.75 = 0.4875

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60 (48.75)

Computing Kappa (Simple 2x2 example)

• What are Data and Detector’s expected frequencies for off-task behavior?

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60 (48.75)

Computing Kappa (Simple 2x2 example)

• What are Data and Detector’s expected frequencies for off-task behavior?
• 25% and 35%

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60 (48.75)
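(That is, Data’s Off-Task row total (20 + 5) / 100 = 0.25, and the Detector’s Off-Task column total (20 + 15) / 100 = 0.35.)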

Computing Kappa (Simple 2x2 example)

• What is the expected off-task agreement?

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60 (48.75)

Computing Kappa (Simple 2x2 example)

• What is the expected off-task agreement?
• 0.25 * 0.35 = 0.0875

                  Detector Off-Task    Detector On-Task
Data Off-Task             20                   5
Data On-Task              15                  60 (48.75)

Computing Kappa (Simple 2x2 example)

• What is the expected off-task agreement?
• 0.25 * 0.35 = 0.0875

                  Detector Off-Task    Detector On-Task
Data Off-Task          20 (8.75)               5
Data On-Task              15                  60 (48.75)

Computing Kappa (Simple 2x2 example)

• What is the total expected agreement?

                  Detector Off-Task    Detector On-Task
Data Off-Task          20 (8.75)               5
Data On-Task              15                  60 (48.75)

Computing Kappa (Simple 2x2 example)

• What is the total expected agreement?
• 0.4875 + 0.0875 = 0.575

                  Detector Off-Task    Detector On-Task
Data Off-Task          20 (8.75)               5
Data On-Task              15                  60 (48.75)

Computing Kappa (Simple 2x2 example)

• What is kappa?

                  Detector Off-Task    Detector On-Task
Data Off-Task          20 (8.75)               5
Data On-Task              15                  60 (48.75)

Computing Kappa (Simple 2x2 example)

• What is kappa?
• (0.8 – 0.575) / (1 – 0.575)
• 0.225 / 0.425
• 0.529

                  Detector Off-Task    Detector On-Task
Data Off-Task          20 (8.75)               5
Data On-Task              15                  60 (48.75)
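As a minimal sketch reproducing the whole worked example (table layout and variable names are mine):

# Rows = Data (ground truth), columns = Detector; category order: [Off-Task, On-Task]
table = [[20, 5],
         [15, 60]]
total = sum(sum(row) for row in table)              # 100

agreement = (table[0][0] + table[1][1]) / total     # (20 + 60) / 100 = 0.80

data_off = sum(table[0]) / total                    # 25 / 100 = 0.25
data_on  = sum(table[1]) / total                    # 75 / 100 = 0.75
det_off  = (table[0][0] + table[1][0]) / total      # 35 / 100 = 0.35
det_on   = (table[0][1] + table[1][1]) / total      # 65 / 100 = 0.65

expected = data_off * det_off + data_on * det_on    # 0.0875 + 0.4875 = 0.575
kappa = (agreement - expected) / (1 - expected)     # 0.225 / 0.425
print(round(kappa, 3))                              # 0.529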

So is that any good?

• What is kappa?
• (0.8 – 0.575) / (1 – 0.575)
• 0.225 / 0.425
• 0.529

                  Detector Off-Task    Detector On-Task
Data Off-Task          20 (8.75)               5
Data On-Task              15                  60 (48.75)

Interpreting Kappa

Kappa = 0: Agreement is at chance

Kappa = 1: Agreement is perfect

Kappa = negative infinity: Agreement is perfectly inverse

Kappa > 1: You messed up somewhere

Kappa < 0

This means your model is worse than chance

Very rare to see, unless you’re using cross-validation

Seen more commonly if you’re using cross-validation

It means your model is junk

0 < Kappa < 1

What’s a good Kappa?

There is no absolute standard

0 < Kappa < 1

For data-mined models, typically 0.3–0.5 is considered good enough to call the model better than chance and publishable

In affective computing, lower is still often OK

Why is there no standard?

Because Kappa is scaled by the proportion of each category

When one class is much more prevalent, expected agreement is higher than if classes are evenly balanced
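A small illustration (the marginal proportions below are made up for the example): if both the data and the detector put half the cases in each category, chance agreement is 0.50; with a 92/8 split it is already about 0.85, leaving much less room above chance.

balanced = 0.5 * 0.5 + 0.5 * 0.5        # expected agreement = 0.50
skewed   = 0.92 * 0.92 + 0.08 * 0.08    # expected agreement ≈ 0.853
print(balanced, round(skewed, 3))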

Because of this…

Comparing Kappa values between two data sets, in a principled fashion, is highly difficult

It is OK to compare Kappa values within a data set

A lot of work went into statistical methods for comparing Kappa values in the 1990s

No real consensus

Informally, you can compare two data sets if the proportions of each category are “similar”

Quiz

• What is kappa?

A: 0.645
B: 0.502
C: 0.700
D: 0.398

                  Detector Insult           Detector No Insult
                  during Collaboration      during Collaboration
Data Insult               16                          7
Data No Insult             8                         19

Quiz

• What is kappa?

A: 0.240
B: 0.947
C: 0.959
D: 0.007

                     Detector Academic      Detector No Academic
                     Suspension             Suspension
Data Suspension               1                       2
Data No Suspension            4                     141

Next lecture

ROC curves

A’

Precision

Recall
