Benchmarks and performance measures in artificial intelligence

Anders Jonsson

Artificial Intelligence and Machine Learning Group

Universitat Pompeu Fabra

HUMAINT workshop

6 March 2018

Anders Jonsson

Benchmarks in AI

Sequential decision making

Perform a sequence of actions to achieve a given objective

Each decision has an impact on future decisions!

Sequential decision making

Reinforcement learning:

Effect of actions initially unknown

Intervention: perform actions to test hypotheses about them

Aim: in a given state s, estimate the value Q(s, a) of an action a and/or a policy π(·|s) for action selection
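As a minimal sketch of estimating Q(s, a) by intervention, the code below runs tabular Q-learning, one standard technique that the slides do not name. The 1-D chain environment, the hyperparameters, and the function names `q_learning` and `greedy_policy` are all assumptions made for illustration.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain of states 0..n_states-1.

    Actions: 0 = left, 1 = right. Reaching the last state ends the episode
    with reward 1; every other step gives reward 0.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy selection; ties are broken at random.
            if rng.random() < epsilon or Q[s][0] == Q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            # The agent only observes the outcome of the action it tried.
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            target = r if s_next == goal else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

def greedy_policy(Q):
    """Pick the highest-valued action in each state."""
    return [max(range(len(q)), key=q.__getitem__) for q in Q]
```

Calling `greedy_policy(q_learning())` gives one action index per state; the entry for the terminal state is never updated and carries no meaning.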

AI planning:

System has a model of the actions

Aim: compute a sequence of actions in advance
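In contrast, a planner can search its model before acting. The sketch below uses breadth-first search, which is only one of many planning techniques; the `successors` interface and the chain model are hypothetical and chosen for illustration.

```python
from collections import deque

def plan(start, goal, successors):
    """Breadth-first search for a shortest action sequence from start to goal.

    `successors(state)` plays the role of the action model: it yields
    (action, next_state) pairs. Returns a list of actions, or None if the
    goal is unreachable.
    """
    frontier = deque([start])
    parent = {start: None}  # state -> (previous state, action taken)
    while frontier:
        state = frontier.popleft()
        if state == goal:
            actions = []
            while parent[state] is not None:
                state, action = parent[state]
                actions.append(action)
            return actions[::-1]
        for action, nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = (state, action)
                frontier.append(nxt)
    return None

def chain_successors(state, n_states=5):
    """Hypothetical model: a chain of states 0..n_states-1."""
    if state > 0:
        yield ("left", state - 1)
    if state < n_states - 1:
        yield ("right", state + 1)
```

Here `plan(0, 4, chain_successors)` returns the full action sequence without executing a single action in the environment.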

Evaluation criteria

Theoretical analysis:

Performance bounds: how far from optimal is an AI algorithm?

Time complexity: how fast is it?

Memory complexity: how much memory does it use?

Empirical evaluation:

How does an AI algorithm perform in practice?

Empirical performance measures

What do we measure?

Winning

Scoring points

?

What do we compare to?

Optimal or near-optimal

Human

Other algorithms

?
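One possible way to make a comparison against a human or near-optimal reference concrete is to rescale raw scores between a weak baseline and the reference. The function below is a sketch of that idea; the name `normalized_score` and the choice of baseline (for example a random agent) are assumptions, not something the slides prescribe.

```python
def normalized_score(agent, baseline, reference):
    """Rescale a raw score so that baseline maps to 0.0 and reference to 1.0.

    `reference` could be a human score or a known (near-)optimal score;
    `baseline` could be the score of a random agent.
    """
    if reference == baseline:
        raise ValueError("reference and baseline scores must differ")
    return (agent - baseline) / (reference - baseline)
```

A value above 1.0 would mean the algorithm beat the reference on that instance, which says nothing yet about how it fares on other instances.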

Problems with empirical evaluation

Strong incentive to boost the performance of one's own algorithm

Practices in reinforcement learning [Henderson et al. 2017]:

Run X trials, report average of 3 best runs

Omit network architecture, random seeds, hyperparameters

Implement own version of other researchers’ algorithms

Claims of human-level performance
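To see why reporting the average of the 3 best runs is misleading, the sketch below simulates trials that are pure noise around a true score of 0 and compares the mean over all trials with the mean over the best ones. The Gaussian noise model, the trial counts, and the name `simulate_reporting` are assumptions chosen for illustration.

```python
import random
import statistics

def simulate_reporting(n_trials=10, n_best=3, n_repeats=2000, seed=0):
    """Compare two reporting practices when the true score is 0.

    Each trial is a draw from a standard normal distribution, so any
    positive reported score is an artifact of selecting the best runs.
    Returns (mean over all trials, mean over the n_best trials), each
    averaged over n_repeats simulated experiments.
    """
    rng = random.Random(seed)
    all_means, best_means = [], []
    for _ in range(n_repeats):
        scores = [rng.gauss(0.0, 1.0) for _ in range(n_trials)]
        all_means.append(statistics.mean(scores))
        best_means.append(statistics.mean(sorted(scores)[-n_best:]))
    return statistics.mean(all_means), statistics.mean(best_means)
```

Under these assumptions the first value stays close to 0 while the second is clearly positive, even though the simulated algorithm has no real effect.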

Benchmarks

Set of instances that are representative of problem difficulty

More unbiased comparison

Difficult to artificially boost the performance of an algorithm

More likely that results generalize

Benchmarks

Atari

OpenAI Gym

Project Malmo

Competitions

International Planning Competition

General Video Game AI Competition

AIIDE StarCraft AI Competition

Image classification

Problems with benchmarks

Might not accurately reflect real-world problems

Might require large amounts of computational power

Excessive focus on winning leads to algorithms that do not really advance the state of the art (e.g. portfolio algorithms)

AI in the real world

Clear and relevant performance criteria

Appropriate, publicly available benchmarks

Proper statistical comparisons

Independent verification and reproduction
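As one illustration of a proper statistical comparison, the sketch below computes a percentile bootstrap confidence interval for the difference in mean score between two algorithms, using every run rather than a selected few. The bootstrap is only one of several reasonable choices, and the name `bootstrap_diff_ci` and its default parameters are assumptions.

```python
import random
import statistics

def bootstrap_diff_ci(a, b, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for mean(a) - mean(b).

    `a` and `b` are lists of per-run scores (e.g. one per random seed)
    for two algorithms evaluated on the same benchmark.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each set of runs with replacement.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * n_boot)]
    upper = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper
```

If the interval contains 0, the runs do not support a claim that one algorithm outperforms the other; with only a handful of runs per algorithm the interval is itself rough and should be read with caution.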

Lifelong learning

System that operates for long periods of time

Task is not fixed but changes, some tasks initially unknown

New objects and actions become available over time

Questions
