Interference in text categorization experiments Giorgio di Nunzio University of Padua Peter Bruza and Laurianne Sitbon Queensland University of Technology EU Marie Curie FP7 project “QONTEXT”


Interference in text categorization experiments

Giorgio di Nunzio

University of Padua

Peter Bruza and Laurianne Sitbon

Queensland University of Technology

EU Marie Curie FP7 project “QONTEXT”


© Jerome Busemeyer


The Experimental Conditions

• Condition 1: Make an action decision without reporting any categorization

• Condition 2: Make an action decision after categorizing a face

Categorization—Decision Experiment

© Jerome Busemeyer


© Jerome Busemeyer


Decision making under uncertainty => incompatible perspectives (subspaces)

Incompatible subspaces => violation of the law of total probability
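
As a reminder of what is being violated, the classical law and its quantum-style counterpart can be written side by side. This is a sketch for the two-outcome case; the events $A$, $\bar{A}$, $B$ and the phase angle $\theta$ are notation introduced here for illustration, not taken from the slides.

```latex
% Classical: law of total probability for B, conditioning on A or not-A
P(B) = P(A)\,P(B \mid A) + P(\bar{A})\,P(B \mid \bar{A})

% Quantum-style model: the same sum plus an interference term
P(B) = P(A)\,P(B \mid A) + P(\bar{A})\,P(B \mid \bar{A})
       + 2\sqrt{P(A)\,P(B \mid A)\,P(\bar{A})\,P(B \mid \bar{A})}\,\cos\theta
```

When $\cos\theta = 0$ the interference term vanishes and the classical law is recovered; any other value makes the two-step sum differ from the directly observed $P(B)$.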


The port of Philadelphia was closed when a Cypriot oil tanker “Seapride” ran aground after hitting a 200-foot tower supporting power lines across the river, a Coast Guard spokesman said. He said there was no oil spill but the ship is lodged on rocks opposite the Hope Creek nuclear power plant in New Jersey. He said the port would be closed until today when they hoped to refloat the ship on the high tide. After delivering oil to a refinery in Paulsboro, New Jersey, the ship apparently lost its steering and hit the power transmission line carrying power from the nuclear plant to the state of Delaware.

Is it about “crude oil”, about “shipping”, or about “shipping” BUT NOT “crude oil”?

PHILADELPHIA PORT CLOSED BY TANKER CRASH


Are topical subspaces incompatible in some cases?

… we conducted an experiment to find out…


Design: One vs. two step topical classification

d: doc

S: “shipping”

Ŝ: not “shipping”

C: “crude”

Ĉ: not “crude”
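
The comparison implied by this design can be sketched in a few lines. The sketch assumes that responses are available as simple counts per condition and that, in the two-step condition, “crude” is judged before “shipping”; the function names and the example counts are hypothetical and are not data from the experiment.

```python
def one_step_probability(n_yes, n_total):
    """P(S) estimated directly: subjects judge "shipping" with no prior question."""
    return n_yes / n_total


def two_step_probability(n_c, n_s_given_c, n_not_c, n_s_given_not_c):
    """P(S) reconstructed with the law of total probability.

    n_c:             subjects who first answered "crude" (C)
    n_s_given_c:     of those, how many then answered "shipping" (S)
    n_not_c:         subjects who first answered not "crude"
    n_s_given_not_c: of those, how many then answered "shipping" (S)
    """
    n = n_c + n_not_c
    p_c = n_c / n
    p_not_c = n_not_c / n
    # Guard against an empty branch (nobody chose C, or nobody chose not-C).
    p_s_given_c = n_s_given_c / n_c if n_c else 0.0
    p_s_given_not_c = n_s_given_not_c / n_not_c if n_not_c else 0.0
    return p_c * p_s_given_c + p_not_c * p_s_given_not_c


# Hypothetical counts for one document, about 10 workers per condition.
p_direct = one_step_probability(n_yes=8, n_total=10)
p_total = two_step_probability(n_c=6, n_s_given_c=3, n_not_c=4, n_s_given_not_c=3)
difference = p_direct - p_total  # a non-zero value suggests an effect of the first step
```

With only about ten judgments per document, a small difference may simply be sampling noise, so the value is better read as a descriptive statistic than as a test.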


Materials

Documents drawn from the Reuters-21578 collection

Set of Reuters newsfeeds (1988), manually classified by a group of experts (72 categories)


Subjects (crowdsourcing)

HIT: “Human Intelligence Task” on Amazon Mechanical Turk

The higher the number of HITs completed, the higher the expertise: “masters” have “demonstrated excellence”

Each document categorized by ~10 subjects (workers)

Quality check used to remove unreliable observations


Results

Differences in P(c2): the law of total probability is being violated

The original Reuters classification seems to fit a two-stage decision model. (In ML, categorization decisions are assumed to be made in isolation, i.e., a one-stage model.)

c1 = crude
c2 = shipping
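
One way to quantify such a difference is to solve the interference form of the law of total probability for the cosine of the phase angle. The following is a minimal sketch under that two-path assumption; the function name and the example probabilities are hypothetical and are not results from the experiment.

```python
import math


def interference_cosine(p_c2_one_step, p_c1, p_c2_given_c1, p_c2_given_not_c1):
    """Return (delta, cos_theta) for a two-path interference model.

    delta:     one-step P(c2) minus the law-of-total-probability value
    cos_theta: delta / (2 * sqrt(path_1 * path_2)), or None when either
               path has zero probability and the ratio is undefined
    """
    path_1 = p_c1 * p_c2_given_c1                # c1 then c2
    path_2 = (1.0 - p_c1) * p_c2_given_not_c1    # not-c1 then c2
    delta = p_c2_one_step - (path_1 + path_2)
    scale = 2.0 * math.sqrt(path_1 * path_2)
    if scale == 0.0:
        return delta, None
    return delta, delta / scale


# Hypothetical probabilities for c1 = crude, c2 = shipping.
delta, cos_theta = interference_cosine(0.8, 0.6, 0.5, 0.75)
```

A value of `cos_theta` outside [-1, 1] would indicate that this simple two-path model cannot account for the observed difference.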


“Closer examination of the results, however, shows that the kind of errors made are quite different. Human errors stem mainly from inconsistent application of categories, especially the categories with the vaguest definitions, and from failing to specify all the categories when several should have been assigned to a story.” (Hayes and Weinstein 1990)

Did each expert have in mind a specific order of the 72 categories? Maybe the same order in which they were given (alphabetical order? subject area?). If this is the case, they were actually performing a sort of n-step classification (n<=6).


Concluding remarks

No new theoretical insights

Some evidence for incompatibility in topical subspaces (à la Busemeyer, Wang & Lambert-Mogiliansky, 2009), but more extensive studies are needed


Is it really incompatibility, or just two probability spaces?

… not particularly surprising

“Kolmogorov realized that different sample spaces are needed for different experiments, but his theory does not provide a coherent principle for relating these separate experiments. This is exactly what quantum probability theory is designed to do” (Busemeyer & Bruza 2012).


Impact on Bayesian classifiers (machine learning)

We know the difficult categories by the performance of automated classifiers

Compute the interference term for such categories

Augment the Bayesian classifier with the interference term
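
The slides do not say how the augmentation would be carried out. One possible reading, sketched below, is to take the classifier's probability estimates for a difficult category pair and add an interference correction to the classical sum. The function name, the clipping step, and the example values are all assumptions made for illustration.

```python
import math


def interference_adjusted_probability(p_c1, p_c2_given_c1, p_c2_given_not_c1, cos_theta):
    """Classical total probability of c2 plus an interference correction.

    The inputs are assumed to come from a Bayesian classifier's estimates for
    one document; cos_theta is assumed to have been estimated beforehand for
    this category pair. The result is clipped to [0, 1] because an externally
    supplied cos_theta is not guaranteed to yield a valid probability.
    """
    path_1 = p_c1 * p_c2_given_c1
    path_2 = (1.0 - p_c1) * p_c2_given_not_c1
    classical = path_1 + path_2
    correction = 2.0 * math.sqrt(path_1 * path_2) * cos_theta
    return min(1.0, max(0.0, classical + correction))


# Hypothetical classifier estimates for a single document.
p_classical = interference_adjusted_probability(0.5, 0.4, 0.9, cos_theta=0.0)
p_adjusted = interference_adjusted_probability(0.5, 0.4, 0.9, cos_theta=-0.5)
```

Setting `cos_theta` to zero leaves the classifier unchanged, so the correction can be switched on only for the categories identified as difficult.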


Further work

(a) Incompatible subspaces in the Linda problem

(b) Incompatible subspaces in document relevance

(c) Incompatibility between dimensions of relevance (topicality vs sentiment)

We are trying to come up with models of users, based on the decisions they make about information