Week 10 fraud copy


Think: Bing It On!

Compares Bing to Google

How would you design this?
Tell me:

Me?
And I’m guessing:

Hypothesis: Students in Toronto do not prefer one SE to another.

How?
100 Senecans will be surveyed by 10 paid surveyors.
Asked to compare two frames with fonts, colours and text sizes randomized.
Search terms Senecans choose.
Choose frame they like best: Google or Bing.
Results not revealed to participants.

Why?
Identify sample and population I’m trying to sample.
Removing my bias by asking surveyors.
Surveyors will not know how survey is designed.
“Double blind”

Why 100?
10 is too few.
1000 is too many.
For sufficiently large n, the distribution of $\hat{p}$ (the sample proportion) will be closely approximated by a normal distribution with the same mean and variance.[1] Using this approximation, it can be shown that around 95% of this distribution's probability lies within 2 standard deviations of the mean. Because of this, an interval of the form

$$\left(\hat{p} - 2\sqrt{\frac{0.25}{n}},\ \hat{p} + 2\sqrt{\frac{0.25}{n}}\right)$$

will form a 95% confidence interval for the true proportion. If this interval needs to be no more than W units wide, the equation

$$4\sqrt{\frac{0.25}{n}} = W$$

can be solved for n, yielding[2][3] $n = 4/W^2 = 1/B^2$, where B is the error bound on the estimate, i.e., the estimate is usually given as within ± B. So, for B = 10% one requires n = 100, for B = 5% one needs n = 400, for B = 3% the requirement approximates to n = 1000, while for B = 1% a sample size of n = 10000 is required. These numbers are quoted often in news reports of opinion polls and other sample surveys.

“Sample Size Determination”
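The rule of thumb quoted above, n = 1/B², is easy to check by hand or in a few lines of Python. This is only a sketch of that conservative formula; the function names are made up for illustration.

```python
import math


def sample_size_for_error_bound(error_bound: float) -> int:
    """Conservative sample size n = 1 / B^2 for a ~95% interval within +/- B."""
    if not 0 < error_bound < 1:
        raise ValueError("error_bound must be a proportion between 0 and 1")
    # round() before ceil() guards against float noise (1 / 0.1**2 is 99.999...).
    return math.ceil(round(1 / error_bound ** 2, 6))


def error_bound_for_sample_size(n: int) -> float:
    """Invert the same formula: B = 1 / sqrt(n)."""
    return 1 / math.sqrt(n)


for b in (0.10, 0.05, 0.03, 0.01):
    print(f"B = {b:.0%}: n = {sample_size_for_error_bound(b)}")
```

Note that B = 3% gives a little over 1100, which the quoted text rounds to "approximately 1000".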

Say that works
60 prefer Bing
40 prefer Google

What does that mean?

I have no idea!
Well, sort of.

60% (±10%, p=.05) prefer Bing to Google

You tell me, what does that mean?

Maybe nothing?
Maybe something?
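One way to put a number on "maybe nothing, maybe something" is to ask how surprising 60 out of 100 would be if Senecans really had no preference. The sketch below is one possible analysis, not the one the survey design prescribes: it uses an exact two-sided binomial test against a 50/50 split, plus the conservative two-standard-deviation interval from the sample size slide. The function names are hypothetical.

```python
from math import comb, sqrt


def two_sided_binomial_p(successes: int, n: int) -> float:
    """Exact two-sided p-value for H0: true proportion is 0.5.

    Under H0 the distribution is symmetric, so double the upper tail.
    """
    k = max(successes, n - successes)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


def conservative_interval(successes: int, n: int) -> tuple[float, float]:
    """Interval p_hat +/- 2 * sqrt(0.25 / n), as on the sample size slide."""
    p_hat = successes / n
    half_width = 2 * sqrt(0.25 / n)
    return p_hat - half_width, p_hat + half_width


p_value = two_sided_binomial_p(60, 100)
low, high = conservative_interval(60, 100)
print(f"p-value for 60 of 100 under no preference: {p_value:.4f}")
print(f"conservative ~95% interval: {low:.2f} to {high:.2f}")
```

The interval runs from about 50% to about 70%, so "no preference" sits right on its edge, which is roughly why the honest answer is "sort of".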

Look: that was as easy as it gets!

Population identification, sample size calculation, double blinding, within two standard deviations, after stripping CSS—all that before I do the statistics

Which I can’t understand!

Good methodology
● Design your experiment beforehand
● Run the experiment according to design
● Without peeking
– Or changing
● Collect all data
● Interpret all data
● Make all data available
● Analyze data according to good analysis principles.

Ducklings
You have no idea how to do this.
No idea.
Neither do I.

Questions
How many people do you need to survey?
How do you test them?
Double blind?
Blind?
What do you ask them?

You have to do this
● It’s too easy to fool yourself

Let’s review
Publish or perish?

Who perishes? And where do they publish?

Journals
What are the most prestigious journals in the world?
How do you know?

Impact factor
Nature
Proceedings of the National Academy of Sciences
Science
Physical Review Letters
Journal of the American Chemical Society
Physical Review B
Journal of Biological Chemistry
Applied Physics Letters
New England Journal of Medicine
Cell

(Eigenfactor.org data for 2011, most recent available)

Roughly
Number of in-citations
Number of out-citations

But?
Top-ranked are mostly medicine with some physics.
No computers in top 100.

Bioinformatics: 68

Get published
Or get fired.

Science, Nature, Cell, NEJM, JAMA

You get ‘tenure’—never fired, made for life.

Yoshitaka Fujii
● Japanese researcher in anaesthesiology
– Worked in Canada too
● Published 212 papers in 20 years (about one a month)
(Hmmmmm).

You’ll never guess
He made them up.

● 172 are demonstrably false.

As an aside:
● Retractions still need work:
– Of Fujii’s first ten articles on Google Scholar (GS):
● 4 were clearly retracted
● 1 was less clearly retracted
● 5 were not labelled as retracted

Jan Hendrik Schön
● Nano-physics genius!
– Won $100,000 as best young scientist
– Published, at his best, one paper every eight days
● Including in Science and Nature
– The very best journals in the world.

Now
● He has 10 friends on Facebook.
– I’m one!
Gave back his PhD.
Disappeared.

You’ll never guess
● He made all of his data up.

– [Movie time! 35:00]

So?
● What’s the problem?
● So they lied. Nobody died.
● (Well, probably. Fujii was a doctor.)

As I see it
● Money
– Millions of dollars
● Reputation
– Bell Labs, universities, colleagues, students

● Work: Reid Chesterfield spent 5 years trying to replicate Schön’s work.

Mohammad
His supervisor spent months trying to replicate Schön’s work.
(That’s hundreds of thousands of dollars)

Another kind
● Damages to the scientific enterprise:
– Science has to be open to catch cheaters
– But openness makes researchers look bad

Kinds of fraud
● Fabrication
● Falsification
● Other

Fraud
“Fabrication of data involves totally inventing a data set, falsification refers to manipulation of equipment or changing data such that the research is not accurately represented in the research report.” (Stroebe, Postmes and Spears)

Fabrication
● Pretty clear: you make up the data.

Falsification
● Changing or interpreting the data:

“There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.”

Outliers
● How do you deal with them?
– Bill Gates walks in the room
● Median and mean income?
(How) Do you eliminate that variation?
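To see what Bill Gates does to a room, here is a small sketch using Python's `statistics` module. The incomes are invented for illustration, including the one standing in for Gates.

```python
from statistics import mean, median

# Hypothetical annual incomes for five people in the room.
room = [40_000, 45_000, 50_000, 55_000, 60_000]

# A single extreme value walks in (the figure is made up for illustration).
room_with_gates = room + [1_000_000_000]

for label, incomes in (("before", room), ("after", room_with_gates)):
    print(f"{label}: mean = {mean(incomes):,.0f}, median = {median(incomes):,.0f}")
```

The median barely moves while the mean jumps by several orders of magnitude. Whether you report the mean, report the median, or drop the point as an "outlier" is a judgment call, and each choice tells a different story.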

Data picking

● Say you want to show that monkeys flip a coin to heads more often than humans. How do you do it?

● Not investigate. Show

● 1) Each flip 1 coin 100 times
● 2) Each flip 10 coins 10 times
● 3) Each flip 100 coins 1 time

Then...

● Re-design your experiment!

Then...

● Monkeys and humans each flipped 10 coins....

Aha.
● This is (abuse of) methodology

– And why I keep saying it matters!
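The coin-flipping trick can be simulated. The sketch below assumes monkeys and humans flip equally fair coins, so any "monkeys get more heads" result is pure chance. It compares an honest study (one design, fixed in advance) with one that tries each of the three designs and reports the first that comes out right. All names are hypothetical.

```python
import random

# The three designs from the slide: (coins, flips per coin). All total 100 flips.
DESIGNS = [(1, 100), (10, 10), (100, 1)]


def heads(rng: random.Random, coins: int, flips: int) -> int:
    """Count heads from fair coins; the species makes no difference."""
    return sum(rng.random() < 0.5 for _ in range(coins * flips))


def monkeys_win(rng: random.Random, coins: int, flips: int) -> bool:
    """True when the monkeys happen to get strictly more heads than the humans."""
    return heads(rng, coins, flips) > heads(rng, coins, flips)


def honest_study(rng: random.Random) -> bool:
    """Pick one design beforehand and live with the result."""
    coins, flips = DESIGNS[0]
    return monkeys_win(rng, coins, flips)


def design_shopping_study(rng: random.Random) -> bool:
    """Re-design until some design 'shows' the effect (any() stops at the first)."""
    return any(monkeys_win(rng, coins, flips) for coins, flips in DESIGNS)


def success_rate(study, trials: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(study(rng) for _ in range(trials)) / trials


print(f"honest study 'shows' the effect:    {success_rate(honest_study):.0%}")
print(f"design shopping 'shows' the effect: {success_rate(design_shopping_study):.0%}")
```

Nothing about the monkeys changed between the two studies; only the freedom to re-design did.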

Google Scholar vs MAS

What does that tell you?

Google Scholar vs MAS
● That GS has better searching than MAS
● Or that GS has worse searching than MAS!

Check this out

[Figure: scatter plot of Column A, with x from 0 to 60 and y from 0 to 1, and a fitted trend line labelled "Linear (Column A)"]

Clearly
A strong trend:
Decreasing over x
Despite what appear to be sinusoidal variations

One problem:
Made the data with random numbers
– And a few tricks
● No R value
● Lighten points
● Darken line
● Compress y for sharpness
● Regenerate data if necessary

Also
Choose line of best fit:
Linear? Moving average? Exponential? Log?
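The "regenerate data if necessary" trick is simple to reproduce. This sketch draws 60 uniform random numbers, fits a least-squares line, and keeps redrawing until the slope looks convincingly negative. The slope threshold and function names are arbitrary choices for illustration, not the values behind the chart.

```python
import random


def least_squares_fit(ys: list[float]) -> tuple[float, float]:
    """Return (slope, r_squared) for ys plotted against x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def cherry_picked_noise(n_points: int = 60, wanted_slope: float = -0.003,
                        seed: int = 0, max_tries: int = 10_000):
    """Redraw pure noise until its fitted slope is at least as steep as wanted."""
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        ys = [rng.random() for _ in range(n_points)]
        slope, r_squared = least_squares_fit(ys)
        if slope <= wanted_slope:
            return ys, slope, r_squared, attempt
    raise RuntimeError("no sufficiently 'convincing' data set found")


ys, slope, r_squared, attempt = cherry_picked_noise()
print(f"draw #{attempt}: slope = {slope:.4f}, R^2 = {r_squared:.3f}")
```

The R² printed here is the number the slide suggests leaving off the chart: it stays small even when the line looks persuasive.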

Of course
● That’s not nearly the only way!
– Repeat the whole experiment
– Blinding
– Survey design
– Outlier elimination

● And so on.

So: It’s easy
It’s so, so easy to cheat!

Let’s do it:

Google vs Bing
Say you wanted to show that Bing > Google.
How would you?

Population is, er, everyone!

Sample 1000 in Seattle

Sample young white men in Seattle

Redo sample!

Remove double blind

Remove single blind

10 in a row for Google? Outlier!

Choose best 100 of 1000 in Seattle

Repeat that ‘experiment’ to find the 20th out of 20.
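The last trick deserves some arithmetic. If there is truly no difference and each 'experiment' has a 5% chance of a false positive (an assumed significance level of p = .05), the chance that at least one of several independent repeats comes up 'significant' is 1 − 0.95^k. A minimal sketch, with a hypothetical function name:

```python
def chance_of_false_positive(repeats: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `repeats` independent null experiments
    crosses the significance threshold `alpha` purely by chance."""
    return 1 - (1 - alpha) ** repeats


for k in (1, 5, 20):
    print(f"{k:>2} repeats: {chance_of_false_positive(k):.0%} chance of a 'finding'")
```

With 20 repeats the odds of getting something publishable out of nothing are already better than even.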

Why?
● Career pressure
– Publish or perish
– Past glories
● Overconfidence
● Temptation because of irreproducibility

How do they get caught?
● Data that is too good
● Draw suspicion in publication
● Ratted out by underlings

Lessons:
● Don’t cheat well
● Don’t cheat much
