Data Mining, Truth, Justice, the American Way, and the Flying Spaghetti Monster [email protected] Ph.D. LCSEE, WVU, 20 Sept 2007

Expose, and hose

• “Part of education is to expose people to different schools of thought.”
  - President George Bush, August 1, 2005

• “Part of science is to expose people to the critical and continual (re)evaluation of ideas.”
  - Some guy called Timm, September 20, 2007

“Look up in the sky! It's a bird! It's a plane! It's Superman!”

“Yes, it's Superman, strange visitor from another planet who came to Earth with powers and abilities far beyond those of mortal men.”

“Superman, who can change the course of mighty rivers, bend steel in his bare hands; and who, disguised as Clark Kent, mild-mannered reporter for a great metropolitan newspaper, fights a never ending battle for truth, justice, and the American way.”

• Why a never-ending battle?
• How to ensure justice?
• How to find truth?
• How to make lottsa $$?

So, tonight
• Notions of certainty
• Standards for debate
• Surprises
  • Nothing is “truth” but many more things are false
  • And some things are useful
• Implications for humility
  • And for justice

God gave me a brain. I take it (s)he wants me to use it.

Mark of the rational
• while not dead; do
    Review and revise assumptions;
  done
• Entertain a wide range of ideas
  • But don’t necessarily accept them
• Demand evidence that lets you repeat / refute / improve prior conclusions

But what of faith?
• That is another talk
• There is room for the divine in my universe
• But in my test tubes? Not too much

Data miners: agents that automate the creation and review of new ideas

Mountains of data:

@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

Tablespoons of knowledge:

outlook = sunny
|   humidity = high: no
|   humidity = normal: yes
outlook = overcast: yes
outlook = rainy
|   windy = TRUE: no
|   windy = FALSE: yes
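As a rough check on this slide (a sketch only, not the tool that produced the tree), the seven-line tree can be run back over the 14 weather rows. The helper names `load_rows`, `tree_predict` and `training_accuracy` are invented for this illustration.

```python
import csv
import io

# The 14 rows from the slide: outlook, temperature, humidity, windy, play.
WEATHER = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
NAMES = ["outlook", "temperature", "humidity", "windy", "play"]


def load_rows(text):
    """Parse comma-separated rows into dictionaries keyed by attribute name."""
    return [dict(zip(NAMES, row)) for row in csv.reader(io.StringIO(text)) if row]


def tree_predict(row):
    """Apply the decision tree shown on the slide to one row."""
    if row["outlook"] == "sunny":
        return "no" if row["humidity"] == "high" else "yes"
    if row["outlook"] == "overcast":
        return "yes"
    # Remaining case: outlook is rainy.
    return "no" if row["windy"] == "TRUE" else "yes"


def training_accuracy(rows):
    """Fraction of rows whose 'play' value the tree reproduces."""
    hits = sum(tree_predict(r) == r["play"] for r in rows)
    return hits / len(rows)


rows = load_rows(WEATHER)
print(len(rows), training_accuracy(rows))
```

The tree reproduces every row it was learned from. That is a statement about training data only, which is exactly the kind of evidence the later slides warn against trusting on its own.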

Data doubling every 20 months
• Internet, Radio Frequency Identification (RFID) tracking, on-line shopping (patterns of sales tracked at Amazon)

So now we can automatically learn answers to many questions; e.g.
• What eggs to select for IVF?
• What will software cost to develop?
• What diseases does a patient have?
• Which loan applications to fund?
• What houses will have the best resale value?
• Which parts of the program need more inspection?
• What products are best to sell to what markets?
• What cows to keep and which to send to the abattoir?
• How to teach a satellite to distinguish between cloud shadows and oil spills?
• How much electricity will be needed in two hours, i.e. what coal-powered generators to fire up?

More fundamentally, what can we say about the world, with any certainty?

Same data, different data miners: different conclusions

Every miner biased by
• Evaluation bias
• Language
  • What is the “shape” of the models we can learn?
  • Decision trees, equations, etc
• Search
  • Pruning the possibly infinite space of candidate models
  • What not to explore
• Over-fitting avoidance
  • How to stop the learner fixating on noise
  • E.g. pruning back decision trees
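To make "same data, different data miners, different conclusions" concrete, here is a small sketch that runs two deliberately simple learners over the weather rows from the earlier slide: a majority-class rule and a one-attribute rule (in the style of WEKA's ZeroR and OneR). These learners are chosen for illustration; the talk does not say which miners were compared, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Weather rows from the earlier slide; the last field is the class (play).
ROWS = [tuple(line.split(",")) for line in """
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".split()]
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]


def zero_r_errors(rows):
    """Language bias 1: always predict the majority class."""
    counts = Counter(r[-1] for r in rows)
    return len(rows) - counts.most_common(1)[0][1]


def one_r_errors(rows, col):
    """Language bias 2: one rule per value of a single attribute."""
    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[col]][r[-1]] += 1
    # Each value predicts its majority class; everything else is an error.
    return sum(sum(c.values()) - max(c.values()) for c in by_value.values())


baseline = zero_r_errors(ROWS)
errors = {name: one_r_errors(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
print(baseline, errors)
```

On these 14 rows the majority rule makes 5 training errors, the best one-attribute rules (outlook or humidity) make 4, and the tree on the earlier slide makes none. Each learner can only "see" theories of the shape its language allows.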

• Bias lets us ignore “stuff”.

• Without it, we don’t know what is important or dull, we can’t summarize, generalize.

• Without bias, we can’t learn from the past

• Bias blinds us but lets us see the future

• But changing biases changes what we best believe

• No wonder truth is a never-ending battle

Any learning scheme has many biases

Generalizing from the past works
• Sometimes, very clearly
  • Heavy smokers have a 2000% to 3000% higher chance of lung cancer
  • Learned theories perform very well on new data
• But ... the “best” learned theory can be a moveable feast.

So, a relativistic soup?

• No certainty?
• No way to plan effective actions?
• No way to rule out absurd notions?

I don’t want to offend anyone, but…

… I think that once …
• there were no cell phones or iPods, or clothes, or countries, or language, or human society, or 4-valved hearts, or homeostasis, or organs, or brains, or planets, or stars, or matter

Where the net energy in-flow is positive…
• the universe selects for self-perpetuating systems, an exponentially decreasing number of which are of exponentially increasing complexity

Should I even say this in a public place?
• “Part of education is to expose people to different schools of thought.”
  - President George Bush, August 1, 2005

Shouldn’t I have to give credence to all theories?
• Evolution, intelligent design
• Pirates cause global warming?

The Church of the Flying Spaghetti Monster (FSM)
• Founded in 2005 by OSU physics graduate Bobby Henderson
• A protest against the decision by the Kansas State Board of Education to require the teaching of intelligent design as an alternative to biological evolution.
• Henderson wrote to the board professing belief in a supernatural Creator called the Flying Spaghetti Monster
• He demanded that his "Pastafarian" theory of creation be taught in science classrooms.

FSM is not about religion
• It is a mistake to view FSM as anti-religion
• Rather, FSM is anti-anti-scientific rigor

No one in their right mind would ever believe this nonsense
• And that’s the point

Truth is a never-ending battle
• We must have standards to assess scientific theories, to reject absurdities
• Or any nonsense can be released on this world
• E.g. “Global warming is caused by pirates.”

Wikipedia on FSM
• FSM: an invisible, undetectable Flying Spaghetti Monster
• Evidence for evolution planted by FSM to test Pastafarians' faith
• FSM changes the results of measurements, like radiocarbon dating, via His Noodly Appendage.
• Heaven contains beer volcanoes and a stripper factory.
• Hell is similar, but with stale beer and diseased strippers.
• Pirates are "absolute divine beings" and the original Pastafarians.
  • Their image as "thieves and outcasts" is misinformation spread by Christian theologians in the Middle Ages and Hare Krishnas.
  • Pirates are "peace-loving explorers and spreaders of good will" who distributed candy to small children.
• Global warming, earthquakes, hurricanes, and other natural disasters are a direct effect of the shrinking numbers of pirates since the 1800s.

FSM “proof” of the divinity of pirates
• X-axis deliberately misleading.
• A case study on how not to present data

Crazy? Yes!
• But would you recognize such craziness if you saw it again?

What is the “best” weight-loss diet?

How lucky for those in power that people don't think.
  - Adolf Hitler

i.e. people trying to sell you their diet book

What is the “best” programming language?

To our peril, we trust old ideas too much

Columbia ice strike:
• Size: 1200 in³
• Speed: 477 mph (relative to vehicle)
• Certified as “safe” by the CRATER micro-meteorite model

A typical experiment in CRATER’s test database:
• Size: 3 in³ piece of debris
• Speed: under 150 mph.

Value of estrogen (NYT magazine, Sept 16, 2007)
• 1990s: American Heart Association recommends hormone replacement therapy for older women to ward off heart disease and osteoporosis.
• 2001: 15 million Americans filling H.R.T. prescriptions annually
• 2002: estrogen therapy exposed as a hazard, not a benefit, for health

Failure of scientific method
• Benefits of estrogen reported from large observational studies, not randomized trials
• Repeated epidemiological finding: randomized trials rarely support conclusions from observational studies.

So forget what you’ve read about
• Anti-oxidants like vitamins E & C & beta carotene preventing heart disease
• Fiber preventing colon cancer

So, why is FSM silly?
• And please, rest assured, it is very very silly stuff indeed.

Theories need an entrance exam
• Many possible theories, one for each bias
• Demand that a theory has passed at least some operational test before we condone it, act on it.
  • If no reason to accept the new, don’t
• Trust the most what has been challenged the most
  - Karl Popper

No things are “right”, but some things are “useful”
• Sure, one data set supports many theories.
• But there are many many more theories that are unsupported.
• No model is right, but some things are useful (perform well on test data)
  - George Box
• And many many many more ideas are useless
  • Can’t make predictions
  • Not defined enough to support (possible) refutation

Wolfgang Pauli
• The "conscience of physics", the critic to whom his colleagues were accountable.
• Scathing in his dismissal of poor theories, often labeling them ganz falsch, utterly false.
• But “ganz falsch” was not his most severe criticism
  • He hated theories so unclearly presented as to be untestable, unevaluatable
  • Worse than wrong because they could not be proven wrong.
  • Not properly belonging within the realm of science, even though posing as such.
• Famously, he wrote of one such unclear paper: “This paper is not right. It is not even wrong.”

Believe those who seek the truth; doubt those who find it
  - André Gide

Don’t test once on just the training data
• Study more than the average performance
• Also look at the variance
• E.g. here, no significant change on new data after X=8
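One way to act on this advice, sketched under assumptions: repeat a random train/test split many times and report the spread of the test scores as well as their mean. The learner here is a one-attribute rule on the weather rows from the earlier slide, picked only because it is tiny; the function names, the 20 repeats and the one-third test fraction are all choices made for this example, not the setup behind the slide's plot.

```python
import random
import statistics
from collections import Counter, defaultdict

ROWS = [tuple(line.split(",")) for line in """
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".split()]


def learn_one_r(train, col):
    """Learn a rule where each value of column `col` predicts its majority class."""
    default = Counter(r[-1] for r in train).most_common(1)[0][0]
    by_value = defaultdict(Counter)
    for r in train:
        by_value[r[col]][r[-1]] += 1
    rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    # Values never seen in training fall back to the overall majority class.
    return lambda row: rule.get(row[col], default)


def repeated_holdout(rows, col=0, repeats=20, test_fraction=1 / 3, seed=1):
    """Score the learner on `repeats` random splits; test rows are never trained on."""
    rng = random.Random(seed)
    n_test = max(1, round(len(rows) * test_fraction))
    scores = []
    for _ in range(repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        predict = learn_one_r(train, col)
        scores.append(sum(predict(r) == r[-1] for r in test) / len(test))
    return scores


scores = repeated_holdout(ROWS)
print("mean", statistics.mean(scores), "stdev", statistics.stdev(scores))
```

With only 14 rows the spread is large, which is the point: a single number from a single run says little about how the theory behaves on new data.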

If something works, poke it till it breaks
i) Sort attributes on “infogain”
ii) Learn using first N attributes

[Figure: one panel per data set: diabetes, labor, soybean, anneal]

A few variables are (often) enough
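Steps (i) and (ii) can be sketched as follows, using the small weather table from the earlier slide rather than the diabetes, labor, soybean or anneal data shown in the figure. Information gain is computed in the usual way (class entropy before the split minus the weighted entropy after it); `first_n` is a hypothetical helper that keeps only the top-ranked columns for whatever learner comes next.

```python
import math
from collections import Counter, defaultdict

ROWS = [tuple(line.split(",")) for line in """
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".split()]
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_gain(rows, col):
    """Class entropy before splitting on `col` minus weighted entropy after."""
    before = entropy([r[-1] for r in rows])
    groups = defaultdict(list)
    for r in rows:
        groups[r[col]].append(r[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


# (i) Sort attributes on infogain, best first.
gains = {name: info_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
ranking = sorted(ATTRIBUTES, key=gains.get, reverse=True)


def first_n(rows, n):
    """(ii) Keep only the n best-ranked attributes, plus the class column."""
    cols = [ATTRIBUTES.index(a) for a in ranking[:n]]
    return [tuple(r[c] for c in cols) + (r[-1],) for r in rows]


print(ranking)
print(first_n(ROWS, 2)[0])
```

Re-running a learner on `first_n(ROWS, n)` for n = 1, 2, 3, ... and plotting the score against n gives the kind of curve the slide's panels show.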

Living with uncertainty
• Check how training set size affects the theory

Living with uncertainty
• Launch learners with anomaly detection and repair tools

Living with uncertainty: count, alert, fix
• Count: stuff seen in past
  • An incremental discretizer + a Bayes classifier where all inputs are mono-classified
• Alert: if new counts different
  • Track average max likelihood for data processed in “eras” of X instances
• Fix: find delta new to old
  • Contrast set learning
• Very, very fast
  • Linear time inference, tiny memory footprint

And, it works [Orrego, 2004]
• F15 simulator data [courtesy B. Cukic]
• Five flights: a, b, c, d, e, each with a different off-nominal condition imposed at “time” 15
• Off-nominal condition not present in prior data
• In all cases, massive change detected
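The count / alert / fix loop can be sketched roughly as below. This is a guess at the general shape, not the system evaluated in [Orrego, 2004]: the class name `EraMonitor`, the add-one style smoothing, the `drop` threshold and the toy two-column stream are all assumptions made for illustration, inputs are taken to be already discretized, and the "fix" step only lists never-before-seen values instead of doing full contrast set learning.

```python
from collections import Counter


class EraMonitor:
    """Count values seen so far; alert when a new era looks unlike the past."""

    def __init__(self, era_size=10, drop=0.5):
        self.era_size = era_size  # the "X instances" that make up one era
        self.drop = drop          # assumed threshold: alert below drop * last era's score
        self.counts = Counter()   # (column, value) -> times seen in earlier eras
        self.seen = 0             # rows folded into the counts so far
        self.history = []         # average likelihood of each finished era
        self.delta = []           # values in the latest era that were never seen before
        self.buffer = []

    def likelihood(self, row):
        """Product of smoothed value frequencies (single class, naive-Bayes style)."""
        like = 1.0
        for col, val in enumerate(row):
            like *= (self.counts[(col, val)] + 1) / (self.seen + 2)
        return like

    def add(self, row):
        """Buffer one row; when an era fills, score it and return True on alert."""
        self.buffer.append(tuple(row))
        if len(self.buffer) < self.era_size:
            return False
        era, self.buffer = self.buffer, []
        # Alert: compare this era's average likelihood with the previous era's.
        avg = sum(self.likelihood(r) for r in era) / len(era)
        alert = bool(self.history) and avg < self.drop * self.history[-1]
        self.history.append(avg)
        # Fix: report what is new relative to the old counts.
        self.delta = sorted({(c, v) for r in era for c, v in enumerate(r)
                             if self.counts[(c, v)] == 0})
        # Count: fold the finished era into the running frequency table.
        for r in era:
            self.seen += 1
            for c, v in enumerate(r):
                self.counts[(c, v)] += 1
        return alert


monitor = EraMonitor(era_size=10)
# Made-up stream: three nominal eras, then one era of unseen values.
stream = [("low", "ok")] * 30 + [("high", "fail")] * 10
alerts = [monitor.add(row) for row in stream]
print(alerts.index(True), monitor.delta)
```

Each row is touched a constant number of times and only a table of counts is kept, which is where the linear time and small memory footprint on the slide would come from in a design like this.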

Living with uncertainty

Policy #1: exploration
• Tolerate the sub-optimal, a little
• Doing crazy things to learn new things

Policy #2: exploitation
• Fix your theories and base your work on those fixed ideas.

Human young:

• Do crazy things (take long trips)

• Less craziness as we grow older

Kuhn:

• most “science” is puzzle solving…

• … within existing paradigms.

• Sometimes the paradigm breaks down….

• …prompting revolutionary research

Life is a balance between exploration and exploitation
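The trade-off between the two policies can be illustrated with an epsilon-greedy chooser, a standard technique swapped in here for illustration (the talk does not name one). With probability `epsilon` it tries a random option; otherwise it exploits whatever currently looks best. The three payoffs are made-up numbers.

```python
import random


def epsilon_greedy(payoffs, epsilon, steps=1000, seed=1):
    """Return how often each option was tried under an epsilon-greedy policy."""
    rng = random.Random(seed)
    estimates = [0.0] * len(payoffs)  # current belief about each option
    pulls = [0] * len(payoffs)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: do something "crazy" and try a random option.
            choice = rng.randrange(len(payoffs))
        else:
            # Exploit: act on the current best belief (ties go to the first option).
            choice = max(range(len(payoffs)), key=estimates.__getitem__)
        pulls[choice] += 1
        # Update the running mean payoff of the chosen option.
        estimates[choice] += (payoffs[choice] - estimates[choice]) / pulls[choice]
    return pulls


payoffs = [0.2, 0.5, 0.9]  # made-up, fixed payoffs for three options
fixed = epsilon_greedy(payoffs, epsilon=0.0)    # pure exploitation
curious = epsilon_greedy(payoffs, epsilon=0.1)  # tolerate the sub-optimal, a little
print(fixed, curious)
```

With no exploration the chooser settles on the first option that pays anything and never learns that better ones exist; a little tolerated sub-optimality is what lets it find the best option.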

Tolerance of “exploration”
• Critical to the American way
• America: history of tolerance and acceptance

1945:
• 400 German rocket scientists choose to surrender to the Yankees, not the Russians
• They choose their post-war life based on their perceptions of American ideology

Hence,

Tolerance = hi-tech = $$$

R. Florida: The Economic Geography of Talent, Annals of the Association of American Geographers 92(4), 2002, pp. 743-755

Best predictor for hi-tech industry
• R² 0.42 to “coolness”
• R² 0.49 to cultural amenities
• R² 0.50 to median house value
• R² 0.77 to “diversity” index

Data Mining, Truth, Justice, the American Way & Flying Spaghetti Monsters

“Superman fights a never ending battle for truth, justice, and the American way.”

• Old conclusions must be constantly re-assessed
• No “truth”, all is biased.
• A healthy hi-tech needs tolerance to support exploration
• and that the FSM is silly, but would consider revising that view if new evidence emerges
• To make $$, institutionalize exploration and tolerance

Expose, and hose

• “Part of education is to expose people to different schools of thought.”
  - President George Bush, August 1, 2005

• “Part of science is to expose people to the critical and continual (re)evaluation of ideas.”
  - Some guy called Timm, September 20, 2007