The Cure: A Game with the Purpose of Gene Selection for Breast Cancer Survival Prediction

DESCRIPTION

Keynote presentation for the Rocky Bioinformatics conference 2013. It's about http://genegames.org/cure/

Benjamin Good*, Salvatore Loguercio, Max Nanis, Andrew Su

The Scripps Research Institute

http://genegames.org/cure/

Rocky 2013

THE CURE: A GAME WITH THE PURPOSE OF GENE SELECTION FOR BREAST CANCER SURVIVAL PREDICTION

A QUESTION

How would you get 150 PhD level scientists to work together on the same problem?

Without any money?

TRAIL MAP

Games

Survival Prediction

The Cure

WHY GAMES?

It is estimated that 9 billion hours are spent playing Solitaire every year

Luis von Ahn. Google Tech Talk: Human Computation, 2006. (Shortly after receiving a $500,000 ‘Genius Grant’ for this work.)

Empire State Building: seven million hours of human labor

ONE YEAR OF SOLITAIRE = 1,285 EMPIRE STATE BUILDINGS

McGonigal J. Reality is broken : why games make us better and how they can change the world. New York: Penguin Press; 2011.

What if we could use a tiny fraction of that human effort to achieve another purpose?

[Bar chart: hours of human effort. Empire State Building: 7M; one year of solitaire: 9B; one year of games: 150B]

150 billion hours gaming each year

PURPOSES

Label all images on the Web

Find objects inside images

Teach computers English

Tag songs

Rate image quality

Computer science

Build ontologies

Tag Malaria parasites in blood smears

Map connections between neurons

Align DNA and protein sequences

Assemble genomes

Design RNA molecules

Figure out how proteins fold

Biology

Link genes with diseases

Develop better treatments for breast cancer

GAMES WITH A PURPOSE

The Cure

MOLT

TRAIL MAP

Games

Survival Prediction

The Cure

INFERRING SURVIVAL PREDICTORS

[Diagram: samples labeled by 10 year survival (Yes/No) -> find patterns -> make predictions on new samples]

van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415.6871 (2002): 530-536.

INFERRING SURVIVAL PREDICTORS

1) select genes

2) infer predictor from data (e.g. decision tree, SVM, etc.)

Out of the 25,000+ genes, which small set works together the best?
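The two steps above can be sketched on synthetic data. This is only an illustration, not the method used in the talk: the data are random numbers, the gene ranking (difference of class means) and the nearest-centroid predictor are stand-ins chosen to keep the example short, and every function name here is hypothetical.

```python
import random

def make_toy_data(n_samples=60, n_genes=50, informative=(0, 1, 2), seed=0):
    """Synthetic expression matrix; only the `informative` genes track the label."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # toy label: 1 = survived 10 years, 0 = did not
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in informative:
            row[g] += 2.0 * label  # shift informative genes for class 1
        X.append(row)
        y.append(label)
    return X, y

def class_mean(X, y, gene, label):
    values = [row[gene] for row, lab in zip(X, y) if lab == label]
    return sum(values) / len(values)

def select_genes(X, y, k):
    """Step 1: rank genes by absolute difference of class means, keep the top k."""
    def gap(g):
        return abs(class_mean(X, y, g, 1) - class_mean(X, y, g, 0))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def fit_centroids(X, y, genes):
    """Step 2: infer a predictor (one centroid per class over the selected genes)."""
    return {lab: [class_mean(X, y, g, lab) for g in genes] for lab in (0, 1)}

def predict(centroids, genes, row):
    """Assign the class whose centroid is closest in squared distance."""
    def dist(lab):
        return sum((row[g] - c) ** 2 for g, c in zip(genes, centroids[lab]))
    return min(centroids, key=dist)

X, y = make_toy_data()
train_X, train_y = X[:40], y[:40]
test_X, test_y = X[40:], y[40:]

genes = select_genes(train_X, train_y, k=3)
model = fit_centroids(train_X, train_y, genes)
hits = sum(predict(model, genes, r) == t for r, t in zip(test_X, test_y))
accuracy = hits / len(test_y)
print(genes, accuracy)
```

Swapping the predictor for a decision tree or SVM changes only step 2; the gene selection in step 1 is the part the game targets.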


PROBLEM: GENE SELECTION INSTABILITY

instability: different methods, different datasets produce different gene sets for the same phenotype [1]

[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).

PROBLEM: THE VALIDATION GAP

training data, test data

validation

validation: predictive signatures often perform worse on independent data created for validation.

Photograph by Richard Hallman, National Geographic Adventure Blog

ADDING PRIOR KNOWLEDGE TO THE DISCOVERY ALGORITHM

[Diagram: find patterns -> make predictions; classes: <10 yr survival, >10 yr survival]

EX.) NETWORK GUIDED FORESTS

Use network to find good gene combinations

Dutkowski & Ideker (2011). Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology.

BUT MOST KNOWLEDGE IS NOT STRUCTURED

[Chart: number of articles added to PubMed per year, 2000 to 2012; y-axis 500,000 to 1,000,000]

112 publications/hour (37 more by the end of this talk)

>160,000 publications linked to “breast cancer” since 2000 http://tinyurl.com/brsince2000

HOW CAN WE USE UNSTRUCTURED KNOWLEDGE FOR GENE SELECTION?

Need an intelligent system that is good at reading and hypothesizing

Like you

TRAIL MAP

Games

Survival Prediction

The Cure

THE CURE http://genegames.org/cure/

education level?

cancer knowledge?

biologist?

PLAY = GENE SELECTION

Alternate turns picking a gene from a “board” of 25

Your hand

Opponent's hand

SCORING

Cure Server

Score reflects accuracy of decision tree created with just the selected genes on real training data
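A rough sketch of how such a score could be computed: build a small decision tree using only the genes in a hand and report its accuracy on the training samples. The tree learner below is a deliberately tiny stand-in (greedy splits, fixed depth), not the Cure server's implementation, and the gene names and expression values are made up for illustration.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def misclassified(labels):
    """Errors made by predicting the majority label for this group."""
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_split(rows, labels, genes):
    """Return (errors, gene, threshold) with the fewest training errors, or None."""
    best = None
    for g in genes:
        values = sorted({row[g] for row in rows})
        for t in values[:-1]:  # split "<= t" vs "> t"; both sides stay non-empty
            left = [lab for row, lab in zip(rows, labels) if row[g] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[g] > t]
            errors = misclassified(left) + misclassified(right)
            if best is None or errors < best[0]:
                best = (errors, g, t)
    return best

def build_tree(rows, labels, genes, depth):
    """Leaves are labels; internal nodes are (gene, threshold, left, right)."""
    if depth == 0 or len(set(labels)) == 1:
        return majority(labels)
    split = best_split(rows, labels, genes)
    if split is None:
        return majority(labels)
    _, g, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[g] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[g] > t]
    return (g, t,
            build_tree([r for r, _ in left], [l for _, l in left], genes, depth - 1),
            build_tree([r for r, _ in right], [l for _, l in right], genes, depth - 1))

def classify(tree, row):
    while isinstance(tree, tuple):
        g, t, left, right = tree
        tree = left if row[g] <= t else right
    return tree

def score_hand(rows, labels, hand, depth=2):
    """Training accuracy of a tree built from only the genes in `hand`."""
    tree = build_tree(rows, labels, hand, depth)
    return sum(classify(tree, r) == l for r, l in zip(rows, labels)) / len(labels)

# Made-up training samples: GENE_A separates the classes, GENE_C does not.
rows = [
    {"GENE_A": 0.2, "GENE_B": 1.0, "GENE_C": 0.5},
    {"GENE_A": 0.4, "GENE_B": 3.0, "GENE_C": 0.9},
    {"GENE_A": 0.3, "GENE_B": 2.0, "GENE_C": 0.1},
    {"GENE_A": 1.5, "GENE_B": 1.5, "GENE_C": 0.6},
    {"GENE_A": 1.8, "GENE_B": 2.5, "GENE_C": 0.2},
    {"GENE_A": 1.2, "GENE_B": 3.5, "GENE_C": 0.8},
]
labels = ["no", "no", "no", "yes", "yes", "yes"]  # 10 year survival

print(score_hand(rows, labels, ["GENE_A"]))
print(score_hand(rows, labels, ["GENE_C"], depth=1))
```

Because the score is measured on the same data the tree was built from, a deep enough tree can score well on uninformative genes; that is one reason a score like this can reward lucky picks.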

PLAY WITH KNOWLEDGE: GENE ONTOLOGY

PLAY WITH KNOWLEDGE: GENE RIFS

YOU WIN!

COMMUNITY BOARD VIEW, CHOOSE OPEN BOARD

You beat this one

The community finished this board (e.g. 11 different players completed it)

This board is still open

BOARDS

• 25 genes each

• randomly selected from 1,250 genes that passed an unsupervised filter for minimum expression level and variance for a particular dataset [1],[2]

• 4 different 100 board rounds completed, each with some overlap

• 3731 distinct genes used in the game

[1] Curtis, Christina, et al. "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature (2012)

[2] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine (2013)
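Board construction as described above could look roughly like the following. It assumes each board is drawn independently from the filtered pool, so a gene can appear on more than one board; the pool identifiers are placeholders and `make_boards` is a hypothetical helper, not the game's code.

```python
import random

def make_boards(gene_pool, n_boards=100, board_size=25, seed=None):
    """Draw each board independently; genes are unique within a board only."""
    rng = random.Random(seed)
    return [rng.sample(gene_pool, board_size) for _ in range(n_boards)]

# Placeholder identifiers standing in for the 1,250 genes that passed the filter.
pool = [f"gene_{i:04d}" for i in range(1250)]

boards = make_boards(pool, seed=42)
distinct = {gene for board in boards for gene in board}
print(len(boards), len(boards[0]), len(distinct))
```

With 100 boards of 25 genes there are 2,500 slots for 1,250 genes, so under this assumption many genes necessarily land on several boards.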

PLAYERS

[Chart: new player registrations per month, Sep-12 to Aug-13, by degree (Other, Did not state, none, BA, MSc, MD, PhD); y-axis 0 to 250]

[Chart: %PhD among new registrations per month, Sep-12 to Aug-13; y-axis 0 to 0.4]

http://io9.com/these-cool-games-let-you-do-real-life-science-486173006

1,077 Players registered (one year)

Sage DREAM7 challenge, game announcement

PLAYER DEMOGRAPHICS

[Bar chart: Cancer knowledge? (no, ns, yes); y-axis 0 to 700]

[Bar chart: Are you a Biologist? (no, ns, yes); y-axis 0 to 800]

[Bar chart: Most recent degree (bachelors, masters, md, none, ns, other, phd); legend: graduate_degree, undergraduate, none; y-axis 0 to 350]

GAMES PLAYED

• 9,904 games (non-training)

[Chart: total games played per player; x-axis: player (0 to 800); y-axis: total games played (log scale, 1 to 1000); PhD players marked]

[Chart: games played, top 20 players; y-axis 0 to 800; players labeled by degree (PhD, MD, MS)]

GENE RANKINGS FROM GAMES

[Diagram: find patterns -> make predictions; classes: <10 yr survival, >10 yr survival]

GENE RANKINGS FROM GAMES

• For each gene:

1. O = number of times it appeared in a game (some genes occur on multiple boards, all boards are played multiple times, all occurrences are counted)

2. S = number of times it was selected by a player

3. F = S/O

• Games can be filtered based on player data

• We can estimate an empirical P value for each value of O, S

• P is the probability of observing S or more selections by chance, given O

Examples (all games):

• B-cell lymphoma 2 gene:

O = 13, S = 10, F = 10/13 = 0.77, P < 0.0001

• Alanine and arginine rich domain containing protein:

O = 33, S = 3, F = 3/33 = 0.09, P = 0.91
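The ranking statistics can be sketched in a few lines. F follows directly from the definitions above. For P, the talk uses an empirical estimate whose null model is not given here, so this sketch swaps in an exact binomial tail with an assumed per-appearance pick probability (`P_PICK_ASSUMED`, a made-up value); it will therefore not reproduce the P values on the slide exactly.

```python
from math import comb

# Assumption, not from the talk: chance that a gene is picked each time it
# appears in a game if players chose at random.
P_PICK_ASSUMED = 0.2

def selection_frequency(occurrences, selections):
    """F = S / O."""
    return selections / occurrences

def binomial_tail(occurrences, selections, p_pick=P_PICK_ASSUMED):
    """P(X >= S) for X ~ Binomial(O, p_pick): S or more picks by chance."""
    return sum(
        comb(occurrences, k) * p_pick ** k * (1 - p_pick) ** (occurrences - k)
        for k in range(selections, occurrences + 1)
    )

# The two example genes from the slide: (O, S)
examples = {
    "B-cell lymphoma 2": (13, 10),
    "Alanine and arginine rich domain containing protein": (33, 3),
}

for name, (o, s) in examples.items():
    f = selection_frequency(o, s)
    p = binomial_tail(o, s)
    print(f"{name}: F = {f:.2f}, binomial tail = {p:.2g}")
```

Filtering games by player data before counting O and S, as the slide describes, changes only the inputs to these two functions.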

GENES SELECTED BY ALL PLAYERS

9,904 GAMES, P<0.001, 60 GENES

[Table: top 10 enriched disease annotations, with n genes]

adj. P < 2.43e-06; background = 3731 genes used in any game

[Table: top 10 genes]

Wang, Jing, et al. "WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013." Nucleic acids research (2013).

GENES SELECTED BY PEOPLE: WITH PHDS, WITH KNOWLEDGE OF CANCER

2,373 GAMES, P<0.001, 82 GENES

[Table: top 10 genes]

[Table: top 10 enriched disease annotations, with n genes]

adj. P < 5.76e-08

“Expert Gene Set”

GENES SELECTED BY PEOPLE: WITHOUT PHDS, WITH NO KNOWLEDGE OF CANCER, THAT ARE NOT BIOLOGISTS

3,607 GAMES, P<0.001, 10 GENES

• Gene set not significantly enriched with any disease annotations

[Table: top 10 genes]

SELF REPORTING SEEMED TO WORK...

EVEN WITHOUT FILTERING, THE DATA CONTAINS THE KNOWLEDGE

• “All Players” still contained significant cancer signal.

PROBLEM: GENE SELECTION INSTABILITY

instability: different methods, different datasets produce different gene sets for the same phenotype [1]

[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).

GENE SET OVERLAPS, SOME BUT NOT MUCH

http://bioinformatics.psb.ugent.be/webtools/Venn/

“Expert Gene Set”
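One simple way to put a number on "some but not much" overlap is the shared-gene count and Jaccard index for each pair of gene sets. The sets below are made-up placeholders, not the signatures compared in the talk, and `pairwise_overlap` is a hypothetical helper.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_overlap(gene_sets):
    """Map each pair of set names to (shared gene count, Jaccard index)."""
    return {
        (n1, n2): (len(set(s1) & set(s2)), jaccard(s1, s2))
        for (n1, s1), (n2, s2) in combinations(gene_sets.items(), 2)
    }

# Placeholder gene sets for illustration only.
gene_sets = {
    "expert": {"g1", "g2", "g3", "g4"},
    "method_x": {"g3", "g4", "g5"},
    "method_y": {"g6", "g7"},
}

overlaps = pairwise_overlap(gene_sets)
for pair, (shared, score) in overlaps.items():
    print(pair, shared, round(score, 2))
```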

PROBLEM: THE VALIDATION GAP

training data, test data

validation

validation: predictive signatures often perform worse on independent data created for validation.

Photograph by Richard Hallman, National Geographic Adventure Blog

CLASSIFIER PERFORMANCE WITH DIFFERENT GENE GROUPS, DIFFERENT DATASETS

[Scatter plot: x-axis: test set performance, Griffith 2013 data; y-axis: test set performance, Metabric training, Oslo test; outcome: 10 year survival (Yes/No); “Expert Gene Set” highlighted]

The only difference between points is the genes used to build the SVM classifier.

SUMMARY

Plusses

• 1 year

• 1,000 players, 150 PhDs

• 10,000 games

• “expert knowledge” captured through an open game

• New gene ranking method with results competitive with established approaches

• Game is now in use in an undergraduate class

Minuses

• Did not make a significantly better breast cancer survival predictor

• Game could have been better in many ways

• no beginning, middle or end

• random guessing can win

• easy to cheat

NEXT STEPS

• More fun

• More learning for novices

• More control for experts

• More data

THE END

More information at: http://genegames.org/cure/

bgood@scripps.edu

@bgood

Thanks to:

Players!!!!

Andrew Su

Salvatore Loguercio

Max Nanis

Karthik Gangavarapu

We are hiring! Looking for postdocs and programmers interested in crowdsourcing and bioinformatics. Contact: asu@scripps.edu

Funding

GAMES WITH A PURPOSE

The Cure

MOLT

Loguercio, Salvatore, et al. "Dizeez: an online game for human gene-disease annotation." PloS One (2013)

Khatib, Firas, et al. "Algorithm discovery by protein folding game players." Proceedings of the National Academy of Sciences (2011)

of collecting expert level knowledge

HUMAN GUIDED FOREST (HGF)

http://i9606.blogspot.com/2012/04/human-guided-forests-hgf.html

Let CURE players build decision modules

WHY DID YOU SIGN UP? (83 RESPONSES)

[Bar chart: Why did you sign up for The Cure? (select all that apply). Options: To help breast cancer research; To learn something; To have fun playing a game; axis 0.0% to 90.0%]

WAS THE GAME FUN?

[Bar chart: Yes, it was very fun; A little bit entertaining; No, not at all; y-axis: percent, 0 to 0.8]

DO YOU KNOW ANYONE THAT HAS OR HAD BREAST CANCER?

[Chart: Have you known or do you currently know anyone that has or has had breast cancer? Yes / No]

DID YOU LEARN ANYTHING FROM PLAYING?

[Bar chart: Yes, I felt like I learned a lot; Yes, I learned a little bit; No, I did not learn anything; axis 0 to 60]

MY KNOWLEDGE OF BREAST CANCER IS:

[Bar chart: I am an expert in breast cancer; I have helped conduct cancer research as part of my job; I know some biology and have some understanding of what cancer is; I know a little biology, but nothing specific to cancer; Nothing, I do not know a thing about it; axis 0 to 0.6]

AGE?

Which category below includes your age?

17 or younger, 18-20, 21-29, 30-39, 40-49, 50-59, 60 and above

GENDER?

What is your gender?

Female, Male

TRAINING LEVELS

the decision tree created using the feature “makes milk” is 100% correct on training data, you win!

TRAINING INTERFACE

Choose the feature that best distinguishes mammals from other creatures

TRAINING INTERFACE

the decision tree created using the feature “has hair” is 94% correct on training data, you win!

OVERLAP OF SIGNIFICANT GENE SETS FROM DIFFERENT CURE GAME FILTERS

No Expertise (3,607 games)

PhD & Cancer Knowledge (2,373 games)

Biologist (4,913 games)

PhD or MD (3,070 games)

Cancer Knowledge (4,660 games)

MOST RANDOM GENE EXPRESSION SIGNATURES ARE SIGNIFICANTLY ASSOCIATED WITH BREAST CANCER OUTCOME

Venet et al.(2011). PLoS Comp. Bio.

• Still need to pick gene sets

• Feature selection challenge still relevant

• Very useful grain of salt in interpreting these results
