Final Viper Prototype Presentation

VISUALISING ERRORS IN ANIMAL PEDIGREE GENOTYPE DATA

Martin Graham, Jessie Kennedy, Trevor Paterson & Andy Law

Edinburgh Napier University & The Roslin Institute, Univ of Edinburgh, UK

2 years ago at Firbush...

I said: “Aim is to develop interactive tools to locate and isolate errors in pedigree genotype data in their datasets”

Where:

Pedigree = Family tree of related animals

Genotype = Genetic makeup of an organism

Humans have DNA

They in fact have 2 lots of DNA (diploidy), which may or may not match at certain points

The two lots are bundled into pairs of chromosomes

When two parents produce offspring, one lot of DNA is passed on to the child from each parent

Which of a parent’s two lots is used changes, just to shuffle things up a bit more

Inheritance Basics (Very)

By looking at many, many Single Nucleotide Polymorphism (SNP) markers (points where we know things vary between individuals at the level of single DNA letters) we can check for errors

If one letter from each parent at these points turns up in the same place in the child’s DNA, everything is good

Inheritance Basics (Very)

A G A C A C

Errorz

But inevitably....

Errors creep in for various reasons: bad record-keeping, observations, muddled DNA sampling, animals “jumping the fence” etc.

The data is unusable in this state

A G C C C C

A G C A G G

A G C A T A

Nothing inherited from mum

Nothing inherited from dad

Novel allele. No inheritance from one parent, but we can’t tell which...
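The three error cases above can be sketched in a few lines of Python. This is only an illustration, not Viper's actual code: the function name `classify_trio`, the two-letter string genotypes and the return labels are all assumptions made for the example, as is which genotype belongs to which parent.

```python
def classify_trio(dad, mum, child):
    """Classify one marker for a dad/mum/child trio (hypothetical helper).

    Each genotype is a 2-character string of allele letters, e.g. "AG".
    Returns "ok", or which parent looks wrong: "dad", "mum" or "unknown".
    """
    a, b = child
    # Consistent if one allele can come from dad and the other from mum.
    if (a in dad and b in mum) or (b in dad and a in mum):
        return "ok"
    from_dad = a in dad or b in dad
    from_mum = a in mum or b in mum
    if from_mum and not from_dad:
        return "dad"      # nothing inherited from dad
    if from_dad and not from_mum:
        return "mum"      # nothing inherited from mum
    return "unknown"      # novel allele: can't tell which parent is wrong


print(classify_trio("AG", "CC", "CC"))  # dad
print(classify_trio("AG", "CA", "GG"))  # mum
print(classify_trio("AG", "CA", "TA"))  # unknown
```

Running a check like this over every marker for every child gives the per-individual and per-marker error counts that the rest of the tool visualises.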

Thus

There is a constant need to clean up pedigree data

Roslin have a tool that shows the data as a table (markers by individuals), in which pedigree-based patterns of error, such as the wrong dad for an entire set of offspring, are very hard to spot

So they wanted a new tool, with a funky pedigree visualisation layout

Layouts

So (2 years ago) we looked at pedigree

layouts

And they were all rubbish

Layouts

They didn’t scale, relationships became intractable to follow, generations couldn’t be resolved, and they often offered only individual-out views rather than the whole pedigree etc

Layouts

So we developed what we called the sandwich view. Between neighbouring generations, we draw:

Dads as the top slice of bread

Mums as the bottom slice of bread

Kids as the filling

Errors are colour-coded across the marker set: more errors = deeper colour
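As a rough sketch of “more errors = deeper colour”, an error count can be mapped onto a white-to-red scale. The linear scale, the hex colours and the name `error_colour` are assumptions for illustration; the slides do not say how Viper picks its shades.

```python
def error_colour(n_errors, n_markers):
    """Map an error count to a hex colour from white (none) to red (all).

    Hypothetical helper: a simple linear scale over the marker set.
    """
    frac = 0.0 if n_markers == 0 else min(n_errors / n_markers, 1.0)
    fade = round(255 * (1 - frac))  # green/blue channels fade out as errors rise
    return f"#ff{fade:02x}{fade:02x}"


print(error_colour(0, 100))    # #ffffff
print(error_colour(100, 100))  # #ff0000
```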

Layouts

Each family forms a block between the respective mum and dad, making it easy to see who is whose offspring/parents

The layout works because males mate with multiple females in each generation, but the opposite is rare
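The family blocks can be thought of as a simple grouping of children by their (dad, mum) pair. The sketch below assumes the pedigree arrives as (child, dad, mum) triples and uses a made-up function name, `family_blocks`; it shows the grouping only, not the drawing.

```python
from collections import defaultdict


def family_blocks(pedigree):
    """Group children into families keyed by (dad, mum).

    pedigree: iterable of (child, dad, mum) triples (assumed input format).
    """
    blocks = defaultdict(list)
    for child, dad, mum in pedigree:
        blocks[(dad, mum)].append(child)
    return dict(blocks)


pedigree = [
    ("kid1", "dad1", "mum1"),
    ("kid2", "dad1", "mum1"),
    ("kid3", "dad1", "mum2"),  # same dad, different mum: a separate block
]
print(family_blocks(pedigree))
```

Because one dad typically appears in several blocks while each mum appears in few, the blocks can sit side by side under a single dad.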

Layouts

Each child forms a glyph used to show error, divided into three parts:

Up triangle coloured if error with dad

Down triangle coloured if error with mum

Middle band coloured if error, but parent in error is unknown (novel allele)

Lo, pedigree-based error patterns revealed themselves
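The three-part glyph amounts to three error tallies per child. A minimal sketch, assuming each marker has already been labelled "ok", "dad", "mum" or "unknown" (labels and the name `glyph_counts` are invented for this example):

```python
from collections import Counter


def glyph_counts(marker_results):
    """Tally per-marker results into the three glyph parts for one child.

    marker_results: iterable of "ok", "dad", "mum" or "unknown" labels
    (an assumed encoding, one label per marker).
    """
    counts = Counter(marker_results)
    return {
        "up_triangle": counts["dad"],      # errors with dad
        "down_triangle": counts["mum"],    # errors with mum
        "middle_band": counts["unknown"],  # novel allele, parent unknown
    }


print(glyph_counts(["ok", "dad", "dad", "unknown"]))
```

A dad with the wrong offspring assigned to him would then show up as a whole row of kids with deeply coloured up triangles.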

Layouts

Tables full of data and histograms to show error distribution by marker and individual also help

Cleaning

So, we can show errors nicely

But the aim is to get rid of all these errors

Masking is when we pretend we don’t know the values for particular markers / individuals / combinations thereof

What happens then is that those values are inferred from the corresponding values in the parents

A G C C C C

? ? C C C C

A G G C
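Masking and inference can be sketched as below. This is an assumption-laden illustration: the dictionary layout, the `MASKED` sentinel and both function names are invented here, and the inference simply lists every genotype the parents could produce, which may well differ from what Viper does internally.

```python
from itertools import product

MASKED = None  # hypothetical sentinel for "pretend we don't know"


def mask(genotypes, individual, marker):
    """Return a copy of genotypes with one individual/marker value masked.

    genotypes: {individual: {marker: "AG"}} (assumed layout).
    """
    masked = {name: dict(markers) for name, markers in genotypes.items()}
    masked[individual][marker] = MASKED
    return masked


def possible_child_genotypes(dad, mum):
    """All genotypes a child could inherit: one allele from each parent."""
    if dad is MASKED or mum is MASKED:
        return None  # cannot infer from a masked parent in this sketch
    return {"".join(sorted(pair)) for pair in product(dad, mum)}


data = {"dad": {"m1": "AG"}, "mum": {"m1": "CC"}, "kid": {"m1": "TT"}}
cleaned = mask(data, "kid", "m1")
print(cleaned["kid"]["m1"])  # None
print(possible_child_genotypes(cleaned["dad"]["m1"], cleaned["mum"]["m1"]))
```

The original data is left untouched, so a mask can be undone if the biologist changes their mind.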

Cleaning

The visualisation lets the biologist mask individuals / bunches of markers / individual genotype points / relationships

These are then shown in blue in the interface

Cleaning

This last point’s important, as pedigree errors just propagate down the pedigree. A wrong parent for a child can’t be cured by hiding the child

It’s also why we can’t clean these data sets automatically: the biologist’s judgement of what causes an error is crucial

The Goal

Eventually we want a display with no nasty red colours and then we can save it as a “clean” data set

Though obviously with lots of missing data

The biologists say their tools can handle missing things, but wrong things blow them up

And we did have to stick in a final “auto clean up” button to fix sporadic errors that would have taken ages to fix manually

But the major systematic errors are fixed by the biologist

User Test

We did a user test with 11 biologists at Roslin

They preferred the new tool to the table-like tool

Probably the most interesting thing past the numbers was once again how much a bunch of scientists are in thrall to Excel

Just like the taxonomists we’ve worked with / social scientists we’re writing a proposal with

Which is why the Roslin guys made a table-a-like tool in the first place, to try and appease them

Conclusion

Built successful tool (got it published in EuroVis, BioVis and AVI)

Whether it’s successful from the biologists’ point of view...

During the project, marker set sizes jumped from thousands to hundreds of thousands

Sequencing the data used to be the costly part of the process; staff time to clean it up was relatively cheap

Biology in general is having a data crisis; some opinions say it’s cheaper/easier to redo experiments than store the TBs of information they produce in the first place

Conclusion

Available at www.viper-project.org

Did do JavaDocs this time

I enjoyed it

http://www.viper-project.org/


