VISUALISING ERRORS IN
ANIMAL PEDIGREE
GENOTYPE DATA
Martin Graham, Jessie Kennedy, Trevor Paterson & Andy Law
Edinburgh Napier University & The Roslin Institute, Univ of Edinburgh, UK
2 years ago at Firbush...
I said: “Aim is to develop interactive tools to locate and isolate errors in pedigree genotype
data in their datasets”
Where a
Pedigree = Family tree of related animals
Genotype = Genetic makeup of an organism
Humans have DNA
They in fact have 2 lots of DNA
(diploidy), which may or may not match at
certain points
Two lots of DNA bundled in a
chromosome
When two parents produce offspring, one lot of
DNA is passed onto the child from each parent
Which lot is used changes just to shuffle things up
a bit more
Inheritance Basics (Very)
By looking at many, many Single Nucleotide
Polymorphism (SNP) markers (points where we
know things vary between individuals at the
level of single DNA letters) we can check for
errors
If one letter from each parent at these points
turns up in the same place in the child’s DNA
everything is good
Inheritance Basics (Very)
A G A C A C
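As a rough sketch of that check for one marker (the function name and the tuple representation are illustrative assumptions, not anything taken from the tool itself): a child's genotype is fine if one of its two letters can come from dad and the other from mum.

```python
def is_consistent(dad, mum, child):
    """Return True if the child's two alleles can be explained as
    one allele inherited from each parent at this marker.

    Each genotype is a pair of single-letter alleles, e.g. ("A", "G").
    This representation is an assumption made for illustration.
    """
    a, b = child
    # Try both assignments: (a from dad, b from mum) or (b from dad, a from mum).
    return (a in dad and b in mum) or (b in dad and a in mum)


# Child A C with parents A G and A C: A can come from dad, C from mum.
ok = is_consistent(dad=("A", "G"), mum=("A", "C"), child=("A", "C"))
# Child C C with dad A G: dad has no C to give, so this is an error.
bad = is_consistent(dad=("A", "G"), mum=("C", "C"), child=("C", "C"))
print(ok, bad)
```

Run over every marker for every parent/parent/child trio, a check like this gives the raw error flags that everything else is built on.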
Errorz
But inevitably....
Errors creep in for various
reasons: bad record-keeping,
observations...
Muddled DNA
sampling, animals “jumping
the fence” etc.
Unusable data in this state
A G C C C C
A G C A G G
A G C A T A
Nothing inherited from mum
Nothing inherited from dad
Novel allele. No inheritance
from one parent, but we
can’t tell which...
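The three error types can be told apart with a small rule set. This is a minimal sketch under assumptions: the function name, the labels and the tuple genotypes are all made up for illustration, and because the reading order of the letter rows on the slide is not recoverable, the examples name each genotype explicitly.

```python
def classify(dad, mum, child):
    """Classify one marker for one dad/mum/child trio.

    Returns "ok", "dad" (nothing inherited from dad),
    "mum" (nothing inherited from mum) or "unknown"
    (novel allele, so the parent in error cannot be identified).
    """
    a, b = child
    # Consistent: one allele can come from each parent.
    if (a in dad and b in mum) or (b in dad and a in mum):
        return "ok"
    # Novel allele: a letter that neither parent carries.
    known = set(dad) | set(mum)
    if any(allele not in known for allele in child):
        return "unknown"
    # All letters are known but the trio is inconsistent, so exactly
    # one parent contributed nothing.
    if not any(allele in dad for allele in child):
        return "dad"
    return "mum"


print(classify(dad=("A", "G"), mum=("C", "C"), child=("C", "C")))  # dad
print(classify(dad=("A", "G"), mum=("C", "A"), child=("G", "G")))  # mum
print(classify(dad=("A", "G"), mum=("C", "A"), child=("T", "A")))  # unknown
```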
Thus
There is a constant need to clean up pedigree
data
Roslin had a tool that showed the data as a table
(markers by individuals), so pedigree-based
patterns of error, such as the wrong dad for an
entire set of offspring, were very hard to spot
So they wanted a new tool, with a funky
pedigree layout visualisation
Layouts
So (2 years ago) we looked at pedigree
layouts
And they were all rubbish
Layouts
Didn’t scale, relationships became intractable to follow, couldn’t
resolve generations, often only individual-out views rather than
whole pedigree etc
Layouts
So we developed what we called the sandwich
view. Between neighbouring generations, we
draw
Dads as the top slice of bread
Mums as the bottom slice of bread
Kids as the filling
Errors colour-coded across the marker set, more
errors = deeper colour
Layouts
Each family forms a block between the
respective mum and dad, making it easy to
see who is whose offspring/parents
The layout works because males mate with multiple
females in each generation but the opposite is
rare
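The grouping behind those family blocks might look something like this. It is only a sketch of the data structure, assuming pedigree records arrive as (child, dad, mum) triples; the names are hypothetical and nothing here does any actual drawing.

```python
from collections import defaultdict


def family_blocks(records):
    """Group children into family blocks keyed by dad, then by mum.

    `records` is an iterable of (child, dad, mum) triples for one pair
    of neighbouring generations (an assumed input format).
    """
    blocks = defaultdict(lambda: defaultdict(list))
    for child, dad, mum in records:
        blocks[dad][mum].append(child)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {dad: dict(mums) for dad, mums in blocks.items()}


records = [
    ("kid1", "dadA", "mum1"),
    ("kid2", "dadA", "mum1"),
    ("kid3", "dadA", "mum2"),
    ("kid4", "dadB", "mum3"),
]
blocks = family_blocks(records)
print(blocks)
```

Keying on dad first mirrors the point above: one dad usually spans several mums, so each dad's families can sit side by side under him.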
Layouts
Each child forms a glyph used to
show error
Divided into three parts
Up triangle coloured if error with dad
Down triangle coloured if error with
mum
Middle band coloured if error, but
parent in error is unknown (novel
allele)
Lo, pedigree-based error patterns
revealed themselves
Layouts
Tables full of data and histograms to show
error distribution by marker and individuals
also help
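The numbers behind such histograms are just counts. A minimal sketch, assuming the errors are available as (individual, marker) pairs; the identifiers are invented for the example.

```python
from collections import Counter


def error_histograms(errors):
    """Count errors per individual and per marker.

    `errors` is an iterable of (individual, marker) pairs, one pair per
    genotype point flagged as an error (an assumed input format).
    """
    errors = list(errors)
    by_individual = Counter(individual for individual, _ in errors)
    by_marker = Counter(marker for _, marker in errors)
    return by_individual, by_marker


errors = [("calf1", "snp1"), ("calf1", "snp2"), ("calf2", "snp1")]
by_individual, by_marker = error_histograms(errors)
print(by_individual.most_common())
print(by_marker.most_common())
```

An individual with a very high count hints at a pedigree or sampling problem; a marker with a very high count hints at a bad marker.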
Cleaning
So, we can show errors nicely
But the aim is to get rid of all these errors
Masking is when we pretend we don’t know
the values for particular markers / individuals /
combinations thereof
What happens then is that those values are
inferred from the corresponding values in the
parents
A G C C C C
? ? C C C C
A G G C
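One simple way to picture the inference step: once a genotype is masked, the candidates for it are every combination of one letter from each parent. The slides do not say how the tool actually infers values, so treat this as an assumption-laden sketch with made-up names.

```python
from itertools import product

MASKED = None  # stand-in for the "? ?" shown on the slide


def possible_genotypes(dad, mum):
    """All genotypes a child could have given its parents' genotypes.

    Each result is a sorted pair so that ("C", "A") and ("A", "C")
    count as the same genotype.
    """
    return {tuple(sorted(pair)) for pair in product(dad, mum)}


child = MASKED  # pretend we don't know the child's value at this marker
candidates = possible_genotypes(dad=("A", "G"), mum=("C", "C"))
print(candidates)
```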
Cleaning
The visualisation lets the biologist mask
individuals / bunches of markers / individual
genotype points / relationships
These are then shown in blue in the interface
Cleaning
Masking relationships is important as pedigree errors just propagate down the pedigree. A wrong parent for a child can’t be cured by hiding the child
It’s also why we can’t clean these data sets automatically: the biologist’s judgement about what causes an error is crucial
The Goal
Eventually we want a display with no nasty red
colours and then we can save it as a “clean”
data set
Though obviously with lots of missing data
But the biologists say their tools can handle
missing things, whereas wrong things blow them up
And we did have to stick in a final “auto clean up”
button to fix sporadic errors that would have taken
ages to fix manually
But the major systematic errors are fixed by the
biologist
User Test
We did a user test with 11 biologists at Roslin
They preferred the new tool to the table-like
tool
Probably the most interesting thing past the
numbers was once again how much a bunch
of scientists are in thrall to Excel
Just like the taxonomists we’ve worked with /
social scientists we’re writing a proposal with
Which is why the Roslin guys made a table-a-like
tool in the first place to try and appease them
Conclusion
Built successful tool (got it published in
EuroVis, BioVis and AVI)
Whether it’s successful from the biologists’
point of view...
During the project, marker set sizes jumped from
thousands to hundreds of thousands
Sequencing the data used to be the costly part of
the process, staff time to clean it up was relatively
cheap
Biology in general is having a data crisis; some
say it’s cheaper/easier to redo
experiments than to store the TBs of information
they produce in the first place
Conclusion
Available at www.viper-project.org
Did do JavaDocs this time
I enjoyed it