• Computational biologist
• Research group leader
• Advisor at
• 2015 Fellow of the
Who is @BorisAdryan
• Why a biologist is interested in large, unstructured data
• What is wrong with the IoT in its current state
• How biologists deal with similar problems
• Which academic concepts would be useful in the IoT
WHAT TO EXPECT IN THE NEXT HOUR… (including questions!)
• Why a biologist is interested in large, unstructured data
• What is wrong with the IoT in its current state
• How biologists deal with similar problems
• Which academic concepts would be useful in the IoT
WHAT TO EXPECT IN THE NEXT 10 MINUTES
DNA = storage of a blueprint
RNA = ‘active copy’ of DNA
protein = the building blocks of cells and tissues
LIFE AS WE KNOW IT
transcription
translation
Gregor Johann Mendel, exhibited in the Library at the NIMR
‣ Reading DNA information
‣ Determining “the sequence of a gene” was a PhD in the early 1980s
‣ Data processing was mainly transcribing the observation into a research paper
BIOLOGY THEN AND NOW: SEQUENCE INFORMATION
Sanger sequencing ca. 1980
http://www.eplantscience.com
189,739,230,107 base pairs on 15th April 2015 (up from 159,813,411,760 base pairs in April 2014)
‣ We can sequence a human genome in half a day
‣ Sequence databases grow faster than storage capacity
‣ Data processing is the key step in scientific understanding
BIOLOGY THEN AND NOW: SEQUENCE INFORMATION
1990: automation kilobases a day
2007: next-gen seq megabases a day
2015: 1000s of instruments world-wide
BIOLOGY THEN AND NOW: GENE ACTIVITY INFORMATION
‣ When are genes needed?
‣ Classical molecular biology workflow, taking days…
‣ Data is semi-quantitative; testing one gene at a time
Northern blot, ca. 1995
‣ High-throughput gene expression profiling since mid-1990s
‣ Quantitative information for every gene in an organism
‣ Key challenge is the graphical representation and interpretation of the data
screenshot from FlyBase, today
26 ATP
‣ Signal transduction and metabolic pathways
‣ Characterisation of proteins and substrates that mediate chemical reactions
‣ Nobel prize material
BIOLOGY THEN AND NOW: BIOCHEMISTRY
‣ We know about 250k metabolites
‣ 100k protein structures
‣ on the order of 10k different chemical reactions
BIOLOGY THEN AND NOW: BIOCHEMISTRY
“The Robot Scientist”
“small molecules”(Organic & Biomolecular Chemistry Blog)
protein(via the Protein Databank, www.pdb.org)
‣ Everything is connected
‣ Big, noisy, often unstructured data
‣ We are learning how biological entities depend on each other
DNA > RNA > proteins
• Why a biologist is interested in large, unstructured data
• What is wrong with the IoT in its current state
• How biologists deal with similar problems
• Which academic concepts would be useful in the IoT
WHAT TO EXPECT IN THE NEXT 5 MINUTES
‣ Everything is connected
‣ Big, noisy, often unstructured data
www.thingslearn.com
Analytics, context integration, machine learning and predictive modelling for the IoT.
0 clean shirts left
+ washing machine estimates 97% of your last pack of powder used
+ it’s Wednesday, 23:55
+ the last four Thursdays had a morning business meeting
+ the car is parked 20 m from a shop
+ last retail activity: 8 sec ago
Send immediate text reminder to pick up washing powder + send tweet from @BorisHouse
“need identified” + “notification appropriate”
Actionable insight. From everything.
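The reasoning on this slide can be sketched as a rule that combines independent context signals. This is only a minimal sketch: the `Context` record, the field names and every threshold (95% powder used, 50 m, 60 s) are assumptions made for illustration, not part of any real product or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Context:
    """Hypothetical snapshot of signals coming from unrelated things."""
    clean_shirts: int
    powder_used_fraction: float        # washing machine's estimate, 0.0 to 1.0
    business_meeting_expected: bool    # e.g. inferred from the last four Thursdays
    metres_to_shop: float
    seconds_since_retail_activity: float


def need_identified(ctx: Context) -> bool:
    # No clean shirt for an expected meeting, and the powder is nearly gone.
    return (ctx.clean_shirts == 0
            and ctx.powder_used_fraction >= 0.95
            and ctx.business_meeting_expected)


def notification_appropriate(ctx: Context) -> bool:
    # The user is close to a shop and was active very recently (assumed thresholds).
    return ctx.metres_to_shop <= 50 and ctx.seconds_since_retail_activity <= 60


def actions(ctx: Context) -> List[str]:
    # Act only when both conditions from the slide hold.
    if need_identified(ctx) and notification_appropriate(ctx):
        return ["text reminder: pick up washing powder", "tweet from @BorisHouse"]
    return []


slide_scenario = Context(0, 0.97, True, 20, 8)
print(actions(slide_scenario))
```

None of the individual signals is actionable on its own; the insight only appears once they are integrated.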
NO ANALYTICAL FLEXIBILITY IN M2M/IOT
Matt Hatton, Machina Research, The BLN IoT ’14
Internet replaces wire
It’s all about the context
M2M
consumer
IoT
defined I-P-O like it’s 1975
context
Is this hot?
LIFE SCIENCE STRATEGIES DON’T WORK IN THE IOT
- There is no commonly accepted
  - ‘catalogue’ of things,
  - ‘ontology’ of things,
  - ‘data format’ of things,
  - ‘meta data’ for things.
- Most businesses are driven by revenue, not long-term strategic vision
- Service providers have no need to publish
- Data can be highly personal (cheap excuse)
unless they’re
Trojan Room coffee pot - ca. 1993
Oct. 1995
“The Internet of Things”, Kevin Ashton, ca. 1999
20 YEARS OF NON-CONVERGENT EVOLUTION
FIRST DATA | POTENTIAL RECOGNISED | TODAY’S REALITY
“ignorant coexistence”
➡ Commonly accepted platforms and formats for data exchange
➡ Meta-data deposition is a must
➡ Infrastructure provides entry point for computational knowledge inference
“designed to ask questions”
• Why a biologist is interested in large, unstructured data
• What is wrong with the IoT in its current state
• How biologists deal with similar problems
• Which academic concepts would be useful in the IoT
WHAT TO EXPECT IN THE NEXT 10 MINUTES
Oct. 1995
TOWARDS MIAMI STANDARD AND DATA REPOSITORIES
cf. IoT Nov. 1993
MInimal Annotation for MIcroarray Info
META DATA, SHARING AND DATA REPOSITORIES
founded in Nov. 1999
“But this is a complex and ambitious project, and is one of the biggest challenges that bioinformatics has yet faced. Major difficulties stem from the detail required to describe the conditions of an experiment, and the relative and imprecise nature of measurements of expression levels. The potentially huge volume of data only adds to these difficulties.”
Nature, Feb. 2000
Nov. 2000 Oct. 2002
Wide adoption as requirement for publication in scientific journals
META DATA, SHARING AND DATA REPOSITORIES
cf. IoT 2014
since 2003
http://en.wikipedia.org/wiki/Silo
THE LIFE SCIENCES FIXED THEIR KNOWLEDGE REPRESENTATION PROBLEM
FORMALISING KNOWLEDGE
FORMALISING KNOWLEDGE WITH GENE ONTOLOGY
CURRENT GOVERNMENT INVESTMENTS INTO GENE ONTOLOGY
NIH alone has spent $44,616,906 on the ontology structure since 2001 (I don’t have data for UK/EU spending)
~100 full-time salaries for experts with domain-specific knowledge
~40,000 terms
story
measurements + meta data
open, public repositories
human curators
ontology terms
community
PUBLISH OR PERISH
ok?
journal
informal exchange - no credit!
funders
assessment
The majority of this infrastructure is paid for by governments and charities
industry!
OUR PROBLEM IS KNOWLEDGE
DATA != INSIGHT
WITHOUT ORGANISING IT
• Why a biologist is interested in large, unstructured data
• What is wrong with the IoT in its current state
• How biologists deal with similar problems
• Which academic concepts would be useful in the IoT
WHAT TO EXPECT IN THE NEXT 10 MINUTES
measurements + meta data
storage & provenance
human curators
ontology terms
user
PUBLISH OR YOU’RE NOT DOING IOT
ok?
Maybe the majority of this infrastructure should be paid for by governments?
company cloud
device registration
privileges
data
added value
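As a rough illustration of “measurements + meta data” with “storage & provenance”, a single published reading could look like the record below. This is a sketch under assumptions: the field names, the device identifier and the unit string are made up for the example and do not follow any existing IoT standard.

```python
import json
from datetime import datetime, timezone


def make_record(value, unit, device_id, ontology_terms, source):
    """Wrap a raw measurement with the 'data about the data' needed to reuse it."""
    return {
        "measurement": {"value": value, "unit": unit},
        "meta_data": {
            "device_id": device_id,                    # hypothetical identifier
            "ontology_terms": sorted(ontology_terms),  # terms from a shared vocabulary
        },
        "provenance": {
            "source": source,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        },
    }


record = make_record(
    21.5, "degC", "thermostat-001",
    {"measures temperature", "living room"},
    "company cloud",
)
print(json.dumps(record, indent=2))
```

Without the meta data and provenance blocks, the bare value 21.5 could not be interpreted by anyone outside the company cloud that produced it.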
WHAT IS AN ONTOLOGY?
used to establish conceptual connection between entities
knowledge inference
finger
ontology structure
- body part
- limb
- arm
- hand
- thumb
- finger
ontology rules
‣ controlled vocabulary
‣ clearly defined relationships
is a
is a
connects to
part of
With ontological reasoning, a computer can infer that “finger is a body part”, although we haven’t explicitly defined it that way.
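The inference step can be sketched in a few lines. The slide lists only the terms and the relation types, so the specific edges below are assumptions, as is the simplifying rule that anything which is a, or is part of, a body part counts as a body part.

```python
# Hypothetical edges: (child, relation, parent). Only the terms and the
# relation names come from the slide; the wiring is assumed for illustration.
EDGES = {
    ("thumb", "is_a", "finger"),
    ("finger", "part_of", "hand"),
    ("hand", "part_of", "arm"),
    ("arm", "is_a", "limb"),
    ("limb", "is_a", "body part"),
}


def ancestors(term, relations=("is_a", "part_of")):
    """Return every term reachable from `term` via the given relation types."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, relation, parent in EDGES:
            if child == current and relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def is_body_part(term):
    # Assumed rule: being a, or being part of, a body part is enough.
    return "body part" in ancestors(term)


print(is_body_part("finger"))
```

No edge states “finger is a body part” directly; the answer follows from walking the relationships, which is what makes a controlled vocabulary with defined relationships useful for computation.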
ARE PEOPLE NOT ALREADY USING ONTOLOGIES IN THE IOT?
Semantic Sensor Network Ontology
“thermostat”
The idea is not new! Cf. extension of the semantic web with the Semantic Sensor Network.
‣ catalogs
‣ conventions
http://www.w3.org/2005/Incubator/ssn/ssnx/ssn
ONTOLOGIES HAVE TO BE PRAGMATIC COMPROMISES
Gene Ontology annotation
15 years of research
47 publications
100+ authors
50+ PhDs
15 direct annotations
~150 inferred annotations
THE THREE BRANCHES OF
Adapted from Anurag et al., Mol. BioSyst., 2012, 8, 346-352
Localization: Where is an entity acting?
Function: What does the entity do?
Process: When is the entity needed?
inferences on “is a”
“part of”
“regulates”
“has part”
from geneontology.org from Ashburner et al., Nat Genet. 2000, 25(1):25-9.
GO AND CONTEXT
THE BRANCHES OF GO AND THE IOT
Localization: inside, (my?) home, living room
Function: measures temperature, regulates temperature, interacts with user directly, interacts with user via app
Process: regulation of temperature, measurement of ambient temperature
‘is proxy / is avatar’ for: presence, fire, ice age
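One way to picture these three branches applied to things is a small annotation record per device. This is a sketch only: the `ThingAnnotation` class, the `things_with` helper and the second device are hypothetical, and the terms are free text here rather than entries in a curated ontology.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ThingAnnotation:
    """Hypothetical record with one set of terms per GO-style branch."""
    name: str
    localization: Set[str] = field(default_factory=set)  # where is the thing?
    function: Set[str] = field(default_factory=set)      # what does it do?
    process: Set[str] = field(default_factory=set)       # what can it be a proxy for?


THINGS = [
    ThingAnnotation(
        name="thermostat",
        localization={"inside", "home", "living room"},
        function={"measures temperature", "regulates temperature",
                  "interacts with user via app"},
        process={"regulation of temperature", "proxy for presence"},
    ),
    ThingAnnotation(
        name="smoke detector",  # invented second thing, for contrast
        localization={"inside", "home"},
        function={"measures temperature"},
        process={"proxy for fire"},
    ),
]


def things_with(branch: str, term: str) -> List[str]:
    """Names of all things annotated with `term` in the given branch."""
    return [thing.name for thing in THINGS if term in getattr(thing, branch)]


print(things_with("function", "measures temperature"))
```

The query asks what a thing does, not what it is called, so very different devices can answer the same question.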
A LAST WORD ON PRAGMATISM
“perfect” ontology
The SSN Ontology allows for inference entirely on the basis of its structure and annotation. In reality, many parameters are difficult to establish and the effort to annotate things outweighs the utility.
“crude” ontology
A simplified structure allows for quick annotation even by non-specialists. The lack of details can lead to clashes in the ontology => more smartness has to go into software; more coding effort.
1 billion different things
1 million use cases
0 clean shirts left
+ washing machine estimates 97% of your last pack of powder used
+ it’s Wednesday, 23:55
+ the last four Thursdays had a morning business meeting
+ the car is parked 20 m from a shop
+ last retail activity: 8 sec ago
Send immediate text reminder to pick up washing powder + send tweet from @BorisHouse
“need identified” + “notification appropriate”
Actionable insight. From everything.
“not home”
“buying”
credit card: “highly personal device” ~ alive and awake
3% left and
not pressed
“indicator of esteem”
Today’s biology is a quantitative, data-rich science.
Infrastructure for ‘big data’ was driven by academics.
Data is only useful if it can be turned into knowledge.
Understanding of data requires ‘data about the data’.
Meta-data should be in a universally understood format.
Ontologies provide context.
Gene Ontology (GO) is a de facto standard.
Human curation is key to GO.
Public funders and industry contribute significantly to GO.
Should governments be involved in IoT?
GO is not ‘one size fits all’, but it has a few useful concepts.
What does the thing do? Thing function.
For what can the thing be an avatar? Thing process.
Where is the thing? Thing localization.
@BorisAdryan