20141112 courtot big_datasemwebontologies


Guest lecture (MBB342) at Simon Fraser University on Big data, Semantic Web and ontologies


Big data, Semantic Web and Ontologies

Mélanie Courtot, PhD
Nov 12th 2014

mcourtot@sfu.ca

About me

Overview

• Big Data
  – Big Data is BIG
  – Issues in research
• Semantic Web
  – Standards: URIs, RDF, SPARQL, OWL
  – Linked data
• Ontologies
  – Definition and reasoning
  – OBO Foundry
  – Example of existing ontologies
  – Pharmacovigilance
  – Publishing ontologies on the Semantic Web
• IRIDA
  – The IRIDA platform
  – Adding standards to IRIDA
• Take home message


Big data

Big data is data that is too large and complex for conventional data tools to process.

[Two image slides, labelled 2005 and 2013]

What is a Zettabyte?

1 zettabyte
= 1,000 exabytes
= 1,000,000 petabytes
= 1,000,000,000 terabytes
= 1,000,000,000,000 gigabytes
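The unit ladder above can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal (SI) units, where each unit is 1,000 times the previous one; the `UNITS` list and the `convert` helper are illustrative names, not part of the lecture.

```python
# Decimal (SI) storage units; each one is 1,000 times larger than the previous.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]


def convert(value, from_unit, to_unit):
    """Convert a quantity between two SI storage units (assumes decimal units)."""
    exponent = 3 * (UNITS.index(from_unit) - UNITS.index(to_unit))
    if exponent >= 0:
        return value * 10 ** exponent
    return value / 10 ** (-exponent)


if __name__ == "__main__":
    for unit in ["exabyte", "petabyte", "terabyte", "gigabyte"]:
        print(f"1 zettabyte = {convert(1, 'zettabyte', unit):,} {unit}s")
```

Going down the ladder multiplies by 1,000 per step, so the gigabyte row is the only one that needs twelve zeros.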

How big is big?

• Facebook: 25 terabytes of logged data per day; Google (2008): 20 petabytes per day
• Over 90% of all the data in the world was created in the past 2 years [1]
• Today: 3.2 zettabytes. 2020: 40 zettabytes [2]
• Good news: jobs! [3]

1. http://www-01.ibm.com/software/data/bigdata/
2. http://barnraisersllc.com/2012/12/38-big-facts-big-data-companies/
3. http://www.webopedia.com/quick_ref/important-big-data-facts-for-it-professionals.html

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

Issues with research data (1): data availability

http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416

Issues with research data (2): data reproducibility

http://www.firstwordpharma.com/node/931605#axzz3IalL2lzU


A solution: the Semantic Web

"The Semantic Web is an ... extension of the current web in which ... information is given well-defined meaning, ... better enabling computers and people to work in cooperation."
The Semantic Web, Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American, May 2001

http://www.scientificamerican.com/article/the-semantic-web/

The Semantic Web in a nutshell

Adds to Web standards and practices (currently only for documents and services), encouraging:
• Unambiguous names for things, classes, and relationships
• Well organized and documented in ontologies
• With data expressed using uniform knowledge representation languages (e.g. OWL)
• To enable computationally assisted exploitation of information
• That can be easily integrated from different sources

Some Semantic Web successes

• In February 2011, the Watson system by IBM made international headlines for beating the best humans in the quiz show Jeopardy!
• A significant number of very prominent websites are powered by Semantic Web technologies, including the New York Times, Thomson Reuters, BBC, and Google's Freebase.
• The Speech Interpretation and Recognition Interface (Siri), launched by Apple in 2011 as an intelligent personal assistant for the new generation of iPhone smartphones, draws heavily from work on ontologies, knowledge representation, and reasoning.

http://130.108.5.60/faculty/pascal/pub/crc-handbook-13.pdf

Uniform Resource Identifiers (URIs)

• Two different uses:
  – Unambiguous name for something
  – Location of a document
• Examples:
  – http://example.org/wiki/Main_Page
  – ftp://example.org/resource.txt
  – mailto:someone@example.com
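To make the anatomy of a URI concrete, the sketch below splits the three example URIs (reading the second as an `ftp` URI) into scheme, authority and path using Python's standard `urllib.parse`. The `describe` helper and the dictionary keys are illustrative names chosen here, not terminology from the lecture.

```python
from urllib.parse import urlparse

EXAMPLES = [
    "http://example.org/wiki/Main_Page",
    "ftp://example.org/resource.txt",
    "mailto:someone@example.com",
]


def describe(uri):
    """Split a URI into its scheme, authority (host) and path."""
    parts = urlparse(uri)
    return {"scheme": parts.scheme, "authority": parts.netloc, "path": parts.path}


if __name__ == "__main__":
    for uri in EXAMPLES:
        print(uri, "->", describe(uri))
```

The `mailto:` example has a scheme and a path but no authority, which shows that a URI can name something without pointing at a document on a host.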

Resource Description Framework (RDF)

• Resources (= nodes)
  – Identified by a Uniform Resource Identifier (URI)
• Properties (= edges)
  – Identified by a Uniform Resource Identifier (URI)
  – Binary relations between 2 resources

Example (http://elmonline.ca/sw/sparql/social.ttl):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://www.linkedin.com/in/mcourtot> a foaf:Person ;
    foaf:name "Melanie Courtot" ;
    foaf:knows <http://elmonline.ca/luke> ;
    foaf:knows <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> .

SPARQL: a query language for RDF

SELECT ?person WHERE {
    <http://www.linkedin.com/in/mcourtot> <http://xmlns.com/foaf/0.1/knows> ?person .
}

----------------------------------------------------------
| person                                                 |
==========================================================
| <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> |
| <http://elmonline.ca/luke>                             |
----------------------------------------------------------

• An excellent tutorial by Luke McCarthy: http://elmonline.ca/sw/sparql/
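The idea behind the query can be sketched without any RDF library: an RDF graph is a set of (subject, predicate, object) triples, and a SPARQL triple pattern is a triple in which some positions are variables. The code below is a toy matcher written for illustration only. It is not how a real SPARQL engine is implemented, it does not distinguish IRIs from literals, and the names `TRIPLES`, `match` and `who_is_known_by` are assumptions made here.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # what "a" abbreviates in Turtle
ME = "http://www.linkedin.com/in/mcourtot"

# Each RDF statement is one (subject, predicate, object) triple.
TRIPLES = [
    (ME, RDF_TYPE, FOAF + "Person"),
    (ME, FOAF + "name", "Melanie Courtot"),
    (ME, FOAF + "knows", "http://elmonline.ca/luke"),
    (ME, FOAF + "knows", "http://www.linkedin.com/pub/mark-wilkinson/1/674/665"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None plays the role of a variable."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def who_is_known_by(person):
    """Toy equivalent of: SELECT ?person WHERE { <person> foaf:knows ?person . }"""
    return [o for (_, _, o) in match(TRIPLES, subject=person, predicate=FOAF + "knows")]


if __name__ == "__main__":
    for known in who_is_known_by(ME):
        print(known)
```

A real deployment would load the Turtle file into a triple store and send it the SPARQL query; the pattern-matching principle is the same.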

The Web Ontology Language (OWL)

• Knowledge representation language
• Based on Description Logics: fragments of first-order logic that are decidable and have well-defined computational properties
• Sound, complete, terminating reasoners available


Linked open data cloud

Biological resources in LOD

Examples of issues in linking data incorrectly

• <http://dbpedia.org/resource/Welsh> owl:sameAs
  – <http://sw.cyc.com/2006/07/27/cyc/EthnicGroupOfWelsh>
  – <http://sw.cyc.com/2006/07/27/cyc/Welsh-TheWord>
  – <http://sw.cyc.com/2006/07/27/cyc/WelshLanguage>
  – <http://sw.cyc.com/2006/07/27/cyc/Welshing-Cheating>
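Why these links are a problem: owl:sameAs is symmetric and transitive, so linking one DBpedia resource to four Cyc concepts also declares those four concepts identical to each other. The sketch below makes that explicit by computing the equivalence classes with a small union-find. The function name `same_as_clusters` is an illustrative choice, and the code is not part of any linked-data toolkit.

```python
def same_as_clusters(pairs):
    """Group identifiers into equivalence classes implied by owl:sameAs links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


DBPEDIA_WELSH = "http://dbpedia.org/resource/Welsh"
CYC = "http://sw.cyc.com/2006/07/27/cyc/"
LINKS = [
    (DBPEDIA_WELSH, CYC + name)
    for name in ["EthnicGroupOfWelsh", "Welsh-TheWord", "WelshLanguage", "Welshing-Cheating"]
]

if __name__ == "__main__":
    for cluster in same_as_clusters(LINKS):
        print(sorted(cluster))
```

All five identifiers end up in a single cluster, which means a reasoner would treat the Welsh language, the Welsh ethnic group and "welshing" (cheating) as the same thing.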


Ontologies

• Representation of important things in a specific domain
  – Describes types of entities (e.g. cells), relations between them (e.g. prokaryotic cells and eukaryotic cells are cells), and their instances (e.g. the specific cells in my sample)
• An active computational artifact
  – A mathematical model based on a subset of first-order logic
  – Tools can automatically process ontologies
• A communication tool
  – Provides a dictionary for collaborators, a shared understanding
  – Allows data sharing

Reasoning is critical

• Prokaryotic cell and Eukaryotic cell are declared disjoint
• Fungal cell is a Eukaryotic cell
• Spore is a Fungal cell and a Prokaryotic cell
⇒ Unsatisfiability
⇒ Solution: clarify spore (sensu Mycetozoa) AND actinomycete-type spore

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
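The contradiction above can be reproduced with a very small check: collect every superclass of a class, then see whether two of them are declared disjoint. This is only a sketch of that single inference over plain subclass links. A real OWL reasoner handles far richer axioms, and the dictionaries and function names below are illustrative assumptions. The repaired hierarchy is likewise a guess at the shape of the fix, placing the two spore classes under eukaryotic and prokaryotic cell respectively.

```python
# Asserted subclass links: class -> set of direct superclasses.
SUBCLASS_OF = {
    "eukaryotic cell": {"cell"},
    "prokaryotic cell": {"cell"},
    "fungal cell": {"eukaryotic cell"},
    "spore": {"fungal cell", "prokaryotic cell"},
}
DISJOINT = [("prokaryotic cell", "eukaryotic cell")]


def ancestors(cls, subclass_of):
    """Return cls plus all of its superclasses, following links transitively."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def unsatisfiable_classes(subclass_of, disjoint):
    """Classes that inherit from two classes declared disjoint (no instance can exist)."""
    classes = set(subclass_of) | {p for parents in subclass_of.values() for p in parents}
    bad = set()
    for cls in classes:
        supers = ancestors(cls, subclass_of)
        if any(a in supers and b in supers for a, b in disjoint):
            bad.add(cls)
    return bad


# Hypothetical repair: split "spore" into two classes, one per kingdom-level parent.
FIXED = {
    "eukaryotic cell": {"cell"},
    "prokaryotic cell": {"cell"},
    "fungal cell": {"eukaryotic cell"},
    "spore (sensu Mycetozoa)": {"eukaryotic cell"},
    "actinomycete-type spore": {"prokaryotic cell"},
}

if __name__ == "__main__":
    print("before:", unsatisfiable_classes(SUBCLASS_OF, DISJOINT))
    print("after: ", unsatisfiable_classes(FIXED, DISJOINT))
```

Running the check on the original links flags only "spore"; after the split, no class inherits from both disjoint parents.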

Logics

• Simple example based on http://arxiv.org/pdf/1201.4089v1.pdf
• Ontology file available from http://www.sfu.ca/~mcourtot/course/20141112BigDataSemWebOntologies/ontology.owl
• Manipulation done using Protégé: http://protege.stanford.edu

Family ontology

Logics of a grandfather

Reasoning

Inferred class hierarchy

Explanations

A wrong assertion

Unsatisfiability


OBO Foundry

A subset of biological and biomedical ontologies whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development, designed to ensure:

• tight connection to the biomedical basic sciences
• compatibility
• interoperability, common relations
• formal robustness
• support for logic-based reasoning

http://www.obofoundry.org

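OBO Foundry coverage. Rows: granularity. Columns: relation to time.

----------------------------------------------------------------------------------------------------------------------------
| GRANULARITY                 | INDEPENDENT CONTINUANT          | DEPENDENT CONTINUANT       | OCCURRENT                   |
============================================================================================================================
| ORGAN AND ORGANISM          | Organism (NCBI Taxonomy?)       | Organ Function (FMP, CPRO) | Organism-Level Process (GO) |
|                             | Anatomical Entity (FMA, CARO)   | Phenotypic Quality (PaTO)  |                             |
| CELL AND CELLULAR COMPONENT | Cell (CL)                       | Cellular Function (GO)     | Cellular Process (GO)       |
|                             | Cellular Component (FMA, GO)    |                            |                             |
| MOLECULE                    | Molecule (ChEBI, SO, RnaO, PrO) | Molecular Function (GO)    | Molecular Process (GO)      |
----------------------------------------------------------------------------------------------------------------------------

Slide credit: Barry Smith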

Minimum Information to Reuse an External Ontology Term (MIREOT)

• OBO and the Semantic Web promote reuse of resources
• Biological resources (e.g., FMA for anatomy), taken together, are too big for current tool support
• MIREOT is used across the OBO library
  – OBI: 400 mireoted terms (140 GO, 55 ChEBI, 50 PATO)
  – PR (Protein Ontology): 23,000 mireoted terms
• http://ontofox.hegroup.org

Examples of OBO ontologies

• OBI, the Ontology for Biomedical Investigations
• VO, the Vaccine Ontology
• AERO, the Adverse Event Reporting Ontology

Ontology for Biomedical Investigations (OBI)

• OBI is a multi-community project driven by the practical needs of its members, with the goal of building a high-quality, interoperable reference ontology
• OBI high-level classes, solidified over several years, are in place and cover all aspects of biomedical investigations
• OBI is expanded to enable member applications and based on term requests

High-level class hierarchy (partial)

Slide credit: OBI Consortium

Slide credit: Alan Ruttenberg

Slide credit: OBI Consortium

Representing vaccine data – the Vaccine Ontology (VO)

Picture credit: Yongqun He


Representing pharmacovigilance data

• The Adverse Event Reporting Ontology (AERO)
• Encodes existing clinical guidelines (Brighton Collaboration)

[Figure: AERO model of a patient examination and its classification.
Entities shown: Patient, Clinician, Patient examination, Clinical Report (exam report of June 7), Clinical Finding (finding of rash), Medically relevant entity (rash), Anatomical system (dermatological system).
Relations shown: has participant, has specified input, has specified output, is about, found to exhibit, part of, located in, involves, has component.
Inferences shown: 'found to exhibit' some 'generalized urticaria or generalized erythema finding' is inferred to be of type "major dermatological criterion for anaphylaxis according to Brighton"; 'found to exhibit' some 'measured hypotension finding' is inferred to be of type "major cardiovascular criterion for anaphylaxis according to Brighton". Both criteria are linked by "has component" to "Level 1 of certainty of anaphylaxis according to Brighton".]

Background and problem statement

• Surveillance of Adverse Events Following Immunization is important
  – Detection of issues with vaccines
  – Importance of vaccine-risk communication
• Analysis of AE reports is a subjective process that is costly in time and money
  – Manual review of the textual reports

Workflow

• Hypothesis: use the AERO I developed to annotate and classify a dataset
• VAERS dataset
  – Vaccine Adverse Event Reporting System
  – 6032 reports: ~5800 negative, ~230 positive
  – Post H1N1 immunization 2009/2010
  – Manually classified for anaphylaxis
• MedDRA (Medical Dictionary for Regulatory Activities) is used to represent clinical findings

Automated diagnosis workflow

[Workflow diagram, steps A to D. Components shown: Adverse Event Reporting Ontology (AERO); VAERS dataset; Brighton annotations (~800 MedDRA terms mapped to 32 Brighton terms); storage labels "ASCII files" and "MySQL"; OWL/RDF export; reasoner; manually curated dataset.]

Results

[Same workflow diagram as on the previous slide]

At best cut-off point: sensitivity 57%, specificity 97%
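For readers less familiar with the two measures: sensitivity is the fraction of truly positive reports that the classifier flags, and specificity is the fraction of truly negative reports that it leaves unflagged. The sketch below only illustrates the definitions. The counts in it are toy numbers chosen for easy arithmetic and are not taken from the VAERS study.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that were classified as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that were classified as negative."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Toy confusion matrix (NOT the study's data): 10 positives, 100 negatives.
    tp, fn, tn, fp = 8, 2, 90, 10
    print(f"sensitivity = {sensitivity(tp, fn):.0%}")
    print(f"specificity = {specificity(tn, fp):.0%}")
```

In a dataset with few positives, as in the one described in the lecture, a high specificity matters because even a small false-positive rate applies to thousands of negative reports.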

AE classification can be improved through the use of ontologies

• Manual analysis: 3 months for 12 medical officers
• Ontology-based analysis: once data is collected (2 months), almost instantaneous (2h on a laptop)
=> Could allow for earlier detection of safety issues and better understanding of adverse events

[Timeline figure, November 2009 to January 2010: ability to detect a signal over time for ~6000 reports, comparing manual analysis (3 months) with ontology-based analysis (2h automated) and marking the time gain.]

http://dx.doi.org/10.1371/journal.pone.0092632


IRI dereferencing

Ontobee: publishing biomedical resources on the Semantic Web

HTML for humans …

… RDF for machines


The Integrated Rapid Infectious Disease Analysis (IRIDA) project

• Goal: automate infectious disease outbreak detection and investigation
• Issues:
  – Integrate whole-genome sequencing (WGS), clinical and lab info
  – Provide relevant tools and validate pipeline
• Methods:
  – Data standards for information exchange
  – Analysis pipeline (Galaxy based)
  – User interface
  – Additional tools:
    • IslandViewer
    • GenGIS

Building the IRIDA data standards

• Interview key personnel at BCCDC
• Review existing resources
• Identify “holes”, i.e., missing bits
• Collect existing data
• Liaise with implementation team
• Generate cohesive resource
• Validate

Relevant data standards

• TypON, the typing ontology
• OBI, the Ontology for Biomedical Investigations
• NGSOnto, the Next Generation Sequencing Ontology
• NIAID-GS-BRC core metadata
• TRANS, the Pathogen Transmission Ontology
• ExO, the Exposure Ontology
• EPO, the Epidemiology Ontology
• IDO, the Infectious Disease Ontology
• Food: USDA, EFSA?

Relevant international efforts

• MIxS standard
• Global Microbial Identifier
• Global Alliance for Genomics and Health
• NCBI BioSample
• European Nucleotide Archive
• …

Remaining challenges

• Trust, provenance
  – Ability to track origin of data to assess whether it is trustworthy
• Data sharing, reuse, policy
  – Social and legal issues in getting access to data
• Confidentiality
  – Privacy concerns when linking data


Take home message

Big data is a big challenge, but we can deal with it if done properly: that will be your responsibility.

• DO NOT build a black box
• DO annotate and describe your data
• DO make your data openly available

Acknowledgements

• Drs. Fiona Brinkman, Will Hsiao, Ryan Brinkman
• The Brinkman^2 labs
• Alan Ruttenberg, Barry Smith, Chris Mungall & OBO
• Colleagues at the Public Health Agency of Canada (Ms Lafleche, Dr Law)
• The IRIDA consortium and the IRIDA ontology working group (Emma Griffiths and Damion Dooley)

Mélanie Courtot, PhD
mcourtot@sfu.ca
@mcourtot
http://purl.org/net/mcourtot