RDF4PMC, RDFizing PubMed Central

Alexander Garcia (1), Leyla Jael García Castro (2), Casey McLaughlin (1)
(1) Florida State University, (2) Universitat Jaume I

2/6/13, Biotea, RDF4PMC




Outline

• The Biotea project
• Why Semantic Web Technologies?
• RDF4PMC in a nutshell
  • Architecture
  • RDFization process
    • PMC RDFization
    • Content enrichment
  • Some numbers for RDF4PMC
  • Architecture
• Using the data
  • SPARQL
  • Bio2RDF integration
  • Web services
  • A first prototype
• Challenges and Lessons
• Currently working on…
• Future Work
• Conclusions
• Acknowledgments


Biotea

• Methodologies, methods and techniques supporting semantic enrichment of scholarly communication

• Once enriched, how does this change our user experience?


"Scholarly data and documents are of most value when they are interconnected rather than independent" (Christine L. Borgman)


Biotea  

• How are publications connected to each other?
  • Putting together explicit assertions from different papers to form new implicit assertions

• Semantic Web Technology supporting scholarly communication, Literature Based Discovery and the Search-Retrieval-and-Interacting-with-the-Document (SRID) processes



Why Semantic Web Technologies (SWT) for research documents

• Generates an adaptable, open approach: the data becomes the platform
• The Semantic Web (SW) delivers an integrative platform
• Makes it easier for the community to build on the platform
• Simplifies programmatic access to information
  • Retrieve all papers that have a component X (CHEBI) and the cellular location in GO terms
  • As simple as relating terminologies
• Delivers Social Network ready content
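The retrieval example in the list above (papers with a CHEBI component and a GO cellular location) can be sketched as a SPARQL query assembled in Python. This is only an illustration under stated assumptions, not the project's own query: it assumes annotations are modelled with ao:annotatesResource and ao:hasTopic on bibo:AcademicArticle resources, as in the SPARQL slides later in this deck. The prefix IRIs, the helper names obo_uri and build_component_location_query, and the GO identifier in the usage line (GO:0005739, mitochondrion) are assumptions added for the example.

```python
# Assumed prefix IRIs; verify them against the dataset before relying on them.
PREFIXES = (
    "PREFIX bibo: <http://purl.org/ontology/bibo/>\n"
    "PREFIX ao: <http://purl.org/ao/core/>\n"
)

OBO_BASE = "http://purl.obolibrary.org/obo/"


def obo_uri(curie: str) -> str:
    """Turn a CURIE such as 'CHEBI:60004' into an OBO PURL."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO_BASE}{prefix}_{local_id}"


def build_component_location_query(chebi: str, go: str, limit: int = 50) -> str:
    """Build a query for articles annotated with both a ChEBI and a GO term."""
    return (
        f"{PREFIXES}"
        "SELECT DISTINCT ?pmid\n"
        "WHERE {\n"
        "  ?article a bibo:AcademicArticle ;\n"
        "           bibo:pmid ?pmid .\n"
        # One annotation must point at the chemical component ...
        "  ?chemAnn ao:annotatesResource ?article ;\n"
        f"           ao:hasTopic <{obo_uri(chebi)}> .\n"
        # ... and a second annotation at the cellular location.
        "  ?locAnn ao:annotatesResource ?article ;\n"
        f"          ao:hasTopic <{obo_uri(go)}> .\n"
        "}\n"
        f"LIMIT {limit}\n"
    )


if __name__ == "__main__":
    print(build_component_location_query("CHEBI:60004", "GO:0005739"))
```

Relating the two terminologies comes down to requiring two annotations on the same ?article, which is why the query stays short.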


RDF4PMC  in  a  nutshell    

• Delivers an interoperable, interlinked, and self-describing document model in the biomedical domain.
  • A network of interconnected documents
  • Semantic infrastructure for PMC
  • An interface to the Web of Data
  • A knowledge model for biomedical literature, easily extendible


RDF4PMC  in  a  nutshell    

• RDFizing biomedical literature by orchestrating ontologies such as
  • DoCO, BIBO, DC, FOAF, W3C PROV, and others
• Datasets are available
  • RDF for metadata and content
  • RDF for annotations from text-mining
• RDFizator will be available
  • Adding other ontologies and annotators is possible
  • Working with XML from other sources is possible


PMC RDFization

[Diagram labels: PMC NXML; RDF Generation (RDFReactor, BIBO); References Enrichment; Metadata + Content + References]
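To make the NXML-to-RDF step concrete, here is a minimal sketch in Python. It is not the project's RDFizator (the slide names RDFReactor, a Java library); it uses only the standard library's ElementTree and emits N-Triples by hand. The sample NXML fragment, the ARTICLE_BASE IRI and the function names are hypothetical, and only the PMID and the title are mapped, using bibo:pmid and dcterms:title.

```python
import xml.etree.ElementTree as ET
from typing import List

BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical base IRI for article resources; not the project's real scheme.
ARTICLE_BASE = "http://example.org/pmcdoc/"

# Made-up fragment that follows the NXML element names for article metadata.
SAMPLE_NXML = """<article>
  <front><article-meta>
    <article-id pub-id-type="pmid">12345678</article-id>
    <title-group><article-title>A sample title</article-title></title-group>
  </article-meta></front>
</article>"""


def literal(text: str) -> str:
    """Escape a string as a plain N-Triples literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def nxml_to_ntriples(nxml: str) -> List[str]:
    """Map the PMID and title of one NXML article to N-Triples lines."""
    root = ET.fromstring(nxml)
    pmid = root.findtext(".//article-id[@pub-id-type='pmid']")
    if pmid is None:
        raise ValueError("article has no PMID")
    pmid = pmid.strip()
    subject = f"<{ARTICLE_BASE}{pmid}>"
    triples = [
        f"{subject} <{RDF_TYPE}> <{BIBO}AcademicArticle> .",
        f"{subject} <{BIBO}pmid> {literal(pmid)} .",
    ]
    title_el = root.find(".//article-title")
    if title_el is not None:
        # Titles may contain inline markup, so gather all nested text.
        title = " ".join("".join(title_el.itertext()).split())
        triples.append(f"{subject} <{DCTERMS}title> {literal(title)} .")
    return triples


if __name__ == "__main__":
    print("\n".join(nxml_to_ntriples(SAMPLE_NXML)))
```

Sections, paragraphs and references would be handled by further mappings of the same kind; they are left out to keep the sketch short.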


Annotations: Content Enrichment

[Diagram labels: Metadata + Content + References; Automatic Annotation (web services); RDF Generation; Enriched RDF]


RDF4PMC,  some  numbers  


RDF4PMC Server Architecture

[Diagram labels: Master Server; PMC RDFization; Import scripts + RDF files; RDF DB Master; RDF DB Slave (two instances); Web & SPARQL Server (development); Web & SPARQL Server (production)]


Consuming  the  data:  SPARQL  


SPARQL query:

    SELECT ?pmid ?title ?secTitle ?text
    WHERE {
      ?article a bibo:Document ;
               bibo:pmid ?pmid ;
               dcterms:title ?title .
      ?section a doco:Section ;
               dcterms:isPartOf ?article ;
               dcterms:title ?secTitle .
      FILTER (regex(str(?secTitle), "introduction", "i")) .
      ?para a doco:Paragraph ;
            dcterms:isPartOf ?section ;
            cnt:chars ?text .
      FILTER (regex(str(?text), "cancer", "i")) .
    } LIMIT 50

→ Query expressed in natural language:

Retrieving PubMed identifier, article title, section title, and paragraphs for those articles containing the term “cancer” in any section whose title includes “introduction”
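A query like this one is normally sent to the server over HTTP. The sketch below shows one way to do that from Python with the standard library: it builds a SPARQL Protocol GET request and flattens a response in the SPARQL JSON results format. The endpoint address is hypothetical (the slides do not give one), the function names are made up for the example, and the sample response in the demo is invented.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; the slides do not state the SPARQL endpoint address.
ENDPOINT = "http://example.org/sparql"


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def bindings_to_rows(results_json: str) -> List[Dict[str, str]]:
    """Flatten a SPARQL JSON result document into plain dictionaries."""
    doc = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


if __name__ == "__main__":
    # Invented response body in the standard SPARQL JSON results layout.
    sample = json.dumps({
        "head": {"vars": ["pmid"]},
        "results": {"bindings": [{"pmid": {"type": "literal", "value": "12345678"}}]},
    })
    print(bindings_to_rows(sample))
    request = build_sparql_request(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
    # Passing the request to urllib.request.urlopen would actually send it.
    print(request.full_url)
```

The request is only built, never sent, so the sketch runs without network access.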


Consuming  the  data:  SPARQL  


SPARQL query:

    SELECT DISTINCT ?pmid
    WHERE {
      ?article a bibo:AcademicArticle ;
               bibo:pmid ?pmid .
      ?annotation a aot:ExactQualifier ;
                  ao:annotatesResource ?article ;
                  ao:hasTopic <http://purl.obolibrary.org/obo/CHEBI_60004> .
    }

→ Query expressed in natural language:

Retrieving PubMed identifier for those articles that have been semantically annotated with the biological entity CHEBI:60004. The semantic annotation comes from the occurrence of the term “mixture” in any paragraph of the retrieved articles.

CHEBI:60004: A mixture is a chemical substance composed of multiple molecules, at least two of which are of a different kind


Bio2RDF  Integration  


[Diagram labels: BIBO; Metadata & References; Content Annotations]


Consuming  the  data:  Web  services  


Retrieval Service

• A list of terms and their related topics: http://biotea.idiginfo.org/api/terms
• A list of topics and their related vocabularies: http://biotea.idiginfo.org/api/topics
• All topics related to a term, e.g., http://biotea.idiginfo.org/api/topics?term=cancer
• All vocabularies related to a term, e.g., http://biotea.idiginfo.org/api/vocabularies?term=cancer
• All terms that start with a specific string (for autocompletion), e.g., http://biotea.idiginfo.org/api/terms?prefix=canc
• All topics related to a vocabulary, e.g., http://biotea.idiginfo.org/api/topics?vocabulary=po
• RDF of articles that include a term, e.g., http://biotea.idiginfo.org/api/articles?term=cancer
• Count of RDF of articles that include a term, e.g., http://biotea.idiginfo.org/api/articles?term=cancer&count=true
• A list of vocabularies and their prefixes: http://biotea.idiginfo.org/vocabularies
• RDF of articles that include a vocabulary, e.g., http://biotea.idiginfo.org/api/articles?vocabulary=po
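The retrieval URLs above all follow the same pattern: a resource name under /api plus optional query parameters. A small helper can assemble them. This is a sketch that only builds URL strings and does not contact the service; the function name biotea_url is made up for the example, and the resource and parameter names are taken from the list above.

```python
from urllib.parse import urlencode

# Base address as it appears in the retrieval service examples.
API_BASE = "http://biotea.idiginfo.org/api"


def biotea_url(resource: str, **params: str) -> str:
    """Build a retrieval-service URL for a resource and optional parameters."""
    query = urlencode(params)
    return f"{API_BASE}/{resource}" + (f"?{query}" if query else "")


if __name__ == "__main__":
    print(biotea_url("terms"))
    print(biotea_url("topics", term="cancer"))
    print(biotea_url("terms", prefix="canc"))
    print(biotea_url("articles", term="cancer", count="true"))
```

Using urlencode rather than string concatenation keeps terms with spaces or special characters correctly escaped.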


Consuming the data: a dashboard for semantic bio-publications

[Diagram labels: Metadata + Content + References; Automatically Annotated RDF; SPARQL; Catalase; Semantically enriched publication]


Consuming the data: first prototype

[Screenshot labels: Links; Title & authors; Cloud of bio-annotations (term + # of bio-entities); Abstract; Paragraphs containing the annotation selected by the user; Graphical tools]


Consuming the data: A first prototype


Challenges and Lessons

• Content
  • Tables and images → Links
  • Inline tables → Format is lost
  • Supplementary material
  • Most of them follow one DTD but …
• References
  • At least 4 different styles
  • Sometimes they are just plain text
• Annotators
  • Not always available
  • Stop words are tricky


Challenges and Lessons

• Where are the facts? How to validate the facts?
• Delivering the expressivity of the data set to the end user is a complex issue
• Annotation is context dependent
• Maintaining the triple store has a learning curve of its own
• Building SW infrastructure is HARD


Currently working on: Literature Discovery Process

• Search
  • Usually string-based search mechanisms
  • Little cognitive support
• Retrieval
  • Simple list of DB entries
  • Little cognitive support
  • How, why and where are a set of documents similar?
• Interacting with the document
  • Straight into the PDF
  • Zero cognitive support
  • Data availability


Future Work

• RDF
  • URI standardization following similar patterns to identifiers.org and Bio2RDF
  • Integration into Bio2RDF
  • Dataset identification and summary (VoID)
  • Improve data for references
• User Experience
  • Web services for data analysis
  • RDF browser
  • More visualization tools
  • Supporting and taking advantage of the structure of the document
  • Collaborative element


Future Work

• Application in Clinical Psychology, the MSRC case
  • From PDF to XML to RDF to Enriched Metadata for the PDF
  • The PDF is gently introduced in the Web of Data (WoD)
• Once the metadata has been enriched then
  • Rich interaction supporting: SEARCH-RETRIEVAL-INTERACTION WITH THE DOCUMENT (PDF)


Conclusions

• We provide
  • the transformation into RDF from the original PMC files
  • the annotation of the RDF
  • an API which makes that data available.
• New vocabularies as well as annotators can easily be plugged in
• Our approach is useful for both open and non-open access datasets
  • Publishers may decide what to expose via RDF and what content to make available
• Our approach is also applicable for PDF-only environments


Acknowledgments

• The MSRC consortium
• Greg Riccardi, FSU
• Oscar Corcho, UPM
• Olga Giraldo, UPM
• Bob Morris, Harvard University
• Michel Dumontier, Carleton University
• Dietrich Rebholz-Schuhmann, University of Zurich
• Diane Leiva, FSU
• US DoD Grant MOMRP Grant w81xwh-10-2-0181
• All of those who gave us feedback about the RDFization and the quality of our RDF datasets


Thanks for your attention

Contacts
• Alexander García: [email protected]
• L. Jael García Castro: [email protected]

     
