Apache Stanbol and the Web of Data - ApacheCon 2011


Presentation on Apache Stanbol (incubating) and related projects, given by Olivier Grisel during ApacheCon 2011. More information: http://incubator.apache.org/stanbol/ and http://www.iks-project.eu


  • 1. Apache Stanbol (Incubating) and the Web of Data Olivier Grisel, Nuxeo ogrisel@apache.org, 2011-11-11 11/7/11

2. My Background

    • Olivier Grisel - R&D Engineer
    • Nuxeo: Open Source ECM
    • European project: IKS
    • Stuff I do:
      • Machine Learning
      • Natural Language Processing
      • All things data

3. Agenda

    • The Web of Data: what, why, how?
    • CMS integration demo
    • Semantic Components in Stanbol
    • Building models for Stanbol

4. The Web of Data: What, Why, How?

5.

6. "To a computer, then, the web is a flat, boring world devoid of meaning." Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

7. "This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them." Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

8. "The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

9. "Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values." Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

10. The Web of Data: What?

    • Shared description of the real world
      • Structured with vocabularies
      • Decentralized
      • Scoped by namespaces
      • Linked
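
A minimal sketch of what these properties look like as data, assuming a graph held as plain Python tuples rather than a real triple store. The namespace IRIs are well-known public vocabularies; the specific triples and the helper names `describe` and `follow` are chosen only for illustration and are not part of the talk.

```python
# Namespace prefixes scope each vocabulary (well-known public IRIs).
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Illustrative (subject, predicate, object) triples chosen for this sketch.
triples = [
    (DBR + "Tim_Berners-Lee", RDF_TYPE, FOAF + "Person"),
    (DBR + "Tim_Berners-Lee", FOAF + "name", "Tim Berners-Lee"),
    (DBR + "Tim_Berners-Lee", DBO + "birthPlace", DBR + "London"),
    (DBR + "London", DBO + "country", DBR + "United_Kingdom"),
]


def describe(subject, graph):
    """Return every (predicate, object) pair stated about a subject."""
    return [(p, o) for s, p, o in graph if s == subject]


def follow(subject, predicate, graph):
    """Follow a link: objects reachable from subject through predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


# "Linked": the object of one statement is the subject of others.
birth_place = follow(DBR + "Tim_Berners-Lee", DBO + "birthPlace", triples)[0]
print(birth_place)
print(describe(birth_place, triples))
```

Because every identifier is a full IRI, two publishers can describe the same resource independently and the statements still merge into one graph.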

11. The Web of Data: Why?

    • Strings are ambiguous
      • New York / The Big Apple / NYC
      • Washington (Person, State, City, Sports Team...)
    • Structured context helps humans
      • Who is this guy?
      • Where is this city?
    • Conceptual frame helps machines
      • Explicit user intent decoding
      • Smarter indexing / search?

12. Decoding User Intents

13. Decoding User Intents

    • Next Generation User Interfaces
      • Siri - conversational interface
      • IBM DeepQA: Watson for Health Care
    • Tell Google about your stuff
      • Publish structured prediction of your products
      • "3 bedrooms flat near Montmartre"
    • Useful for non-public data as well
      • Intranet query: "ApacheCon slides"
      • Intranet query: "Xerox invoices"
      • Intranet query: "Xerox salesperson email"

14. The Web of Data: How?

    • RDF / TripleStores / SPARQL
      • Graph stores with dynamic schemas
      • Strong interoperability
    • JSON-LD
      • Upgrade your JSON with scoped vocabularies
      • Web / Mobile / JS developer friendly
    • RDFa + schema.org & rNews
      • Publish annotation in structured markup
      • Vocabulary understood by Search Engines

15. HTML example

My name is Manu Sporny and you can give me a ring via 1-800-555-0155. I have a blog.
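
The contact details in the HTML example could also be carried as JSON with a scoped vocabulary. The sketch below is an assumption-laden illustration, not the markup from the slides: the FOAF property IRIs are real vocabulary terms, but the blog URL is a placeholder and `expand_keys` is a hypothetical helper that performs only a naive key expansion, not the full JSON-LD expansion algorithm.

```python
import json

# Hypothetical JSON-LD document for the contact details in the HTML example.
# The "@context" maps short keys to FOAF IRIs; the blog URL is a placeholder.
doc = {
    "@context": {
        "name": "http://xmlns.com/foaf/0.1/name",
        "phone": "http://xmlns.com/foaf/0.1/phone",
        "weblog": {"@id": "http://xmlns.com/foaf/0.1/weblog", "@type": "@id"},
    },
    "name": "Manu Sporny",
    "phone": "tel:+1-800-555-0155",
    "weblog": "http://example.org/blog",
}


def expand_keys(document):
    """Naively replace each short key with the IRI declared in @context."""
    context = document["@context"]
    expanded = {}
    for key, value in document.items():
        if key == "@context":
            continue
        term = context[key]
        # A term definition is either a plain IRI string or a dict with "@id".
        iri = term["@id"] if isinstance(term, dict) else term
        expanded[iri] = value
    return expanded


print(json.dumps(doc, indent=2))
print(json.dumps(expand_keys(doc), indent=2))
```

A plain JSON consumer can ignore `@context` and read `name` and `phone` directly, while a Linked Data consumer resolves the same keys to unambiguous IRIs.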

16. RDFa example

My name is Manu Sporny and you can give me a ring via 1-800-555-0155. I have a blog.

17. JSON-LD example

18. 2007 2008 2009 2010

19. 2011

20. Bridging the Web of Data and my CMS

21.

22. Apache Stanbol

    • Enhancer
      • Text analysis with Apache OpenNLP / Tika
    • EntityHub / ContentHub
      • Linked Data Indexing with Apache Solr
      • Graph Storage with Apache Clerezza / Jena
    • Reasoner / Rules
      • Inference with Apache Jena & OWLApi
    • Components / HTTP Services
      • OSGi with Apache Felix / JAX-RS with Jersey

23. 24. 25. 26.

27. RESTful is Beautiful

28. Minimalist HTTP Client

      curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" --data "John Smith was born in London." http://stanbol.demo.nuxeo.com/engines

29. 30.

31. [Diagram labels: Local IT infrastructure (LAN); Nuxeo DM addon; Apache Stanbol; Engine 1, Engine 2, Engine 3; DBpedia, Freebase, Geonames, LDAP]

32. Stanbol Enhancer

    • Chain of Enhancement Engines
      • Language Detection (Tika)
      • Named Entity Detection (OpenNLP)
      • Linked Data dereferencing (Solr)
      • Refactoring / Translation (Jena)

33. Stanbol EntityHub

    • Referenced Sites
      • DBpedia
      • Geonames
      • (NY Times, MusicBrainz, ProductDB, UniProt...)
    • Fast local offline indices (Solr)
    • Batch indexing utilities for RDF dumps
    • Multilingual fulltext search in labels & descriptions
    • Vocabulary mapping / merging

34. Stanbol Reasoner

    • RDFS / OWL-lite / OWL2
    • Consistency checks
      • Cardinality checks: each person has 1 birth date
      • Range constraints: birth dates are valid dates
    • Materializing types / properties
      • Types from subclass: Musician > Artist > Person
      • Symmetric property: A worked with B
      • Transitive property: A is located in B
    • Query-time expansion / inference?

35. Stanbol Rules

    • Simple Prolog-like language

      uncleRule[ has(, ?x, ?z) . has(, ?z, ?y) -> has(, ?x, ?y) ]

    • SPARQL CONSTRUCT or SWRL

      PREFIX family:
      CONSTRUCT { ?x family:hasUncle ?y }
      WHERE { ?x family:hasParent ?z . ?z family:hasSibling ?y }

36. Online Demos

    • Simple analyzer with small index
      • https://stanbol.demo.nuxeo.com
    • All services deployed
      • http://dev.iks-project.eu:8081

37. Building Stanbol Enhancer models from Wikipedia with the Apache data tools

38. Universal Topic Classification

    • Use Apache Lucene / Solr MoreLikeThis to perform a truncated nearest neighbors query in the TF-IDF vector space of Wikipedia

39. Universal Topic Classification

    • Index text of all articles grouped by topic
    • Solr MoreLikeThis query on new document
    • DBpedia dumps provide:
      • Text summaries for each article
      • subject relationships between articles and topics
      • broader / narrower SKOS hierarchy between topics

40. About the Data

    • 500k purely technical categories
    • People_with_missing_birth_place, Rivers_in_Romania
    • 70k semantically grounded categories
    • Paths to roots require both technical and grounded categories
    • Scale:
      • 1.2M topic / topic links
      • 30M topic / article links

41. Some results (Wikinews)

    • US children who celebrate Independence Day more likely to become Republicans, says Harvard study
      • Fireworks
      • Voting theory
      • Republican Party (United States)
      • Statistics
      • Electoral systems

42. Some results (Wikinews)

    • U.S. space agency NASA sues ex-astronaut
      • American astronauts
      • Aviation halls of fame
      • Edwards Air Force Base
      • Apollo program
      • Exploration of the Moon

43. Some results (Wikinews)

    • Hundreds of thousands of British public sector workers strike over planned pension changes
      • Retirement in the United Kingdom
      • United Kingdom pensions and benefits
      • Pensions in the United Kingdom
      • Labor disputes by country
      • Labor disputes

44. Some results (PLoS One)

    • Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways
      • Renal physiology
      • Kidney
      • Nephrology
      • Hypertension
      • Membrane biology

45. Wrap Up

    • Web of Data brings
      • Structured Context
      • Frame to decode User Intention
    • NLP + Entities & Topics indices
      • to automate Content Enrichment
      • to provide Disambiguation

46. Resources

    • Documentation, svn, mailing list: http://incubator.apache.org/stanbol
    • IKS project blog: http://blog.iks-project.eu
    • Blog posts about Semantic ECM: http://blogs.nuxeo.com/dev/semantic/

47. Thank you for your attention!

    • Olivier Grisel [email_address]
    • https://twitter.com/ogrisel

48. Training models for NER from Wikipedia

    • Extract sentences with link positions in Wikipedia articles
    • DBpedia to find the type of the target entity (Person, Location, Organization)
    • Apache Pig scripts to compute the join + format the result as training files for OpenNLP
    • Apache OpenNLP to build and evaluate the models
    • Apache Hadoop / Apache Whirr for distributed processing

49. 50. 51. 52.
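
The NER training pipeline on slide 48 joins link positions with entity types and writes OpenNLP training files. The sketch below illustrates that join and formatting step on a single in-memory sentence; it is not the Pig scripts from the talk. The entity IRIs, the `entity_types` mapping, and the function names are hypothetical, and a Python dictionary lookup stands in for the distributed join. The `<START:type> ... <END>` markup is the OpenNLP name finder training format.

```python
# Hypothetical type lookup, standing in for types derived from DBpedia.
entity_types = {
    "http://example.org/entity/John_Smith": "person",
    "http://example.org/entity/London": "location",
}

# One tokenized sentence with link positions as (start, end, target) token
# offsets, end exclusive. In the talk these come from Wikipedia articles.
sentence = {
    "tokens": ["John", "Smith", "was", "born", "in", "London", "."],
    "links": [
        (0, 2, "http://example.org/entity/John_Smith"),
        (5, 6, "http://example.org/entity/London"),
    ],
}


def to_opennlp_line(tokens, spans):
    """Format tokens and (start, end, type) spans as one training line."""
    starts = {start: etype for start, _end, etype in spans}
    ends = {end for _start, end, _etype in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")
        if i in starts:
            out.append("<START:%s>" % starts[i])
        out.append(token)
    if len(tokens) in ends:  # close a span that runs to the last token
        out.append("<END>")
    return " ".join(out)


def annotate(sent, types):
    """Join link targets with their types, dropping links of unknown type."""
    spans = [(s, e, types[uri]) for s, e, uri in sent["links"] if uri in types]
    return to_opennlp_line(sent["tokens"], spans)


print(annotate(sentence, entity_types))
# <START:person> John Smith <END> was born in <START:location> London <END> .
```

Links whose target has no known type are skipped rather than guessed, which keeps wrongly typed spans out of the training file at the cost of leaving some entity mentions unlabeled.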