Page 1: The CFR meets the Semantic Web

The CFR meets the Semantic Web

(with a little unnatural language processing thrown in)

Page 2: The CFR meets the Semantic Web

Background: a two-part history of the Semantic Web

•SW is a maze of confusing buzzwords

•Can be thought of in two parts

•Pre-2005 (the “top-down” period)

•Post-2005 (the “bottom-up” period)

Page 3: The CFR meets the Semantic Web

SW Pre-2005

o A fascination with inferencing & top-down analysis

o Staked out a lot of theoretical territory

o Built basic standards:

• RDF (statement encoding): saying things about things

• OWL (modeling and inferencing): describing relationships between things -- that is, creating ontologies

Page 4: The CFR meets the Semantic Web

SW from 2005 to now

o SW now seen as a big heap of statements

o Became more practical

o SKOS (inexpensive conversion method/standard for metadata)

o Linked Data (altruistic, like named anchors ca. 1992)

o Could be seen -- from a library point of view -- as a new set of techniques for metadata management better suited to the Web

Page 5: The CFR meets the Semantic Web

The Semantic Web at the LII

•Tying legal information to the real world, not just itself

•Applications like:

o Improvements to existing finding aids

Table of Popular Names, Tables I and III

Finer-grained, more expressive PTOA

o Search enhancement via term substitution and expansion

o Publication of “regulated nouns” and definitions as Linked Data

•Research-driven engineering as a practice/culture

Page 6: The CFR meets the Semantic Web

Why use the SW toolset?

•Sometimes the whole thing looks like an illustration of the Two Fool Rule

•Why RDF?

o XML is more cumbersome and less expressive

o RDF supports inferencing

o RDF allows processing of partial information

•Why SPARQL?

o um, SPARQL is how you query RDF
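As a rough illustration of "saying things about things" and of querying with partial information, here is a toy in-memory triple store. It is not RDF or SPARQL itself, only a sketch of the idea; the `ex:` identifiers and the `match` helper are invented for this example.

```python
# Toy triple store: each statement is a (subject, predicate, object) tuple.
# All identifiers below are made up for illustration.
TRIPLES = {
    ("ex:tylenol", "ex:hasComponent", "ex:acetaminophen"),
    ("ex:acetaminophen", "rdfs:label", "acetaminophen"),
    ("ex:tylenol", "rdfs:label", "Tylenol"),
}


def match(pattern, triples=TRIPLES):
    """Return the triples that fit a pattern; None is a wildcard,
    loosely like a variable in a SPARQL graph pattern."""
    return sorted(
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


# Everything known about ex:tylenol, however little that is.
print(match(("ex:tylenol", None, None)))
```

Nothing here requires a complete record for every subject, which is the practical meaning of "processing of partial information": a query simply returns whatever statements exist.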

Page 7: The CFR meets the Semantic Web

Why use SKOS?

o it's a simple knowledge organization system

o lightweight representation of things we need a lot:

o thesauri

o taxonomies

o classification schemes

o it might be a little too simple

Page 8: The CFR meets the Semantic Web

SKOS: DRIVING INTO A DITCH

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://www.my.com/#canals">
    <skos:definition>A feature type category for places such as the Erie Canal</skos:definition>
    <skos:prefLabel>canals</skos:prefLabel>
    <skos:altLabel>canal bends</skos:altLabel>
    <skos:altLabel>canalized streams</skos:altLabel>
    <skos:altLabel>ditch mouths</skos:altLabel>
    <skos:altLabel>ditches</skos:altLabel>
    <skos:altLabel>drainage canals</skos:altLabel>
    <skos:altLabel>drainage ditches</skos:altLabel>
    <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
    <skos:related rdf:resource="http://www.my.com/#channels"/>
    <skos:related rdf:resource="http://www.my.com/#locks"/>
    <skos:related rdf:resource="http://www.my.com/#transportation%20features"/>
    <skos:related rdf:resource="http://www.my.com/#tunnels"/>
    <skos:scopeNote>Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power</skos:scopeNote>
  </skos:Concept>
</rdf:RDF>
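One way to read the slide title: every altLabel is treated as a synonym of the preferred label, so a search expansion built on this record would equate "canals" with "ditches". The sketch below, using only the standard-library ElementTree, pulls the labels out of an abbreviated copy of the record above; the `labels` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

# Abbreviated copy of the record above.
SKOS_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://www.my.com/#canals">
    <skos:prefLabel>canals</skos:prefLabel>
    <skos:altLabel>canal bends</skos:altLabel>
    <skos:altLabel>ditches</skos:altLabel>
    <skos:altLabel>drainage ditches</skos:altLabel>
  </skos:Concept>
</rdf:RDF>"""


def labels(xml_text):
    """Map each concept's prefLabel to its list of altLabels."""
    root = ET.fromstring(xml_text)
    out = {}
    for concept in root.findall("skos:Concept", NS):
        pref = concept.findtext("skos:prefLabel", namespaces=NS)
        out[pref] = [alt.text for alt in concept.findall("skos:altLabel", NS)]
    return out


print(labels(SKOS_XML))
```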

Page 9: The CFR meets the Semantic Web

Data reuse: DrugBank

•Acetaminophen vs. Tylenol: the CFR regulates by generic name

•DrugBank (http://www4.wiwiss.fu-berlin.de/drugbank/)

o http://www.drugbank.ca/

o Offered as Linked Data by Freie Universität Berlin

•DrugBank associates brand names with their components

•We offer component names as suggested search terms in Title 21 [*]
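A minimal sketch of the suggestion step, assuming the brand-to-component associations have already been pulled from DrugBank into a dictionary. The table contents and the `suggest_terms` helper are illustrative, not the LII implementation.

```python
# Hypothetical brand-to-component table; in practice this would be
# populated from the DrugBank Linked Data rather than typed in by hand.
BRAND_COMPONENTS = {
    "tylenol": ["acetaminophen"],
}


def suggest_terms(query):
    """Return generic component names for any brand name in the query."""
    suggestions = []
    for word in query.lower().split():
        for component in BRAND_COMPONENTS.get(word, []):
            if component not in suggestions:
                suggestions.append(component)
    return suggestions


print(suggest_terms("Tylenol labeling"))
```

A user who searches Title 21 for a brand name would then be offered the generic name that the regulations actually use.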

Page 10: The CFR meets the Semantic Web

Can't everything be done with recycled data? Um, no.

•Some datasets suck, or don't exist yet

•Conversion of existing resources is not painless

o Many vocabularies rely on human interpretation

o Many vocabularies are not rigorous enough for SKOS encoding (lotta bad SKOS out there)

Page 11: The CFR meets the Semantic Web

Curation issues for existing datasets

o Appropriateness, coverage, provenance

o Same metadata quality issues as usual

o Many systems of subject terms or identifiers not designed for wide exposure: the "on a horse" problem

o We’re talking about curation of vocabularies and schemas as much as we are about curation of data.

Page 12: The CFR meets the Semantic Web

LII SW features

Page 13: The CFR meets the Semantic Web

extracted vocabularies

•The big idea: enhance CFR search via term expansion, suggestion, etc.

Reuse existing thesauri

Make a CFR-specific vocabulary by discovering how the CFR talks about itself

Use that knowledge to suggest better search terms

•This is not simple phrase or n-gram matching like Google Suggest.

•Rather, we discover how words within the CFR relate to each other and we structure them into a hierarchy of terms (SKOS)
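Once terms sit in a hierarchy, expansion can be as simple as walking the narrower-term links. The following is a sketch under that assumption; the vocabulary fragment and the `expand` helper are made up for illustration.

```python
# Hypothetical fragment of an extracted vocabulary: term -> narrower terms.
NARROWER = {
    "vehicle": ["car", "truck"],
    "truck": ["pickup truck"],
}


def expand(term, narrower=NARROWER):
    """Return the term plus every narrower term reachable from it."""
    seen, stack = [], [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            # reversed() keeps children in their listed order when popped
            stack.extend(reversed(narrower.get(current, [])))
    return seen


print(expand("vehicle"))
```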

Page 14: The CFR meets the Semantic Web

Where do vocabularies come from?

•Input: text elements in the CFR XML

•Extraction and patterns:

o Anaphora resolution (JavaRAP)

o Natural Language Parser (Stanford Parser)

o Hearst patterns

•Output: SKOS (Jena)

Page 15: The CFR meets the Semantic Web

Anaphora resolution

•John spent time in a Turkish prison. He is now the executive director of CALI.

•Núria stole Sara’s chocolate and stuffed her face with it. (but whose face was it?)

•When a sponsor conducting a nonclinical laboratory study intended to be submitted to or reviewed by the Food and Drug Administration utilizes the services of a consulting laboratory, contractor, or grantee to perform an analysis or other service, it shall notify the consulting laboratory, contractor, or grantee that the service is part of a nonclinical laboratory study that must be conducted in compliance with the provisions of this part.

Page 16: The CFR meets the Semantic Web

Stanford Parser

Structured grammar trees & typed dependencies

• Noun modifier: nn(product-10, chemical-9)

• “product skos:narrower chemical_product”

• Conjunctions: conj(doctor-7, practitioner-9)

• “doctor skos:related practitioner”
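The two mappings above can be sketched as a small function over the parser's typed-dependency strings. This assumes the dependencies arrive as text in the `rel(head-N, dependent-M)` form shown on the slide; the regular expression and the `dependency_to_skos` helper are illustrative only.

```python
import re

# Matches typed dependencies such as "nn(product-10, chemical-9)".
DEP_RE = re.compile(r"(\w+)\((\S+?)-\d+,\s*(\S+?)-\d+\)")


def dependency_to_skos(dep):
    """Turn one typed dependency into a (subject, predicate, object) triple,
    or None if the relation is not one of the two handled here."""
    m = DEP_RE.fullmatch(dep.strip())
    if not m:
        return None
    relation, head, dependent = m.groups()
    if relation == "nn":
        # Noun modifier: the compound is narrower than its head noun.
        return (head, "skos:narrower", f"{dependent}_{head}")
    if relation == "conj":
        # Conjunction: the two conjoined nouns are related.
        return (head, "skos:related", dependent)
    return None


print(dependency_to_skos("nn(product-10, chemical-9)"))
print(dependency_to_skos("conj(doctor-7, practitioner-9)"))
```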

Page 17: The CFR meets the Semantic Web

Hearst Patterns

o lexico-syntactic patterns that indicate hypernymic/hyponymic relations.

o NP (,)? (such as | like) (NP,)* (or | and) NP

o Example: All vehicles like cars, trucks, and go-karts

o PS:

o hypernym == word for superset containing term

o hyponym == more specific term
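A deliberately simplified sketch of the pattern as a regular expression. The big assumption is that every noun phrase is a single word (hyphens allowed); a real pipeline would match over parsed NP chunks instead. The `hearst_pairs` helper is hypothetical.

```python
import re

# Simplification: each "NP" is one word, so multi-word phrases are missed.
HEARST_RE = re.compile(
    r"\b(?P<hyper>[\w-]+),? (?:such as|like) "
    r"(?P<list>(?:[\w-]+,? )*(?:or|and) [\w-]+)"
)


def hearst_pairs(sentence):
    """Return (hypernym, hyponym) pairs found by the pattern."""
    m = HEARST_RE.search(sentence)
    if not m:
        return []
    hypernym = m.group("hyper")
    # Split the list on ", ", " and ", " or ", and ", and " / ", or ".
    hyponyms = re.split(r",?\s+(?:and|or)\s+|,\s+", m.group("list"))
    return [(hypernym, h) for h in hyponyms]


print(hearst_pairs("All vehicles like cars, trucks, and go-karts"))
```

Each pair could then be written out as a broader/narrower statement in the vocabulary.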

Page 18: The CFR meets the Semantic Web

principal display panel

parser understands “display” as a verb.

oops.

Page 19: The CFR meets the Semantic Web

Why is this hard?

•Legal text is structurally complicated

o Parser dies on long sentences, leading to incorrect extractions

•Named entities ("Food, Drug, and Cosmetic Act") confuse the parser

o Should be separately extracted/tagged

o Parser should think of them as a single token, but doesn't

o May need authority files for entities and acronyms, etc.

•Corpus is huge (CFR == 96.5 million words)

o Strains memory limits and computational resources

Page 20: The CFR meets the Semantic Web

Definitions: improving search and presentation

•The big idea: find all terms defined by the reg or statute, and do cool stuff with them, for example

o linking terms in text to their definitions

o pushing definitions to the top of results when the term is searched for

o altering presentation so that a (legally) naive user understands the importance of definitions for, e.g., compliance.

•Of course, that also means figuring out what the scope of definitions is.... :(

Page 21: The CFR meets the Semantic Web

Where do the definitions come from?

•Input: heading elements in the CFR XML with the term "definition".

•Using regular expressions, we extract

o Defined term and definition text

o Location of the definition (section of the CFR)

o Scoping information: "For the purposes of this part"

•Output: SKOS/RDF

o defined term --> SKOS Vocabulary
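The extraction step might look roughly like this. The two regular expressions are assumptions about how definitions and scoping phrases are worded ("X means ..." and "For the purposes of this part"), not the patterns actually used; the helper names and the section label are hypothetical.

```python
import re

# Assumed shape: "<Term> means <definition text>" at the start of a paragraph.
DEFINITION_RE = re.compile(
    r"^(?P<term>[A-Z][\w \-]*?) means:? ?(?P<text>.*)$", re.DOTALL
)
# Assumed scoping phrase, e.g. "For the purposes of this part".
SCOPE_RE = re.compile(r"[Ff]or (?:the )?purposes of this (?P<unit>\w+)")


def extract_definition(paragraph, section):
    """Return the defined term, its text and its location, or None."""
    m = DEFINITION_RE.match(paragraph.strip())
    if not m:
        return None
    return {
        "term": m.group("term"),
        "definition": m.group("text").strip(),
        "section": section,
    }


def extract_scope(text):
    """Return the structural unit a definition applies to, if stated."""
    m = SCOPE_RE.search(text)
    return m.group("unit") if m else None


print(extract_definition("Sponsor means: (1) A person who initiates a study;", "sec-A"))
print(extract_scope("For the purposes of this part, the following terms apply."))
```

A pattern this loose also fires on ordinary sentences that happen to contain "means", so its output needs filtering before it is published as SKOS.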

Page 22: The CFR meets the Semantic Web

Definitions: TOOLS

• Python Natural Language Toolkit (NLTK)

• ElementTree, XML parsing library

• Snowball Stemmer Package

• RDFlib, an RDF generation library

Page 23: The CFR meets the Semantic Web
Page 24: The CFR meets the Semantic Web

Why this is hard: finding definitions

o Text containing definition can make it hard to extract.

o Sponsor means:

o (1) A person who initiates and supports, by provision of financial or other resources, a nonclinical laboratory study;

o (2) A person who submits a nonclinical study to the Food and Drug Administration in support of an application for a research or marketing permit

o Pattern identification/inconsistencies in sections that are not explicitly meant to be definitions (or, what does “means” mean?)

Page 25: The CFR meets the Semantic Web

Why this is hard: scoping definitions

o Scoping not stated in text, implicit in structure

o Complex scoping statements:

"The definitions and interpretations contained in section 201 of the act apply to those terms when used in this part."

"Any term not defined in this part shall have the definition set forth in section 102 of the Act (21 U.S.C. 802), except that certain terms used in part 1316 of this chapter are defined at the beginning of each subpart of that part."

Page 26: The CFR meets the Semantic Web

So, what can we do? [*]

Page 27: The CFR meets the Semantic Web

Improvements

o Vocabulary: better extraction and quality

o Definitions: retrieval and completeness

o Obligations: false positives, identification of parts

o Product Codes: semantic matching

Page 28: The CFR meets the Semantic Web

FUTURE WORK

o RDF-ification, refinement, implementation:

Table III, PTOA, Popular Names

Agency structure

o Data management and quality

o Crowdsourcing

Page 29: The CFR meets the Semantic Web

Resources: standards and primers

•RDF:

o Primer: http://www.w3.org/TR/rdf-primer/

o Advantages: http://www.w3.org/RDF/advantages.html

•SKOS

o http://www.w3.org/2004/02/skos/

Page 30: The CFR meets the Semantic Web

More Resources

•Linked Open Data:

o General: http://linkeddata.org/

o Tutorial: http://www4.wiwiss.fu-berlin.de/bizer/pub/linkeddatatutorial/

o Government Data: http://logd.tw.rpi.edu/

•W3C Semantic Web resources:

o http://www.w3.org/standards/semanticweb/

Page 31: The CFR meets the Semantic Web

EVEN MORE Resources: rants and raves

•VoxPop articles on the SW and Law: http://blog.law.cornell.edu/voxpop/category/semantic-web-and-law/

•Mangy dogs: http://liicr.nl/JPcAb2

•Legal Informatics blog: http://legalinformatics.wordpress.com/

•Books on law and the SW: http://liicr.nl/MGRbkA

Page 32: The CFR meets the Semantic Web

Us

•Núria

o @ncasellas

o http://nuriacasellas.blogspot.com

•Tom

o @trbruce

o http://blog.law.cornell.edu/(tbruce | metasausage)