Natural Language Processing Techniques for Managing Legal Resources

Managing Legal Resources on the Semantic Web

European University Institute, Fiesole, Italy

September 11, 2009

Adam Wyner

University College London

[email protected]

Main Point

Legal text expressed in natural language can be automatically annotated with semantic mark ups using natural language processing systems such as the General Architecture for Text Engineering (GATE).

Overview

• Motivation and objectives of NLP in this context.

• General Architecture for Text Engineering (GATE).

• Processing and marking up text.

• Another technology for parsing and semantic interpretation (C&C/Boxer).

• Other approaches.

Motivations

• Annotate large legacy corpora.

• Address growth of corpora.

• Reduce number of human annotators and tedious work.

• Make annotation systematic and automatic.

• Annotate fine-grained information:

• Names, locations, addresses, web links, organisations, actions, argument structures, relations between entities....

• Map from well-drafted documents in NL to RDF/OWL.

Motivations

• Top-down vs. Bottom-up approaches:

• Both do initial (and iterative) analysis of the texts in the target corpora.

• Top-down defines the annotation system, which is applied manually to texts. Knowledge intensive in development and application.

• Bottom-up: the annotation system is “defined” in terms of parsing, lists of basic components, ontologies, and rules to construct complex mark ups from simpler ones. The annotation system is applied to text and outputs annotated text. Knowledge intensive in development.

• Convergent/complementary/integrated approaches.

• Bottom-up reconstructs and implements linguistic knowledge. However, there are limits....

Objectives of NLP

• NLP – automated processing of natural language.

• Generation – convert information in a database into natural language.

• Understanding – convert natural language into a machine-readable form.

• Range of subtasks (focusing on text):

• Segment text (words, phrases, sentences, paragraphs, sections,....).

• Morphological analysis (plural/singular, tense,....).

• Tag each word for part of speech in context (noun, verb, adjective, number,....).

Objectives of NLP

• Range of subtasks:

• Syntactic parsing into phrases/chunks (prepositional, nominal, verbal,....).

• Identify semantic roles (agent, patient,....).

• Entity recognition (organisations, people, places,....).

• Resolve pronominal anaphora and co-reference.

• Address ambiguity.

Objectives of NLP

• NLP useful for:

• Mark up documents in large corpora.

• Automatic mark up to overcome bottleneck.

• Semantic representation for modelling and inference.

• Semantic representation as an “interlanguage” for translation.

• To understand and work with human language capabilities.

Objectives of NLP

• Develop annotations, ontologies, and gold-standard corpora.

• Semantically annotated texts support activities such as:

• Maintenance, presentation, and navigation.

• Information extraction (find patterns -- words or statements -- among documents).

• Translation.

• Query (find all individuals who did a particular action).

• Inference.

Reminder

• Presentations on acquisition of ontologies using NLP.

• Ontology design patterns with natural language “tie-ins”.

• WordNet and FrameNet.

• The analysis cycle:

• Text -> Linguistic Analysis -> Knowledge Extraction -> Structural Content

• Cycle between Linguistic Analysis and Knowledge Extraction to improve the final Structural Content.

• Computational linguistic analysis “layer cake”.

Current State at OPSI, UK

• Office of Public Sector Information, United Kingdom

• Want to develop and leverage public information.

• http://www.opsi.gov.uk/

• The Stationery Office, which have used GATE to develop automated mark up for OPSI, have not (yet) made marked up documents or processes available. Public vs. private development.

• NLP for legislation is not an academic exercise.

• Applications?

The Crown XML Schema for Legislation

Terrorism Act 2000 (1.0)

Terrorism Act 2000 (1.1)

Terrorism Act 2000 (1.2)

Terrorism Act 2000 (2.0)

Terrorism Act 2000 (2.1)

Content in Notices

Not glamorous, but useful.

RuleBurst.

Content in Notices

GATE

• General Architecture for Text Engineering (GATE): an open-source framework that supports plug-in NLP components to process a corpus of text. Is “open” open?

• Where to get it?

• http://gate.ac.uk/

• Components and sequences of processes, each process feeding the next in a “pipeline”.

• Annotated text output.

• Example of a case with screen shots.
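The pipeline idea can be sketched in a few lines of Python. This is only an illustration of stages feeding one another, not GATE's API (GATE itself is a Java framework); the stage names and the dictionary layout are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

# A "document" here is a plain dict: the raw text plus whatever each stage adds.
Doc = Dict[str, Any]
Stage = Callable[[Doc], Doc]


def split_sentences(doc: Doc) -> Doc:
    # Naive sentence segmentation on full stops (illustrative only).
    doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return doc


def tokenise(doc: Doc) -> Doc:
    # Depends on the output of split_sentences.
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc


def tag_capitalised(doc: Doc) -> Doc:
    # Depends on the output of tokenise: a crude stand-in for entity spotting.
    doc["capitalised"] = [t for sent in doc["tokens"] for t in sent if t[0].isupper()]
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Doc:
    # Each stage receives the document produced by the previous stage.
    doc: Doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(
    "Cyndi likes the dog. Bill ran.",
    [split_sentences, tokenise, tag_capitalised],
)
print(result["capitalised"])
```

Reordering the stages breaks the run (tokenise needs sentences first), which is the point of a pipeline: each component relies on the annotations left by the ones before it.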

GATE

References:

• “Building Search Applications: Lucene, LingPipe, and Gate” by Manu Konchady, 2008.

• “Introduction to Linguistic Annotation and Analytics Technologies” by Graham Wilcock, 2009.

GATE

• Language Resources: lexicons, corpora, ontologies.

• Processing Resources: parsers, generators, taggers.

• Visual Resources: visualisation and editing.

• The resources are plug-ins, so they can be added or taken away.

• Document = text + annotations + features

• <Person, gender = “male”>John Smith</Person>

• <Verb, tense = “past”>ran</Verb>
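A minimal Python sketch of “Document = text + annotations + features”. The class and method names are hypothetical and chosen for the example; they are not GATE's classes. The annotations are kept apart from the text (stand-off) and point into it by character offsets.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    start: int                      # offset of the first character
    end: int                        # offset just past the last character
    type: str                       # e.g. "Person", "Verb"
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, phrase: str, type_: str, **features: str) -> Annotation:
        # Annotate the first occurrence of `phrase` (enough for a small example).
        start = self.text.index(phrase)
        ann = Annotation(start, start + len(phrase), type_, dict(features))
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann: Annotation) -> str:
        return self.text[ann.start:ann.end]


doc = Document("John Smith ran")
person = doc.annotate("John Smith", "Person", gender="male")
verb = doc.annotate("ran", "Verb", tense="past")
print(person, doc.covered_text(verb))
```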

GATE

• Computational linguistic analysis “layer cake”:

• Sentence segmentation

• Tokenisation (words identified by spaces between them).

• Morphological analysis (singular/plural, tense, nominalisation, ..., range of parts of speech such as noun, verb, adjective, ...).

• Part of speech tagging (noun or verb given other words nearby).

• Shallow syntactic parsing/chunking (noun phrase, verb phrase, subordinate clause, ...).

• Dependency analysis (subordinate clauses, pronominal anaphora,...).

• Pattern matching and rule application.

GATE

• Lists:

• List of verbs: like, run, jump, ....

• List of common nouns: dog, cat, hamburger, ....

• List of proper names: Cyndi, Bill, Lisa, ....

• List of determiners: the, a, two, ....

• Rules:

• (Determiner + Common Noun) | Proper Name => Noun Phrase

• Verb + Noun Phrase => Verb Phrase

• Noun Phrase + Verb Phrase => Sentence

• Output:

• [s [np Cyndi] [vp [v likes] [np [det the] [cn dog]]]].
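The lists and rules above are small enough to run directly. The sketch below is a toy recogniser written for this example (not how GATE implements parsing); it assumes one extra step the slide leaves implicit, namely that “likes” is matched to the listed verb “like” by stripping a final “-s”.

```python
VERBS = {"like", "run", "jump"}
COMMON_NOUNS = {"dog", "cat", "hamburger"}
PROPER_NAMES = {"Cyndi", "Bill", "Lisa"}
DETERMINERS = {"the", "a", "two"}


def is_verb(word):
    # Crude morphology (an assumption): accept the listed form or listed form + "s".
    return word in VERBS or (word.endswith("s") and word[:-1] in VERBS)


def parse_np(tokens, i):
    # (Determiner + Common Noun) | Proper Name => Noun Phrase
    if i < len(tokens) and tokens[i] in PROPER_NAMES:
        return f"[np {tokens[i]}]", i + 1
    if (i + 1 < len(tokens) and tokens[i] in DETERMINERS
            and tokens[i + 1] in COMMON_NOUNS):
        return f"[np [det {tokens[i]}] [cn {tokens[i + 1]}]]", i + 2
    return None


def parse_vp(tokens, i):
    # Verb + Noun Phrase => Verb Phrase
    if i < len(tokens) and is_verb(tokens[i]):
        np = parse_np(tokens, i + 1)
        if np:
            return f"[vp [v {tokens[i]}] {np[0]}]", np[1]
    return None


def parse_sentence(text):
    # Noun Phrase + Verb Phrase => Sentence (must consume every token)
    tokens = text.split()
    np = parse_np(tokens, 0)
    if not np:
        return None
    vp = parse_vp(tokens, np[1])
    if vp and vp[1] == len(tokens):
        return f"[s {np[0]} {vp[0]}]"
    return None


print(parse_sentence("Cyndi likes the dog"))
```

A string the rules do not cover, such as “dog likes”, returns None: the lists and rules define exactly what can be annotated.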

GATE Offset

Annotations are tokens (offsets of text from start space to end space), along with type/features, which have a name or value.
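As a rough illustration of offset-based annotation, the Python sketch below records each whitespace-delimited token with its start and end character offsets plus a small feature map. The dictionary layout and feature names are assumptions for the example, not GATE's internal format.

```python
import re


def tokens_with_offsets(text):
    # One annotation per run of non-space characters; offsets follow Python
    # slicing, so text[start:end] gives back the token string.
    return [
        {
            "start": m.start(),
            "end": m.end(),
            "type": "Token",
            "features": {"string": m.group(), "length": len(m.group())},
        }
        for m in re.finditer(r"\S+", text)
    ]


text = "Bill ran home"
toks = tokens_with_offsets(text)
for t in toks:
    print(t["start"], t["end"], t["features"]["string"])
```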

GATE Annotations

Partial. Missing namespace and type needed for full definition.

GATE

Construction:

From smaller units, compose larger, derivative units.

Gazetteers:

Lists of words (or abbreviations) that fit an annotation: first names, street locations, organizations....

JAPE (Java Annotation Patterns Engine):

Build other annotations out of previously given/defined annotations. Use this where the mark up is not given by a gazetteer. Rules have a syntax.
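A gazetteer lookup can be sketched as plain list matching. The Python below is an illustration only: the list contents are made up for the example, and GATE's own gazetteer files and matching options are not reproduced here.

```python
import re

# Hypothetical gazetteer lists, keyed by the annotation type they produce.
GAZETTEER = {
    "FirstName": ["Adam", "Cyndi", "Bill"],
    "Organisation": ["University College London", "European University Institute"],
}


def gazetteer_lookup(text, gazetteer):
    # Return (start, end, type, matched string) for every whole-word match.
    found = []
    for ann_type, entries in gazetteer.items():
        for entry in entries:
            pattern = r"\b" + re.escape(entry) + r"\b"
            for m in re.finditer(pattern, text):
                found.append((m.start(), m.end(), ann_type, m.group()))
    return sorted(found)


text = "Adam works at University College London."
hits = gazetteer_lookup(text, GAZETTEER)
for hit in hits:
    print(hit)
```

Matching is exact, so a spelling the lists do not contain is simply not annotated.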

GATE Gazetteers

GATE Organisation Gazetteer

GATE JAPE

JAPE idea (here with mark up, but could be some feature).

<FirstName>aaaa</FirstName><LastName>bbbb</LastName>

=> <WholeName><FirstName>aaaa</FirstName><LastName>bbbb</LastName></WholeName>

FirstName and LastName we get from the Gazetteer. WholeName we construct using the rule.

For complex constructions, must have a range of alternatives.
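The same idea in Python, as a sketch of what such a rule does rather than JAPE syntax: given FirstName and LastName annotations (assumed here to come from a gazetteer), add a WholeName annotation spanning both when only whitespace separates them. The function name and annotation layout are hypothetical.

```python
def combine_names(annotations, text):
    # Rule: FirstName followed by LastName, with only whitespace between
    # them, yields a WholeName covering both. Existing annotations are kept.
    ordered = sorted(annotations, key=lambda a: a["start"])
    result = list(ordered)
    for left, right in zip(ordered, ordered[1:]):
        gap = text[left["end"]:right["start"]]
        if (left["type"] == "FirstName" and right["type"] == "LastName"
                and gap.strip() == ""):
            result.append(
                {"start": left["start"], "end": right["end"], "type": "WholeName"}
            )
    return result


text = "Judge John Smith presided."
anns = [
    {"start": 6, "end": 10, "type": "FirstName"},   # "John"
    {"start": 11, "end": 16, "type": "LastName"},   # "Smith"
]
combined = combine_names(anns, text)
whole = [a for a in combined if a["type"] == "WholeName"]
print(text[whole[0]["start"]:whole[0]["end"]])
```

Covering patterns such as titles, initials, or “LastName, FirstName” would need further alternatives of the same kind.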

GATE Example

Organisations and Quotations. Case references.

GATE XML

Other GATE Components

• Develop an ontology, import it into GATE, then mark up elements manually.

• Use the ontology in writing the JAPE rules.

• Plug in other parsers, create gazetteers, work with other languages....

• Machine learning component.

• Have not discussed mark up for metadata, structure, or presentation (see de Maat, Winkels, and van Engers).

• Work to develop gazetteers and JAPE rules.

GATE – Problems and Issues

• Any difference in the characters of the basic text or in annotations is an absolute difference.

• theatre and theater are different strings for entities. Variants in Gazetteers.

• Organisation and Organization are different annotations.

• Output in XML is possible, but GATE mark up allows overlapping tags, which are barred in standard XML. Must rework GATE XML with XSLT to make it standard XML.

• Accuracy is not 100% for a variety of reasons, but it can be 80-95%.
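To see why overlapping annotations cannot be written directly as inline XML tags, the Python sketch below finds pairs of annotations that cross: they share some text, but neither contains the other. Nested annotations are fine for XML; crossing ones are not. The tuple layout (start, end, type) and the sample offsets are assumptions for the example.

```python
def crossing_pairs(annotations):
    # annotations: iterable of (start, end, type) tuples.
    # Returns pairs that partially overlap, i.e. overlap without nesting.
    pairs = []
    anns = sorted(annotations)
    for i, a in enumerate(anns):
        for b in anns[i + 1:]:
            overlaps = b[0] < a[1] and a[0] < b[1]
            nested = (a[0] <= b[0] and b[1] <= a[1]) or (b[0] <= a[0] and a[1] <= b[1])
            if overlaps and not nested:
                pairs.append((a, b))
    return pairs


sample = [
    (0, 10, "Quotation"),
    (5, 15, "Organisation"),   # starts inside the quotation, ends outside it
    (0, 4, "Token"),           # fully inside the quotation: nested, so fine
]
print(crossing_pairs(sample))
```

Any pair this returns has to be resolved (split, dropped, or kept as stand-off mark up) before well-formed inline XML can be produced.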

C&C/Boxer – Motivations and Objectives

• Fine-grained syntactic parsing – can identify not only parts of speech, but grammatical roles (subject, object) and phrases (e.g. verb plus direct object is verb phrase).

• Contributes to NL to RDF/OWL translation – individual entities, data and object properties?

• Input to semantic interpretation in FOL – test for consistency, support inference, allow rule extraction.

C&C/Boxer

• C&C is a parser based on Combinatory Categorial Grammar (CCG).

• Boxer provides a semantic interpretation, given the parse. The semantic interpretation is a form of first-order logic: Discourse Representation Theory.

• Needs some manipulation. The parser outputs the “best” parse, but that might not be what one wants; the semantic representation might need to be selected.

• Try it out at:

• http://svn.ask.it.usyd.edu.au/trac/candc

• Various representations – C&C, Graphic, XML Parse, Prolog.

C&C/Boxer

∀x [man′(x) → happy′(x)]

If Bill is rich and healthy, then he is happy.
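One plausible first-order reading of this sentence, assuming the pronoun “he” is resolved to Bill, is given below. It is a hand-written gloss for orientation, not Boxer's actual output, and the predicate and constant names are chosen for the example.

```latex
\bigl(\mathit{rich}(\mathit{bill}) \land \mathit{healthy}(\mathit{bill})\bigr) \rightarrow \mathit{happy}(\mathit{bill})
```

The anaphora resolution step matters: without linking “he” to Bill, the consequent would be about an unidentified individual.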

A More Complex Example

A person commits an offence if he invites another to provide money or other property and intends that it should be used, or has reasonable cause to suspect that it may be used, for the purposes of terrorism.

From UK “Terrorism Act 2000, Interpretation, Terrorist Property” (Partial parse image).

Other Topics

• Controlled Languages

• An expressive subset of grammatical constructions and lexicon.

• Guided input, so only well-formed, unambiguous expressions are accepted.

• Translation to FOL.

• Machine Learning

• Annotating a set of documents to make a “gold standard”.

• Train the system on the gold standard and unannotated documents.

• Test accuracy and adjust.

• No information on how the algorithm works.
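The “test accuracy” step can be made concrete by comparing system annotations with the gold standard. The Python sketch below uses precision, recall, and F1 over exact matches, which is a common choice but an assumption here; the annotation tuples are invented for the example.

```python
def evaluate(gold, predicted):
    # Exact-match scoring: an annotation counts as correct only if its
    # offsets and type both agree with a gold-standard annotation.
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {
    (0, 4, "Person"),
    (14, 39, "Organisation"),
    (45, 51, "Location"),
    (60, 64, "Person"),
}
predicted = {
    (0, 4, "Person"),
    (14, 39, "Organisation"),
    (45, 51, "Organisation"),   # right span, wrong type: counted as an error
}
scores = evaluate(gold, predicted)
print(scores)
```

Exact matching is strict; a more lenient scheme could give partial credit for overlapping spans.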

Conclusion

• Different approaches to mark up.

• Burdens of initial analysis, coding, and labour.

• Top-down is far ahead of bottom-up, but this is a matter of focus of research effort.

• Converging, complementary, integrated approaches.

• Potential to enrich annotations further for finer-grained information.