Introduction to ANNIE. Diana Maynard, University of Sheffield, March 2004. http://gate.ac.uk/ http://nlp.shef.ac.uk/

Page 1: Introduction to ANNIE

Introduction to ANNIE

Diana Maynard, University of Sheffield

March 2004

http://gate.ac.uk/ http://nlp.shef.ac.uk/

Page 2: Introduction to ANNIE

What is ANNIE?

• ANNIE is a vanilla information extraction system comprising a set of core processing resources (PRs):
  – Tokeniser
  – Sentence Splitter
  – POS tagger
  – Gazetteers
  – Semantic tagger (JAPE transducer)
  – Orthomatcher (orthographic coreference)

Page 3: Introduction to ANNIE

ANNIE Pipeline

Page 4: Introduction to ANNIE

Other Processing Resources

• There are also many additional processing resources which are not part of ANNIE itself but which come with the default installation of GATE:
  – Gazetteer collector
  – PRs for Machine Learning
  – Various exporters
  – Annotation set transfer
  – etc.

Page 5: Introduction to ANNIE

Creating a new application from ANNIE

• Typically a new application will use most of the core components from ANNIE

• The tokeniser, sentence splitter and orthomatcher are largely language-, domain- and application-independent

• The POS tagger is language-dependent but domain- and application-independent

• The gazetteer lists and JAPE grammars may act as a starting point but will almost certainly need to be modified

• You may also require additional PRs (either existing or new ones)

Page 6: Introduction to ANNIE

Modifying gazetteers

• Gazetteers are plain text files containing lists of names

• Each gazetteer set has an index file listing all the lists, plus features of each list (majorType, minorType and language)

• Lists can be modified either internally using Gaze, or externally in your favourite editor

• Gazetteers can also be mapped to ontologies

• To use Gaze and the ontology editor, you need to download the relevant creole files

Page 7: Introduction to ANNIE

JAPE grammars

• A semantic tagger consists of a set of rule-based JAPE grammars run sequentially

• JAPE is a pattern-matching language

• The LHS of each rule contains patterns to be matched

• The RHS contains details of annotations (and optionally features) to be created

• More complex rules can also be created

Page 8: Introduction to ANNIE

Input specifications

• The head of each grammar phase needs to contain certain information:
  – Phase name
  – Inputs
  – Matching style

e.g.

Phase: location
Input: Token Lookup Number
Control: appelt

Page 9: Introduction to ANNIE

Matching algorithms and Rule Priority

• 3 styles of matching:
  – Brill (fire every rule that applies)
  – First (shortest rule fires)
  – Appelt (use of priorities)

• Appelt priority is applied in the following order:
  – Starting point of a pattern
  – Longest pattern
  – Explicit priority (default = -1)
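The ordering above can be illustrated with a small sketch in which two rules compete for the same text. This is not taken from ANNIE: the phase name, rule names and the Candidate annotation are hypothetical, the header uses the `Options: control = appelt` form found in the GATE user guide, and `Lookup.majorType == location` assumes a gazetteer list whose majorType is location.

```jape
// Hypothetical phase: two rules that can match the same Lookup annotation.
Phase: PriorityDemo
Input: Lookup
Options: control = appelt

// Matches any Lookup at all.
Rule: GenericLookup
Priority: 10
(
  {Lookup}
):match
-->
:match.Candidate = { rule = "GenericLookup" }

// Matches only Lookups from location lists (assumed majorType value).
Rule: LocationLookup
Priority: 50
(
  {Lookup.majorType == location}
):match
-->
:match.Candidate = { rule = "LocationLookup" }
```

On a location Lookup both rules start at the same point and match the same length, so the first two criteria tie and the explicit priority decides: LocationLookup, with the higher value, is the one that fires. On any other Lookup only GenericLookup matches.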

Page 10: Introduction to ANNIE

NE Rule in JAPE

Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+   // from tokeniser
  {Lookup.kind == companyDesignator}         // from gazetteer lists
):match
-->
:match.NamedEntity = { kind=company, rule="Company1" }

Page 11: Introduction to ANNIE

LHS of the rule

• LHS is expressed in terms of existing annotations, and optionally features and their values

• Any annotation to be used must be included in the input header

• Any annotation not included in the input header will be ignored (e.g. whitespace)

• Each annotation is enclosed in curly braces

• Each pattern to be matched is enclosed in round brackets and has a label attached

Page 12: Introduction to ANNIE

Macros

• Macros look like the LHS of a rule but have no label

Macro: NUMBER
(({Digit})+)

• They are used in rules by enclosing the macro name in round brackets

( (NUMBER)+):match

• It is conventional to name macros in uppercase letters

• Macros hold across an entire set of grammar phases
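As a sketch of how a macro and a rule fit together in one grammar file, the example below defines a macro and then reuses it inside a labelled pattern. The phase name, the Amount annotation and the "million" token are hypothetical, and `Token.kind == number` assumes Token annotations that carry a kind feature, as produced by the ANNIE tokeniser.

```jape
// Hypothetical phase showing a macro reused inside a rule.
Phase: MacroDemo
Input: Token
Options: control = appelt

// A macro looks like an LHS pattern but carries no label.
Macro: NUMBER
(
  ({Token.kind == number})+
)

// The macro is used by writing its name in round brackets.
Rule: Amount
(
  (NUMBER)
  {Token.string == "million"}
):match
-->
:match.Amount = { rule = "Amount" }
```

Here the label `:match` is attached to the whole bracketed pattern, so the new Amount annotation would span both the number tokens and the following word.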

Page 13: Introduction to ANNIE

Contextual information

• Contextual information can be specified in the same way, but has no label

• Contextual information will be consumed by the rule

({Annotation1})

({Annotation2}):match

({Annotation3})
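A concrete version of this layout might look like the sketch below, where only the middle pattern is labelled and the preceding token acts as left context. The phase name, rule name and Location annotation are hypothetical, and `Lookup.minorType == city` assumes a gazetteer list whose minorType is city.

```jape
// Hypothetical phase: annotate a city name only when it follows "in".
Phase: ContextDemo
Input: Token Lookup
Options: control = appelt

Rule: CityAfterIn
(
  {Token.string == "in"}        // left context: matched but not labelled
)
(
  {Lookup.minorType == city}    // only this part is labelled
):match
-->
:match.Location = { rule = "CityAfterIn" }
```

The new Location annotation covers only the labelled city Lookup, but the word "in" is still consumed by the match, so another rule in the same phase cannot start on it.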

Page 14: Introduction to ANNIE

RHS of the rule

• LHS and RHS are separated by -->

• The label matches that on the LHS

• The annotation to be created follows the label

({Annotation1}):match
-->
:match.NE = {feature1 = value1, feature2 = value2}

Page 15: Introduction to ANNIE

Using phases

• Grammars usually consist of several phases, run sequentially

• Under Appelt-style matching, only one rule within a single phase can fire on a given piece of text

• Temporary annotations may be created in early phases and used as input for later phases

• Annotations from earlier phases may need to be combined or modified

• A definition phase (conventionally called main.jape) lists the phases to be used, in order

• Only the definition phase needs to be loaded
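A definition phase of the kind described above might look like the following sketch. The `MultiPhase` and `Phases` layout follows the main.jape shipped with ANNIE; the grammar name and the three phase names are hypothetical, and each phase name is assumed to correspond to a grammar file of the same name (for example first.jape) in the same directory.

```jape
// main.jape: definition phase (all names below are hypothetical)
MultiPhase: ExampleGrammar
Phases:
  first
  name
  clean
```

With this file loaded into a JAPE transducer, the listed phases would run in the order given, so temporary annotations created in first are available as input to name and clean.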

Page 16: Introduction to ANNIE

More complex JAPE rules

• Any Java code can be used on the RHS of a rule

• This is useful for e.g. feature percolation, ontology population, accessing information not readily available, comparing feature values, deleting existing annotations etc.

• There are examples of these in the user guide and in the ANNIE NE grammars

• Most JAPE rules end up being complex!
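As one small illustration of Java on the RHS, the sketch below deletes a temporary annotation once it is no longer needed. TempLocation, the phase name and the rule name are hypothetical, and the code assumes that the `bindings` map and the `inputAS` annotation set are available on the RHS, as they are in current GATE releases.

```jape
// Hypothetical clean-up phase: remove temporary annotations.
Phase: Cleanup
Input: TempLocation
Options: control = appelt

// TempLocation is assumed to have been created by an earlier phase.
Rule: RemoveTemp
(
  {TempLocation}
):temp
-->
{
  // The binding for the label "temp" holds the matched annotations.
  gate.AnnotationSet tempAS = (gate.AnnotationSet) bindings.get("temp");
  // Delete them from the input set so later phases no longer see them.
  inputAS.removeAll(tempAS);
}
```

A phase like this would normally be listed last in the definition phase, after every phase that still relies on the temporary annotations.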

Page 17: Introduction to ANNIE

Using JAPE for other tasks

• JAPE grammars are not just useful for NE annotation

• They can be a quick and easy way of performing any kind of task where patterns can be easily recognised and a finite-state approach is possible, e.g. transforming one style of markup into another, deriving features for the learning algorithms

Page 18: Introduction to ANNIE

Example rule for deriving features

Rule: Entity
(
  {Gpe} |
  {Organization} |
  {Person} |
  {Location} |
  {Facility}
):entity
-->
{
  // Take the first annotation bound to the "entity" label
  gate.AnnotationSet entityAS = (gate.AnnotationSet)bindings.get("entity");
  gate.Annotation entityAnn = (gate.Annotation)entityAS.iterator().next();
  // Record its original type as a feature on a new Entity annotation
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("type", entityAnn.getType());
  outputAS.add(entityAnn.getStartNode(), entityAnn.getEndNode(), "Entity", features);
}

Page 19: Introduction to ANNIE

Finding Examples

• ANNIE for default NE rules: gate/src/gate/resources/creole/NEtransducer/NE/

• MUSE for more complex NE rules: muse/src/muse/resources/grammar/main

• h-TechSight for ontology population: htechsight/application/grammar

• Various other applications generally follow the format: projectname/application/grammar/

Page 20: Introduction to ANNIE

Conclusion

This talk: http://gate.ac.uk/sale/talks/annie-tutorial.ppt

More information: http://gate.ac.uk/