Supporting the authoring process with linguistic software


Melanie Siegel

The authoring process – and where it needs support

challenges for correctness
• time pressure
• non-native writing
• not enough capacity for careful proofreading

automatic support possibilities
• spell checking
• grammar checking

The authoring process – and where it needs support

challenges for understandability and readability
• authors are experts of the subject and the language – users often are not

automatic support possibilities
• style checking

The authoring process – and where it needs support

challenges for consistency and corporate wording
• guidelines for corporate wording exist – in a large document on the shelf
• terminology lists exist – in an Excel sheet somewhere in the file system
• distributed writing

automatic support possibilities
• terminology checking
• sentence clustering

The authoring process – and where it needs support

challenges for translatability
• authors write without having the translation process in mind
• lexical, syntactic and semantic ambiguity
• translation costs depend on translation memory matches

automatic support possibilities
• style checking
• terminology checking

NLP that is needed for authoring support

tokenization → POS tagging → morphology → dictionary → error dictionary

tokenization

• Close the door of our XYZ car.
  capital word – lower word – dot_EOS; tokens separated by spaces

• 花子が本を読んだ。 ("Hanako read a book.")
  花子 が 本 を 読ん だ 。
  Kanji – Hiragana – dot_EOS

based on rules and lists of abbreviations
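A minimal sketch of this kind of rule-based tokenization, assuming a small hypothetical abbreviation list (not the talk's actual resources): whitespace splits tokens, and a final dot is cut off as a dot_EOS token unless the word is a known abbreviation.

```python
# Rule-based tokenizer sketch: split on whitespace, then peel off a
# sentence-final dot unless the word is on the abbreviation list.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr."}  # hypothetical list

def tokenize(text: str) -> list[str]:
    tokens = []
    for word in text.split():
        if word.endswith(".") and word not in ABBREVIATIONS:
            tokens.append(word[:-1])
            tokens.append(".")  # dot_EOS
        else:
            tokens.append(word)
    return tokens

print(tokenize("Close the door of our XYZ car."))
# ['Close', 'the', 'door', 'of', 'our', 'XYZ', 'car', '.']
```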

POS tagging

• Close the door of our XYZ car.
  Close/V the/DET door/N of/PREP our/PRON XYZ/NE car/N

XML and attribute-value structures
based on statistical methods, large dictionaries
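For illustration, one off-the-shelf statistical tagger, not necessarily the one behind the talk's examples, can be called like this; NLTK's pretrained averaged-perceptron model is assumed to be installed.

```python
import nltk

# Statistical POS tagging with NLTK's pretrained tagger.
# Requires once: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("Close the door of our XYZ car.")
print(nltk.pos_tag(tokens))
# e.g. [('Close', 'VB'), ('the', 'DT'), ('door', 'NN'), ('of', 'IN'),
#       ('our', 'PRP$'), ('XYZ', 'NNP'), ('car', 'NN'), ('.', '.')]
```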

morphology

• Close the door of our XYZ car.
  Close → Lemma: close, Tense: present_imp, Person: third, Number: singular
  car → Lemma: car, Number: singular, Case: nominative_accusative

based on dictionaries, rules for inflection and derivation
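A toy sketch of dictionary-plus-rules morphology: look the surface form up in a full-form lexicon first, then fall back to simple suffix-stripping rules for regular inflection. LEXICON and SUFFIX_RULES are illustrative stand-ins, not the talk's actual resources.

```python
# Full-form lexicon lookup with a rule-based fallback for regular nouns.
LEXICON = {
    "car":  {"lemma": "car", "number": "singular"},
    "cars": {"lemma": "car", "number": "plural"},
}
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", "")]  # plural-noun rules

def analyze(word: str) -> dict:
    if word in LEXICON:
        return LEXICON[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return {"lemma": word[: -len(suffix)] + replacement,
                    "number": "plural", "guessed": True}
    return {"lemma": word, "guessed": True}

print(analyze("doors"))
# {'lemma': 'door', 'number': 'plural', 'guessed': True}
```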

dictionary

• words unknown to the standard NLP system

http://wiki.openoffice.org/wiki/Documentation/

spelling

language analysis:
• words are defined in a dictionary
• anything not in the dictionary is an error
• high recall, low precision (depending on the domain)

error analysis:
• errors are defined; unknown words that are not defined as errors are term candidates
• based on words and rules
• considers terminology
• high precision, recall is dependent on data work

error dictionary

• stylesheet → style sheet
• begginning → beginning
• beleive → believe
• definately → definitely
• gotta → have to
• hided → hid|hidden|hides
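A small sketch combining the two approaches above: a word list flags unknown words (language analysis, high recall), while an error dictionary pinpoints known misspellings and offers suggestions (error analysis, high precision). Both resources here are toy examples, not real spell-checker data.

```python
# Dictionary-based spell checking plus an error dictionary with suggestions.
DICTIONARY = {"the", "application", "can", "not", "start", "style", "sheet"}
ERROR_DICT = {
    "stylesheet": ["style sheet"],
    "begginning": ["beginning"],
    "definately": ["definitely"],
    "hided": ["hid", "hidden", "hides"],
}

def check_spelling(tokens):
    for tok in tokens:
        low = tok.lower()
        if low in ERROR_DICT:            # known error: high precision, suggest
            yield tok, ERROR_DICT[low]
        elif low not in DICTIONARY:      # unknown: error or term candidate
            yield tok, []

for word, suggestions in check_spelling(["The", "stylesheet", "definately", "XYZ"]):
    print(word, "->", suggestions or "unknown word / term candidate")
```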

why work on terminology?

• avoid false alarms in spelling
• consistency
• less ambiguity
• translatability
• corporate wording

ultimate goal: 1 term – 1 meaning – 1 translation

reality: variants

• web server – web-server
• upload protection – upload-protection
• timeout – time out
• Reset – ReSet
• sub station – sub-station

term variants

– orthographic variants (hyphen, blank, case): term bank, termbank
– semi-orthographic variants (number: 6-digit, six-digit; trademark: MyCompany™, MyCompany)
– syntactic variants (preposition: oil level, level of oil; gerund/noun: call center, calling center)
– “classical” synonyms: vehicle, car
– language-specific variants (e.g. Fugenelemente in German, Katakana in Japanese)

how to get consistent terminology

• author/company defines the term bank
• list of deprecated terms
  deprecated term: vehicle → approved term: car
• list of approved terms
• automatic identification of variants (sketched below)
  approved term: SWASSNet User
  deprecated terms: SWASSNet user, SWASS-Net User
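A sketch of automatic variant identification against a term bank, assuming a toy APPROVED set: hyphens, blanks, and case are normalized away so that orthographic variants map back to the approved writing.

```python
import re

# Normalize hyphen/blank/case so orthographic variants collapse together.
APPROVED = {"SWASSNet User", "web server", "timeout"}  # toy term bank

def normalize(term: str) -> str:
    return re.sub(r"[-\s]+", "", term).lower()

NORM_TO_APPROVED = {normalize(t): t for t in APPROVED}

def check_term(candidate: str) -> str:
    approved = NORM_TO_APPROVED.get(normalize(candidate))
    if approved and approved != candidate:
        return f"deprecated variant: use '{approved}'"
    return "ok" if approved else "not in term bank"

print(check_term("SWASS-Net user"))  # deprecated variant: use 'SWASSNet User'
print(check_term("web-server"))      # deprecated variant: use 'web server'
```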

terminology and spelling

NLP for terminology

• NLP methods for term extraction
  – corpus analysis (morphology, POS, NER)
  – information extraction (potential product names)
  – ontologies (e.g. semantic groups)
• NLP methods for setting up a term database
  – morphology (finding the base form)
  – POS
• NLP methods for term checking
  – variants
  – similar words
  – inflection
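As a small illustration of the corpus-analysis side above, a POS-pattern extractor that collects adjective/noun sequences ending in a noun as term candidates; the Penn-Treebank-style tagged input is hard-coded and hypothetical.

```python
# Collect JJ/NN sequences that end in a noun as term candidates.
def extract_candidates(tagged_tokens):
    candidates, current = [], []
    for word, tag in tagged_tokens:
        if tag.startswith("NN") or tag.startswith("JJ"):
            current.append((word, tag))
        else:
            if current and current[-1][1].startswith("NN"):
                candidates.append(" ".join(w for w, _ in current))
            current = []
    if current and current[-1][1].startswith("NN"):
        candidates.append(" ".join(w for w, _ in current))
    return candidates

tagged = [("the", "DT"), ("upload", "NN"), ("protection", "NN"),
          ("of", "IN"), ("the", "DT"), ("web", "NN"), ("server", "NN")]
print(extract_candidates(tagged))  # ['upload protection', 'web server']
```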

approaches to grammar checking

descriptive grammar

• definition of correct grammar, e.g. HPSG, LFG, chunk grammar, statistical grammars
• anything that's not analyzable must be a grammar error
• preconditions:
  – grammar with large coverage
  – large dictionaries
  – robust, but not too robust parsing
  – efficient parsing methods
• high recall, low precision

error grammar

• implementation of grammar errors
• preconditions:
  – work with error corpora
  – error grammar with a high number of error types
• "depth" of analysis varies with the type of error to be described
• high precision, recall is based on the number of rules

grammar rules, examples

• subject-verb agreement:
  – Check if instructions are programmed in such a way that a scan never finish.
  – When the operations is completed, the return to home completes.


• a/an distinction:
  – a isolating transformer
  – an program
• wrong verb form:
  – it cannot communicates with them
  – IP can be automatically get

example grammar rule*: write_words_together

@can ::= [ TOK "^(can)$"
           MORPH.READING.MCAT "^Verb$" ];

examples:
– The application can not start.
– The application can tomorrow not start.

TRIGGER(80) == @can^1 [@adv]* 'not'^2
-> ($can, $not)
-> { mark: $can, $not;
     suggest: $can -> '', $not -> 'cannot'; }

negative evidence:
– Branch circuits can not only minimize system damage but can interrupt the flow of fault current.

NEG_EV(40) == $can 'not' 'only' @verbInf []* 'but';


* implemented in Acrolinx
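Since the Acrolinx rule language is proprietary, here is a rough Python analogue of the trigger/negative-evidence mechanism in the rule above: "can ... not" triggers a "cannot" suggestion unless the "can not only ... but" evidence pattern fires. The regexes are a simplification of the original token patterns.

```python
import re

# Trigger fires on "can [words] not"; negative evidence suppresses it.
TRIGGER = re.compile(r"\bcan(\s+\w+)*?\s+not\b", re.IGNORECASE)
NEG_EV  = re.compile(r"\bcan\s+not\s+only\b.*\bbut\b", re.IGNORECASE)

def check_can_not(sentence: str) -> str:
    if TRIGGER.search(sentence) and not NEG_EV.search(sentence):
        return "mark 'can ... not'; suggest 'cannot'"
    return "ok"

print(check_can_not("The application can not start."))
print(check_can_not("Branch circuits can not only minimize system "
                    "damage but can interrupt the flow of fault current."))
```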

style – controlled language

• controlled languages
• AeroSpace and Defence Industries Association of Europe (ASD): ASD-STE100 (Simplified Technical English)
• Caterpillar Technical English (CTE)
• disadvantages:
  – very restrictive
  – low acceptance among users

style – moderately controlled language

• rules define errors (like grammar rules)
• rules (and instructional information) are defined by authors
• implementation in authoring support systems
• high acceptance
• good usability

style guidelines

• different for different usages
  – text type (e.g., press release – technical documentation)
  – domain (e.g., software – machines)
  – readers (e.g., end users – service personnel)
  – authors (e.g., Germans tend to write long sentences)

style rule examples*: best practice

• avoid_latin_expressions
• avoid_modal_verbs
• avoid_passive
• avoid_split_infinitives
• avoid_subjunctive
• use_serial_comma
• use_comma_after_introductory_phrase
• spell_out_numerals

* style rules implemented in Acrolinx

style rule examples: company

• use_units_consistently
• abbreviate_currency
• COMPANY_trademark
• do_not_refer_to_COMPANY_intranet
• add_tag_to_UI_string
• avoid_trademark_as_noun
• avoid_articles_in_title

style rule examples: MT pre-editing

• avoid_nested_sentences
• avoid_ing_words
• keep_two_verb_parts_together
• avoid_parenthetical_expressions

dependent on the MT system and language pair

automatic suggestions for style rules

– replacement of words or phrases
– replacement with the correct uppercase or lowercase writing
– replacement of words with the correct inflection
– generation of whole sentences (e.g. passive → active) requires semantic analysis and generation and is therefore not (yet) possible
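A minimal sketch of the first three replacement types from the list above, with toy lookup tables standing in for real phrase, casing, and inflection resources.

```python
# Suggestion generation: phrase replacement, casing fix, inflection fix.
PHRASES    = {"gotta": "have to"}   # toy phrase replacements
CASING     = {"reset": "Reset"}     # toy corporate casing
INFLECTION = {"hided": "hid"}       # toy inflection corrections

def suggest(token: str):
    for table in (PHRASES, INFLECTION):
        if token.lower() in table:
            return table[token.lower()]
    if token.lower() in CASING and token != CASING[token.lower()]:
        return CASING[token.lower()]
    return None

for tok in ["gotta", "ReSet", "hided"]:
    print(tok, "->", suggest(tok))
# gotta -> have to / ReSet -> Reset / hided -> hid
```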

example style rule*: avoid_future_tense

/* Example: ".. It will be necessary .." */

TRIGGER(80) == @will^1 [-@comma]* @verbInf^2
-> ($will, $verbInf)
-> { mark: $will, $verbInf; }

/* Example: ".. The router services will be offered in the future .." */

NEG_EV(40) == $will []* @in @det @time;

* implemented in Acrolinx

consistent phrasing

• Use the same phrase for the same meaning.
• Examples:
  – Congratulations on acquiring your new wearable digital audio player
  – Congratulations, you have acquired your new wearable digital audio player!
  – Dear Customer, congratulations on purchasing the new wearable digital audio player!

Acrolinx server: Terminology, Intelligent Reuse, Grammar & Spelling, Writing Standards

Acrolinx intelligent reuse™

Content / Translation repository → micro-clustering → clusters → redundancy and quality filters → review and release → Reuse Repository

example sentence clusters:

• the cat sat on the mat / The dog sat on the rug / The elk sat on the moss / The moose sat on the elk
• the cat sat on the carpet / The cat slept on the sofa
• Fish swam in the blue water / The fish swam in the green water / The fish swam in the red sea.
• the cat sat on the mat / this is a sentence you can’t read
• the cat sat on the mat / Another small test snippet / the cat sat on the mat / This is the same as the other one. / the cat sat on the mat
• the cat sat on the malt / The cat ate on the mat / the cat sat on the doormat
• the cat sat on the mat. / The cat sat on the mat / the cat sat on the mat
• the cat sat on the mat / More useless data points
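A toy micro-clustering pass over sentences like these, using character-level similarity from Python's standard library; real reuse systems use far more sophisticated matching, so this only illustrates the grouping idea.

```python
from difflib import SequenceMatcher

# Greedily group sentences whose similarity to a cluster's first
# member exceeds a threshold.
def cluster(sentences, threshold=0.75):
    clusters = []
    for s in sentences:
        for c in clusters:
            if SequenceMatcher(None, s.lower(), c[0].lower()).ratio() >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

sentences = ["the cat sat on the mat", "The cat sat on the mat",
             "the cat sat on the doormat", "Fish swam in the blue water"]
for c in cluster(sentences):
    print(c)
# the three cat/mat variants land in one cluster, the fish sentence in another
```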

DEMO

checking OpenOffice documentation

correctness

understandability

consistency

translatability

summary

• The authoring process is challenging:
  – correctness
  – consistency
  – understandability
  – translatability
• It can be effectively supported by NLP-enhanced tools

Thank you!

Melanie Siegel
Hochschule Darmstadt – University of Applied Sciences
melanie.siegel@h-da.de
