BIOI 7791 Projects in bioinformatics
Spring 2005, March 22
© Kevin B. Cohen


PGES upregulates PGE2 production in human thyrocytes

(GeneRIF: 12145315)

• Syntax: what are the relationships between words/phrases?

• Parsing: figuring out the structure
  – Full parse
  – Shallow parse
• Three names for the same thing:
  – Shallow parse
  – Partial parse
  – Syntactic chunking

Full parse

PGES upregulates PGE2 production in human thyrocytes

Shallow parse

[PGES]NounGroup [upregulates]VerbGroup [PGE2 production]NounGroup [in [human thyrocytes]NounGroup]PrepositionalGroup
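A shallow parse like this can be produced by a regular-expression chunker over POS tags. Here is a minimal sketch using NLTK's RegexpParser; the chunk grammar, the input tags, and the group names are illustrative assumptions, not the grammar behind the slide's analysis:

    import nltk

    # Toy chunk grammar (assumed for illustration): one level of base phrases.
    grammar = r"""
      NounGroup: {<DT>?<JJ>*<NN.*>+}
      VerbGroup: {<VB.*>+}
      PrepositionalGroup: {<IN>}
    """

    # Shallow parsing starts from already-POS-tagged tokens.
    tagged = [("PGES", "NNP"), ("upregulates", "VBZ"),
              ("PGE2", "NN"), ("production", "NN"),
              ("in", "IN"), ("human", "JJ"), ("thyrocytes", "NNS")]

    chunker = nltk.RegexpParser(grammar)
    print(chunker.parse(tagged))
    # A flat tree: base-phrase chunks only, no spine up to a root node.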

Shallow vs. full parsing

• Different depths
  – Full parse goes down to the level of individual words
  – Shallow parse doesn’t go down any further than the base phrase
• Different “heights”
  – Full parse goes “up” to the root node
  – Shallow parse doesn’t (generally) go further up than the base phrase

Shallow vs. full parsing

• Different number of levels of structure
  – Full parse has many levels
  – Shallow parse has far fewer

Shallow vs. full parsing

• Either way, you need POS information…

POS tagging: why you need it

• All syntax is built on it

• Overcome sparseness problem by abstracting away from specific words

• Help you decide how to stem

• Potential basis for entity identification

What “POS tagging” is

• POS: part of speech

• School: 8 (noun, verb, adjective, interjection…)

• Real life: 40 or more

How do you get from 8 to 80?

• Noun
  – NN (noun, singular or mass)
  – NNS (plural noun)
  – NNP (proper noun)
  – NNPS (plural proper noun)

How do you get from 8 to 80?

• Verb
  – VB (base form)
  – VBD (past tense)
  – VBG (gerund)
  – VBN (past participle)
  – VBP (non-3rd-person singular present tense)
  – VBZ (3rd-person singular present tense)

Others that are good to recognize

• Adjective
  – JJ (adjective)
  – JJR (comparative adjective)
  – JJS (superlative adjective)

Others that are good to recognize

• Coordinating conjunctions: CC
• Determiners: DT
• Prepositions: IN
• To: TO
• Punctuation: , (comma), . (sentence-final), : (sentence-medial)

POS tagging

• Definition: assigning POS “tags” to a string of tokens
• Input:
  – string of tokens
  – tag set
• Output:
  – best tag for each token
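In code, that input/output contract looks like this. A minimal sketch using NLTK's off-the-shelf tagger; the Penn Treebank tag set is implicit in the model, and the exact output tags may vary by model version:

    import nltk
    # nltk.download("averaged_perceptron_tagger")  # one-time model fetch

    tokens = ["PGES", "upregulates", "PGE2", "production",
              "in", "human", "thyrocytes"]   # input: a string of tokens
    print(nltk.pos_tag(tokens))              # output: best tag per token
    # e.g. [('PGES', 'NNP'), ('upregulates', 'VBZ'), ...]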

How do you define noun, verb, etc.?

• Semantic:
  – “A noun is a person, place, or thing…”
  – “A verb is…”
• Distributional characteristics:
  – “A noun can take the plural and genitive morphemes”
  – “A noun can appear in the environment All of my twelve hairy ___ left before noon”

Why’s it hard?

Time flies/VBZ like/IN an arrow, but fruit flies/NNS like/VBP a banana.

POS tagging: rule-based

1. Assign each word its list of potential parts of speech

2. Use rules to remove potential tags from the list
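Both steps fit in a few lines. A toy sketch of the idea; the mini-lexicon and the single constraint are invented for illustration, a far cry from EngCG's scale:

    # Step 1: look up each word's list of potential parts of speech.
    LEXICON = {                       # toy lexicon (invented)
        "the":   ["DT"],
        "flies": ["NNS", "VBZ"],
        "like":  ["IN", "VBP"],
    }

    def tag(tokens):
        cands = [list(LEXICON.get(w.lower(), ["NN"])) for w in tokens]
        # Step 2: rules remove potential tags. Toy constraint:
        # the word right after a determiner cannot be a verb.
        for i in range(1, len(tokens)):
            if cands[i - 1] == ["DT"]:
                cands[i] = [t for t in cands[i] if not t.startswith("VB")] or cands[i]
        return list(zip(tokens, cands))

    print(tag(["the", "flies", "like"]))
    # [('the', ['DT']), ('flies', ['NNS']), ('like', ['IN', 'VBP'])]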

The EngCG system:

• 56,000-item dictionary

• 3,744 rules

Note that all taggers need a way to deal with unknown words (OOV or “out-of-vocabulary”).

As always, (about) two approaches…

• Rule-based

• Learning-based

An aside: tagger input formats

• Raw text: apoptosis in a human tumor cell line .

• One-line (“slash”) format:
  apoptosis/NN in/IN a/DT human/JJ tumor/NN cell/NN line/NN ./.

• One-token-per-line (columnar) format:
  apoptosis  NN
  in         IN
  a          DT
  human      JJ
  tumor      NN
  cell       NN
  line       NN
  .          .

Just how ambiguous is natural language?

• Most English words are not ambiguous…
• …but many of the most common ones are.
• Brown corpus: only 11.5% of word types are ambiguous…
• …but > 40% of tokens are ambiguous.

A dictionary doesn’t give you a good estimate of the problem space…
…but corpus data does.

Empirical question: how ambiguous is biomedical text?
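One way to find out: count ambiguous types and tokens in a POS-tagged corpus. A sketch over the Brown corpus as shipped with NLTK; pointing it at a tagged biomedical corpus (e.g. GENIA) would answer the biomedical version of the question:

    import nltk
    from collections import defaultdict
    # nltk.download("brown")  # one-time corpus fetch

    tags_of = defaultdict(set)
    words = nltk.corpus.brown.tagged_words()
    for word, tag in words:
        tags_of[word.lower()].add(tag)

    ambig_types = sum(len(t) > 1 for t in tags_of.values())
    ambig_tokens = sum(len(tags_of[w.lower()]) > 1 for w, _ in words)
    print(f"types:  {ambig_types / len(tags_of):.1%} ambiguous")
    print(f"tokens: {ambig_tokens / len(words):.1%} ambiguous")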

A statistical approach: TnT

• Second-order Markov model

• Smoothing by linear interpolation of ngrams (formula below)
• λ estimated by deleted interpolation
• Tag probabilities learned for word endings; used for unknown words
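Concretely, the smoothed trigram estimate is a weighted sum of the raw unigram, bigram, and trigram estimates; the λ’s sum to one and are the values set by deleted interpolation:

    P(t_i \mid t_{i-1}, t_{i-2}) =
        \lambda_1 \hat{P}(t_i)
      + \lambda_2 \hat{P}(t_i \mid t_{i-1})
      + \lambda_3 \hat{P}(t_i \mid t_{i-1}, t_{i-2}),
    \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1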

TnT

• Ngram: an n-tag or n-word sequence
• Unigrams (N = 1):
  – DET
  – NOUN
  – role
• Bigrams (N = 2):
  – DET NOUN
  – NOUN PREPOSITION
  – a role
• Trigrams (N = 3)
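Pulling tag ngrams out of a tagged sentence is a one-liner per order; a minimal sketch (tag names follow the slide, not a standard tag set):

    tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "PREPOSITION", "NOUN"]

    unigrams = tags                                  # N = 1
    bigrams  = list(zip(tags, tags[1:]))             # N = 2
    trigrams = list(zip(tags, tags[1:], tags[2:]))   # N = 3
    print(bigrams[0], trigrams[0])
    # ('DET', 'NOUN') ('DET', 'NOUN', 'VERB')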


The Brill tagger

• Uses rules

• …but the set of rules is induced.


The Brill tagger

• Iterative error reduction (sketched below):
  1. Assign most common tags, then
  2. Evaluate performance, then
  3. Propose rules to fix errors, then
  4. Evaluate performance, then
  5. If you’ve improved, GOTO 3; else END
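The loop is easy to sketch. A heavily reduced version, assuming gold-tagged training data and an initial tagger; real Brill rules are drawn from many contextual templates, and scoring also subtracts errors a rule would introduce (ignored here):

    from collections import Counter

    def train_brill(tagged_sents, initial_tag, max_rules=10):
        """tagged_sents: [[(word, gold_tag), ...], ...]; initial_tag: word -> tag."""
        # 1. Assign most common tags.
        guesses = [[initial_tag(w) for w, _ in sent] for sent in tagged_sents]
        rules = []
        while len(rules) < max_rules:
            # 2./4. Evaluate: collect remaining errors with their left context.
            fixes = Counter()
            for sent, tags in zip(tagged_sents, guesses):
                for i in range(1, len(sent)):
                    gold = sent[i][1]
                    if tags[i] != gold:
                        # 3. Candidate rule: "after PREV, change OLD to NEW".
                        fixes[(tags[i - 1], tags[i], gold)] += 1
            if not fixes:
                break                    # 5. nothing left to improve: END
            rule = fixes.most_common(1)[0][0]
            rules.append(rule)
            prev, old, new = rule        # apply the winner, then GOTO 3
            for tags in guesses:
                for i in range(1, len(tags)):
                    if tags[i - 1] == prev and tags[i] == old:
                        tags[i] = new
        return rules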

The Brill tagger

• Rule: change Determiner Verb “of” …to… Determiner Noun “of”
• Before: The/Determiner running/Verb of/IN
• After: The/Determiner running/Noun of/IN

An aside: evaluating POS taggers

• Accuracy
• Confusion matrix
• How hard is the task? Domain/genre-specific…
  – Baseline: give each word its most common tag
  – Ceiling: interannotator agreement, usually high 90’s
  – State of the art:
    • 96-97% total accuracy (low 90’s on some corpora!)
    • Lower for non-punctuation

Confusion matrix

         JJ     NN     VBD
  JJ     --     .6     4.6
  NN     .5     --
  VBD    5.4    .01    --

Columns = tagger output; rows = right answer.
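Building one is just a count over (gold, guess) pairs. A minimal sketch, producing raw counts rather than the percentages shown above:

    from collections import Counter

    def confusion_matrix(gold, guessed):
        """matrix[(gold, guess)]: rows = right answer, columns = tagger output."""
        return Counter(zip(gold, guessed))

    m = confusion_matrix(["JJ", "NN", "VBD", "NN"],
                         ["JJ", "NN", "JJ",  "NN"])
    print(m[("VBD", "JJ")])   # 1: one VBD mis-tagged as JJ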

An aside: unknown words

• Call them all nouns

• Learn most common POS from training data

• Use morphology

• Suffix trees

• Other features, e.g. hyphenation (JJ in Brown; biomed?), capitalization…
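The suffix idea in miniature: learn which tags each word ending predicts, then prefer the longest known suffix. A sketch where the training interface and the back-off to "call them all nouns" are assumptions:

    from collections import Counter, defaultdict

    def train_suffixes(tagged_words, max_len=3):
        """Count tags per word ending, from training data."""
        by_suffix = defaultdict(Counter)
        for word, tag in tagged_words:
            for n in range(1, max_len + 1):
                by_suffix[word[-n:].lower()][tag] += 1
        return by_suffix

    def guess(word, by_suffix, max_len=3):
        # Prefer the longest known suffix; else "call them all nouns".
        for n in range(max_len, 0, -1):
            tags = by_suffix.get(word[-n:].lower())
            if tags:
                return tags.most_common(1)[0][0]
        return "NN"

    table = train_suffixes([("running", "VBG"), ("coding", "VBG"), ("rapidly", "RB")])
    print(guess("signaling", table))    # '-ing' seen in training: VBG
    print(guess("thyrocytes", table))   # unseen suffixes: falls back to NN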

POS tagging: extension(s)

• Entity identification

• What else??

• First step in any POS tagging effort:
  – tokenization (sketch below)
  – …maybe sentence segmentation
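A naive starting point for that first step; the regex is a deliberately simple assumption, and biomedical names (“PGE2”, hyphenated chemicals) are exactly where it starts to strain:

    import re

    def tokenize(text):
        # A word (letters/digits, so 'PGE2' survives) or a lone punctuation mark.
        return re.findall(r"\w+|[^\w\s]", text)

    print(tokenize("PGES upregulates PGE2 production in human thyrocytes."))
    # ['PGES', 'upregulates', 'PGE2', 'production', 'in', 'human', 'thyrocytes', '.']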

First programming assignment: tokenization

• What was hard?

• What if I told you that dictionaries don’t work for recognizing gene names, chemicals, or other “entities”?