BIOI 7791 Projects in bioinformatics
Spring 2005, March 22
© Kevin B. Cohen


PGES upregulates PGE2 production in human thyrocytes

(GeneRIF: 12145315)

• Syntax: what are the relationships between words/phrases?

• Parsing: figuring out the structure
  – Full parse
  – Shallow parse
• Three names for the same thing:
  – Shallow parse
  – Partial parse
  – Syntactic chunking

Full parse

PGES upregulates PGE2 production in human thyrocytes

Shallow parse

[PGES]NounGroup [upregulates]VerbGroup [PGE2 production]NounGroup [in [human thyrocytes]NounGroup]PrepositionalGroup
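A shallow parse like this can be produced by a regular-expression chunker over POS tags. Here is a minimal sketch using NLTK's RegexpParser; the chunk grammar, the input tags, and the group names are illustrative assumptions, not the grammar behind the slide's analysis:

    import nltk

    # Toy chunk grammar (assumed for illustration): one level of base phrases.
    grammar = r"""
      NounGroup: {<DT>?<JJ>*<NN.*>+}
      VerbGroup: {<VB.*>+}
      PrepositionalGroup: {<IN>}
    """

    # Shallow parsing starts from already-POS-tagged tokens.
    tagged = [("PGES", "NNP"), ("upregulates", "VBZ"),
              ("PGE2", "NN"), ("production", "NN"),
              ("in", "IN"), ("human", "JJ"), ("thyrocytes", "NNS")]

    chunker = nltk.RegexpParser(grammar)
    print(chunker.parse(tagged))
    # A flat tree: base-phrase chunks only, no spine up to a root node.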

Shallow vs. full parsing

• Different depths
  – Full parse goes down to the level of individual words
  – Shallow parse doesn’t go down any further than the base phrase
• Different “heights”
  – Full parse goes “up” to the root node
  – Shallow parse doesn’t (generally) go further up than the base phrase

Shallow vs. full parsing

• Different number of levels of structure
  – Full parse has many levels
  – Shallow parse has far fewer

Shallow vs. full parsing

• Either way, you need POS information…

POS tagging: why you need it

• All syntax is built on it

• Overcome sparseness problem by abstracting away from specific words

• Help you decide how to stem

• Potential basis for entity identification

What “POS tagging” is

• POS: part of speech

• School: 8 (noun, verb, adjective, interjection…)

• Real life: 40 or more

How do you get from 8 to 80?

• Noun
  – NN (noun, singular or mass)
  – NNS (plural noun)
  – NNP (proper noun)
  – NNPS (plural proper noun)

How do you get from 8 to 80?

• Verb
  – VB (base form)
  – VBD (past tense)
  – VBG (gerund)
  – VBN (past participle)
  – VBP (non-3rd-person singular present tense)
  – VBZ (3rd-person singular present tense)

Others that are good to recognize

• Adjective
  – JJ (adjective)
  – JJR (comparative adjective)
  – JJS (superlative adjective)

Others that are good to recognize

• Coordinating conjunctions: CC
• Determiners: DT
• Prepositions: IN
• To: TO
• Punctuation: , (comma), . (sentence-final), : (sentence-medial)

POS tagging

• Definition: assigning POS “tags” to a string of tokens
• Input:
  – string of tokens
  – tag set
• Output:
  – best tag for each token
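In code, that input/output contract looks like this. A minimal sketch using NLTK's off-the-shelf tagger; the Penn Treebank tag set is implicit in the model, and the exact output tags may vary by model version:

    import nltk
    # nltk.download("averaged_perceptron_tagger")  # one-time model fetch

    tokens = ["PGES", "upregulates", "PGE2", "production",
              "in", "human", "thyrocytes"]   # input: a string of tokens
    print(nltk.pos_tag(tokens))              # output: best tag per token
    # e.g. [('PGES', 'NNP'), ('upregulates', 'VBZ'), ...]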

How do you define noun, verb, etc.?

• Semantic:
  – “A noun is a person, place, or thing…”
  – “A verb is…”
• Distributional characteristics:
  – “A noun can take the plural and genitive morphemes”
  – “A noun can appear in the environment All of my twelve hairy ___ left before noon”

Why’s it hard?

Time flies/VBZ like/IN an arrow, but fruit flies/NNS like/VBP a banana.

POS tagging: rule-based

1. Assign each word its list of potential parts of speech

2. Use rules to remove potential tags from the list
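Both steps fit in a few lines. A toy sketch of the idea; the mini-lexicon and the single constraint are invented for illustration, a far cry from EngCG's scale:

    # Step 1: look up each word's list of potential parts of speech.
    LEXICON = {                       # toy lexicon (invented)
        "the":   ["DT"],
        "flies": ["NNS", "VBZ"],
        "like":  ["IN", "VBP"],
    }

    def tag(tokens):
        cands = [list(LEXICON.get(w.lower(), ["NN"])) for w in tokens]
        # Step 2: rules remove potential tags. Toy constraint:
        # the word right after a determiner cannot be a verb.
        for i in range(1, len(tokens)):
            if cands[i - 1] == ["DT"]:
                cands[i] = [t for t in cands[i] if not t.startswith("VB")] or cands[i]
        return list(zip(tokens, cands))

    print(tag(["the", "flies", "like"]))
    # [('the', ['DT']), ('flies', ['NNS']), ('like', ['IN', 'VBP'])]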

The EngCG system:

• 56,000-item dictionary

• 3,744 rules

Note that all taggers need a way to deal with unknown words (OOV or “out-of-vocabulary”).

As always, (about) two approaches…

• Rule-based

• Learning-based

An aside: tagger input formats

• Raw text: apoptosis in a human tumor cell line .

• One-line (“slash”) format:
  apoptosis/NN in/IN a/DT human/JJ tumor/NN cell/NN line/NN ./.

• One-token-per-line (columnar) format:
  apoptosis  NN
  in         IN
  a          DT
  human      JJ
  tumor      NN
  cell       NN
  line       NN
  .          .

Just how ambiguous is natural language?

• Most English words are not ambiguous…
• …but many of the most common ones are.
• Brown corpus: only 11.5% of word types are ambiguous…
• …but > 40% of tokens are ambiguous.

A dictionary doesn’t give you a good estimate of the problem space…
…but corpus data does.

Empirical question: how ambiguous is biomedical text?
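One way to find out: count ambiguous types and tokens in a POS-tagged corpus. A sketch over the Brown corpus as shipped with NLTK; pointing it at a tagged biomedical corpus (e.g. GENIA) would answer the biomedical version of the question:

    import nltk
    from collections import defaultdict
    # nltk.download("brown")  # one-time corpus fetch

    tags_of = defaultdict(set)
    words = nltk.corpus.brown.tagged_words()
    for word, tag in words:
        tags_of[word.lower()].add(tag)

    ambig_types = sum(len(t) > 1 for t in tags_of.values())
    ambig_tokens = sum(len(tags_of[w.lower()]) > 1 for w, _ in words)
    print(f"types:  {ambig_types / len(tags_of):.1%} ambiguous")
    print(f"tokens: {ambig_tokens / len(words):.1%} ambiguous")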

A statistical approach: TnT

• Second-order Markov model

• Smoothing by linear interpolation of ngrams (formula below)
• λ estimated by deleted interpolation
• Tag probabilities learned for word endings; used for unknown words
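Concretely, the smoothed trigram estimate is a weighted sum of the raw unigram, bigram, and trigram estimates; the λ’s sum to one and are the values set by deleted interpolation:

    P(t_i \mid t_{i-1}, t_{i-2}) =
        \lambda_1 \hat{P}(t_i)
      + \lambda_2 \hat{P}(t_i \mid t_{i-1})
      + \lambda_3 \hat{P}(t_i \mid t_{i-1}, t_{i-2}),
    \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1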

TnT

• Ngram: an n-tag or n-word sequence
• Unigrams (N = 1):
  – DET
  – NOUN
  – role
• Bigrams (N = 2):
  – DET NOUN
  – NOUN PREPOSITION
  – a role
• Trigrams (N = 3)
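Pulling tag ngrams out of a tagged sentence is a one-liner per order; a minimal sketch (tag names follow the slide, not a standard tag set):

    tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "PREPOSITION", "NOUN"]

    unigrams = tags                                  # N = 1
    bigrams  = list(zip(tags, tags[1:]))             # N = 2
    trigrams = list(zip(tags, tags[1:], tags[2:]))   # N = 3
    print(bigrams[0], trigrams[0])
    # ('DET', 'NOUN') ('DET', 'NOUN', 'VERB')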


The Brill tagger

• Uses rules

• …but the set of rules is induced.


The Brill tagger

• Iterative error reduction (sketched below):
  1. Assign most common tags, then
  2. Evaluate performance, then
  3. Propose rules to fix errors, then
  4. Evaluate performance, then
  5. If you’ve improved, GOTO 3; else END
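The loop is easy to sketch. A heavily reduced version, assuming gold-tagged training data and an initial tagger; real Brill rules are drawn from many contextual templates, and scoring also subtracts errors a rule would introduce (ignored here):

    from collections import Counter

    def train_brill(tagged_sents, initial_tag, max_rules=10):
        """tagged_sents: [[(word, gold_tag), ...], ...]; initial_tag: word -> tag."""
        # 1. Assign most common tags.
        guesses = [[initial_tag(w) for w, _ in sent] for sent in tagged_sents]
        rules = []
        while len(rules) < max_rules:
            # 2./4. Evaluate: collect remaining errors with their left context.
            fixes = Counter()
            for sent, tags in zip(tagged_sents, guesses):
                for i in range(1, len(sent)):
                    gold = sent[i][1]
                    if tags[i] != gold:
                        # 3. Candidate rule: "after PREV, change OLD to NEW".
                        fixes[(tags[i - 1], tags[i], gold)] += 1
            if not fixes:
                break                    # 5. nothing left to improve: END
            rule = fixes.most_common(1)[0][0]
            rules.append(rule)
            prev, old, new = rule        # apply the winner, then GOTO 3
            for tags in guesses:
                for i in range(1, len(tags)):
                    if tags[i - 1] == prev and tags[i] == old:
                        tags[i] = new
        return rules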

The Brill tagger

• Rule: change Determiner Verb “of” …to… Determiner Noun “of”
• Before: The/Determiner running/Verb of/IN
• After: The/Determiner running/Noun of/IN

An aside: evaluating POS taggers

• Accuracy
• Confusion matrix
• How hard is the task? Domain/genre-specific…
  – Baseline: give each word its most common tag
  – Ceiling: interannotator agreement, usually high 90’s
  – State of the art:
    • 96-97% total accuracy (low 90’s on some corpora!)
    • Lower for non-punctuation

Confusion matrix

         JJ     NN     VBD
  JJ     --     .6     4.6
  NN     .5     --
  VBD    5.4    .01    --

Columns = tagger output; rows = right answer.
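Building one is just a count over (gold, guess) pairs. A minimal sketch, producing raw counts rather than the percentages shown above:

    from collections import Counter

    def confusion_matrix(gold, guessed):
        """matrix[(gold, guess)]: rows = right answer, columns = tagger output."""
        return Counter(zip(gold, guessed))

    m = confusion_matrix(["JJ", "NN", "VBD", "NN"],
                         ["JJ", "NN", "JJ",  "NN"])
    print(m[("VBD", "JJ")])   # 1: one VBD mis-tagged as JJ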

An aside: unknown words

• Call them all nouns

• Learn most common POS from training data

• Use morphology

• Suffix trees

• Other features, e.g. hyphenation (JJ in Brown; biomed?), capitalization…
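The suffix idea in miniature: learn which tags each word ending predicts, then prefer the longest known suffix. A sketch where the training interface and the back-off to "call them all nouns" are assumptions:

    from collections import Counter, defaultdict

    def train_suffixes(tagged_words, max_len=3):
        """Count tags per word ending, from training data."""
        by_suffix = defaultdict(Counter)
        for word, tag in tagged_words:
            for n in range(1, max_len + 1):
                by_suffix[word[-n:].lower()][tag] += 1
        return by_suffix

    def guess(word, by_suffix, max_len=3):
        # Prefer the longest known suffix; else "call them all nouns".
        for n in range(max_len, 0, -1):
            tags = by_suffix.get(word[-n:].lower())
            if tags:
                return tags.most_common(1)[0][0]
        return "NN"

    table = train_suffixes([("running", "VBG"), ("coding", "VBG"), ("rapidly", "RB")])
    print(guess("signaling", table))    # '-ing' seen in training: VBG
    print(guess("thyrocytes", table))   # unseen suffixes: falls back to NN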

POS tagging: extension(s)

• Entity identification

• What else??

• First step in any POS tagging effort:
  – tokenization (sketch below)
  – …maybe sentence segmentation
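A naive starting point for that first step; the regex is a deliberately simple assumption, and biomedical names (“PGE2”, hyphenated chemicals) are exactly where it starts to strain:

    import re

    def tokenize(text):
        # A word (letters/digits, so 'PGE2' survives) or a lone punctuation mark.
        return re.findall(r"\w+|[^\w\s]", text)

    print(tokenize("PGES upregulates PGE2 production in human thyrocytes."))
    # ['PGES', 'upregulates', 'PGE2', 'production', 'in', 'human', 'thyrocytes', '.']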

First programming assignment: tokenization

• What was hard?

• What if I told you that dictionaries don’t work for recognizing gene names, chemicals, or other “entities”?