Text segmentation Amany AlKhayat

• Before any real processing can be done, text must be segmented into linguistic units such as words, punctuation marks, and numbers.

• This process is called tokenization, and the segmented units are called word tokens.

• Ex.: In addition, she was there.

• After segmentation: In addition , she was there .
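
A minimal sketch of a regex-based tokenizer that reproduces the segmentation above. The pattern and the name `tokenize` are illustrative assumptions, not a standard; real tokenizers also need rules for abbreviations, clitics, URLs, and so on.

```python
import re

# Hypothetical pattern: a number (with optional "," or "." groups), a word
# (with optional internal apostrophes or hyphens), or any single
# non-space symbol such as a punctuation mark.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:['’-]\w+)*|[^\w\s]")

def tokenize(text):
    """Return the word, number, and punctuation tokens found in text."""
    return TOKEN_RE.findall(text)

tokens = tokenize("In addition, she was there.")
print(" ".join(tokens))  # In addition , she was there .
```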

Tokenization

• Tokenization and sentence splitting can be described as ‘low-level’ segmentation, which is performed at the initial stage of text processing. These tasks are typically handled by regular expressions written in Perl or any other programming language.

Tokenization II

• High-level text segmentation, or intra-sentential segmentation, involves segmenting linguistic groups such as named entities and noun groups.

• Inter-sentential segmentation involves grouping sentences and paragraphs into discourse topics, which are also called text tiles.

Word segmentation

• Word tokens are the individual occurrences of words in a text; the same word can occur many times.

• Word types are the distinct words of the vocabulary.

• Ex.: Shakespeare’s works contain more than 800,000 word tokens but 31,000 word types.
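
The token/type distinction can be sketched in a few lines. The definition of a word used here (a run of letters or digits, case ignored) is a simplifying assumption, and `count_tokens_and_types` is a hypothetical helper name.

```python
import re
from collections import Counter

def count_tokens_and_types(text):
    """Return (number of tokens, number of types, per-type counts)."""
    # Assumption: a word is a maximal run of letters/digits, and case is
    # ignored so that "To" and "to" count as one type.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return len(tokens), len(counts), counts

n_tokens, n_types, counts = count_tokens_and_types(
    "To be, or not to be, that is the question."
)
print(n_tokens, n_types)  # 10 8
```

Here "to" and "be" each occur twice, so ten tokens reduce to eight types.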

Tokenizing sentences

• Tokenizing sentences by inserting white space is tedious. Moreover, once white space has been inserted, the original text cannot easily be restored.

• SGML or XML markup is a cleaner strategy: tokens are marked with tags, so the text can easily be reverted to its original form.

• Ex.: <w c=w>it</w> <w c=w>is</w> <w c=w>here</w> <w c=p>.</w>
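
A rough sketch of tag-based tokenization that can be undone. The class codes `w` (word) and `p` (punctuation) follow the example above; the function names and the choice to leave the original white space untouched between tags are assumptions made for illustration.

```python
import re
from html import escape, unescape

# Group 1 matches a word, group 2 a single punctuation symbol.
TOKEN_RE = re.compile(r"(\w+)|([^\w\s])")

def mark_up(text):
    """Wrap each token in a <w> tag; white space is left exactly as it was."""
    def wrap(match):
        cls = "w" if match.group(1) is not None else "p"
        # Escape the token so characters such as "<" cannot be mistaken for tags.
        return f"<w c={cls}>{escape(match.group(0))}</w>"
    return TOKEN_RE.sub(wrap, text)

def strip_markup(marked):
    """Remove the <w> tags and undo the escaping to recover the original text."""
    return unescape(re.sub(r"</?w[^>]*>", "", marked))

marked = mark_up("it is here.")
print(marked)
print(strip_markup(marked))  # it is here.
```

Because nothing but tags is added, removing the tags restores the input character for character, which is not possible once extra white space has been inserted.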

Sentence segmentation

• Important for many text-processing applications: syntactic parsing, information extraction, text alignment, machine translation, etc.

• Accurate splitting, known as sentence boundary disambiguation (SBD), requires analysis of the local context around periods and other punctuation marks.

• Compare:

• He stopped to see Dr. White.

• He stopped at Meadows Dr. White Falcon was still open.

• Which period is sentence-internal and which one is sentence-terminal?

Simplest algorithm for sentence boundary disambiguation

• ‘period-space-capital letter’

• It marks as sentence boundaries all periods, exclamation marks, and question marks that are followed by a space and a capital letter.

• Regex: [.?!][ ()”]+[A-Z]
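
The rule can be tried out with a short sketch. The pattern below is the one given above, extended to accept a straight double quote as well as the curly one (an assumption about what the slide intended); `split_sentences` is a hypothetical helper that cuts just before the capital letter.

```python
import re

# 'period-space-capital letter': end punctuation, then spaces, brackets,
# or quotes, then an uppercase letter.
BOUNDARY_RE = re.compile(r'[.?!][ ()"”]+[A-Z]')

def split_sentences(text):
    """Split text wherever the boundary pattern matches."""
    sentences = []
    start = 0
    for match in BOUNDARY_RE.finditer(text):
        end = match.end() - 1  # position of the capital letter
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("She was there. He was not! Was it late? Yes."))
print(split_sentences("He stopped to see Dr. White."))
print(split_sentences("He stopped at Meadows Dr. White Falcon was still open."))
```

The rule splits after "Dr." in both of the last two calls. That is right for the Meadows Dr. sentence but wrong for Dr. White, which shows why the local context has to be analysed.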

Part of speech tagging

• Criteria:

• 1- syntactic distribution

• 2- syntactic function

• 3- morphological and syntactic classes that different parts of speech can be assigned to.

Applications

• Preprocessors

• Large tagged text corpora (see Mark Davies Corpus)

• Information technology applications: text indexing and retrieval (nouns and adjectives are better candidates for good indexing than adverbs, verbs, and pronouns).
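
As an illustration of the indexing point, the sketch below keeps only nouns and adjectives as index terms. The input is tagged by hand for the example (no tagger is run), the tag names follow the Penn Treebank convention, and `index_terms` is a hypothetical helper.

```python
# Hand-tagged input, assumed for illustration. Penn Treebank style tags:
# DT = determiner, NN* = noun, JJ* = adjective, RB = adverb, VB* = verb.
tagged = [
    ("The", "DT"), ("parser", "NN"), ("quickly", "RB"), ("builds", "VBZ"),
    ("a", "DT"), ("syntactic", "JJ"), ("analysis", "NN"), (".", "."),
]

def index_terms(tagged_tokens, keep_prefixes=("NN", "JJ")):
    """Keep the lowercased words whose tag marks a noun or an adjective."""
    return [word.lower() for word, tag in tagged_tokens
            if tag.startswith(keep_prefixes)]

print(index_terms(tagged))  # ['parser', 'syntactic', 'analysis']
```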

Parsing

• See the Stanford University parser online (http://nlp.stanford.edu:8080/parser/index.jsp)

• Parsing uses a grammar to assign a syntactic analysis to a string of words.

• Shallow parsing: partitioning the input into chunks and identifying the headword of each chunk.
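
A very small sketch of shallow parsing for noun groups only. It assumes the input is already tagged (Penn Treebank style tags, assigned by hand here) and uses a simple heuristic: a chunk is a run of determiners, adjectives, and nouns, and its headword is the last noun. The function name and the heuristic are illustrative assumptions, not a full chunker.

```python
def noun_chunks(tagged_tokens):
    """Return (chunk words, headword) pairs for simple noun groups."""
    chunks = []
    current = []

    def flush():
        # Emit the current run if it contains a noun; the head is the last noun.
        nouns = [word for word, tag in current if tag.startswith("NN")]
        if nouns:
            chunks.append(([word for word, _ in current], nouns[-1]))
        current.clear()

    for word, tag in tagged_tokens:
        is_noun = tag.startswith("NN")
        is_modifier = tag == "DT" or tag.startswith("JJ")
        if not (is_noun or is_modifier):
            flush()
            continue
        if is_modifier and current and current[-1][1].startswith("NN"):
            flush()  # a modifier right after a noun opens a new chunk
        current.append((word, tag))
    flush()
    return chunks

tagged = [("The", "DT"), ("old", "JJ"), ("man", "NN"), ("saw", "VBD"),
          ("the", "DT"), ("stone", "NN"), ("bridge", "NN"), (".", ".")]
for words, head in noun_chunks(tagged):
    print(words, "head:", head)
```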

CFP: context-free parsing

• Context-free grammars are important in linguistics for describing the structure of sentences and words in natural language, and in computer science for describing the structure of programming languages and other formal languages. (Wikipedia)
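
To make the idea concrete, here is a sketch of a recognizer for a toy context-free grammar. The grammar and lexicon are invented for illustration, and the CYK algorithm is used as one common way to parse with a grammar in Chomsky normal form; the slides do not name a specific algorithm.

```python
from itertools import product

# Toy grammar in Chomsky normal form (hypothetical):
#   S -> NP VP,  NP -> Det N,  VP -> V NP
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICON = {
    "the": {"Det"}, "a": {"Det"},
    "dog": {"N"}, "cat": {"N"},
    "saw": {"V"}, "chased": {"V"},
}

def recognize(words, start="S"):
    """Return True if the grammar derives the word sequence (CYK)."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the categories that derive words[i:j].
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        table[i][i + 1] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for left, right in product(table[i][k], table[k][j]):
                    table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n]

print(recognize("the dog chased a cat".split()))  # True
print(recognize("dog the chased".split()))        # False
```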

Thank you