
Knowledge Discovery in Databases T14: text mining

P. Berka, 2018

Text mining

Text mining – data mining on unstructured textual documents

2 possible approaches:
  data preprocessing + "standard" data mining algorithms
  special algorithms for text mining

2 types of tasks:
  information retrieval – the documents are considered as a whole (documents correspond to instances)
  information extraction – analyze the content of the documents


Document representation (preprocessing)

Free text is transformed into one row of a data table:
  lexical analysis (identify words)
  lemmatization (transform inflected words to their base form)
  ignore stop words (words that are not related to the content of the document – typically connectives)

A row of the data table is a vector with as many components as there are terms (words) in the language (bag-of-words). Terms are encoded as:
  binary values – yes/no occurrence in the document,
  number of occurrences in the document,

@relation analcatdata-authorship

@attribute a INTEGER

@attribute all INTEGER

@attribute also INTEGER

@attribute an INTEGER

@attribute and INTEGER

@attribute any INTEGER

@attribute are INTEGER

@attribute as INTEGER

@attribute at INTEGER

@attribute be INTEGER

. . . . .

@attribute Author {Austen,London,Milton,Shakespeare}

@data

46,12,0,3,66,9,4,16,13,13,4,8,8,1,0,1,5,0,21,12,16,3,6,62,3,3,30,3,9,14,1,2,6,5,0,1

0,16,2,54,7,8,1,7,0,4,7,1,3,3,17,67,6,2,5,1,4,47,2,3,40,11,7,5,6,8,4,9,1,0,1,Austen
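A minimal sketch of this preprocessing step in Python (assuming whitespace tokenization and an illustrative stop-word list; lemmatization is omitted):

from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}   # illustrative list

def bag_of_words(text, vocabulary):
    """Turn free text into a term-count vector over a fixed vocabulary."""
    # lexical analysis: lowercase and split on whitespace (a real system
    # would also lemmatize the tokens)
    tokens = [t.strip(".,!?;:\"'()").lower() for t in text.split()]
    # ignore stop words
    counts = Counter(t for t in tokens if t and t not in STOP_WORDS)
    # one row of the data table: one component per term of the vocabulary
    return [counts[term] for term in vocabulary]

vocab = ["text", "mining", "data", "document"]
print(bag_of_words("Text mining is data mining on textual documents.", vocab))   # [1, 2, 1, 0]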


TFIDF value (term frequency – inverse document frequency) – requires working with the whole collection of documents (corpus):

  TFIDF = n · log(M / m)

where
  n ... number of occurrences of the term in the document
  m ... number of documents of the collection in which the term occurs
  M ... number of documents in the collection

Used to evaluate the similarity between documents based on the occurrence of the same terms.
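A minimal sketch of the TFIDF weighting over a tokenized corpus (the helper and variable names are illustrative):

import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus_tokens, vocabulary):
    """TFIDF = n * log(M/m) for each vocabulary term of one document."""
    M = len(corpus_tokens)                      # documents in the collection
    tf = Counter(doc_tokens)                    # n: occurrences in this document
    df = {term: sum(1 for d in corpus_tokens if term in d)   # m: documents containing the term
          for term in vocabulary}
    return [tf[t] * math.log(M / df[t]) if df[t] else 0.0 for t in vocabulary]

corpus = [["text", "mining", "data"], ["data", "mining"], ["neural", "networks"]]
vocab = ["text", "mining", "data", "neural"]
print(tfidf_vector(corpus[0], corpus, vocab))   # approx. [1.10, 0.41, 0.41, 0.0]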


Advantages:
  invariant w.r.t. the order of terms in the document
  does not require further preprocessing

Disadvantages:
  cannot express multiword phrases – can be solved using n-grams instead of single terms (see the sketch after this list)
    e.g. Mistr Jan Hus:
      bigrams: Mistr Jan, Jan Hus
      trigram: Mistr Jan Hus
  does not use the structure of documents – can be solved by weighting the terms
  very large dimensionality of the vectors (~ 10 000) – must be solved during preprocessing by
    attribute selection
      wrapper approach = use "brute force"
      filter approach = evaluate the relevance of terms
    attribute transformation
      e.g. latent semantic indexing: representation of documents using a small number of concepts
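A minimal sketch of generating n-grams from a token sequence (using the Mistr Jan Hus example above):

def ngrams(tokens, n):
    """Return the list of n-grams (as strings) of a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Mistr", "Jan", "Hus"]
print(ngrams(tokens, 2))   # ['Mistr Jan', 'Jan Hus']
print(ngrams(tokens, 3))   # ['Mistr Jan Hus']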


word2vec – a new approach to word representation, used to evaluate the similarity between words based on their appearance in documents (again requires working with the corpus – each word is represented by a vector)

consists of 2 variants:
  continuous bag-of-words (CBOW)
  skip-gram

a neural-network based approach where the inputs (and outputs) encode the co-occurrence of words and their contexts (neighboring words) in the corpus

CBOW is used to predict the probability of a word given its context, skip-gram is used to predict the probability of a context given a word

"pre-trained" vectors by Google are freely available (300-dimensional vectors for 3 million words and phrases)
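A minimal sketch of querying the pre-trained Google News vectors with the gensim library (an assumption of this example: gensim is installed and the GoogleNews-vectors-negative300.bin file has already been downloaded):

from gensim.models import KeyedVectors

# load the pre-trained 300-dimensional Google News vectors (several GB on disk)
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors["king"].shape)                    # (300,) – one vector per word
print(vectors.similarity("king", "queen"))      # cosine similarity of two words
print(vectors.most_similar("Prague", topn=3))   # nearest words in the vector space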


Example: text preprocessing in SAS Text Miner:
  text parsing
  text filtering
  text topic


Similarity of documents

For two documents x1 = {x11, x12, …, x1m}, x2 = {x21, x22, …, x2m}:

Cosine measure
  simC(x1, x2) = cos(x1, x2) = (x1 · x2) / (||x1|| · ||x2||)

Symmetric overlap measure
  simS(x1, x2) = Σj min(x1j, x2j) / min(Σj x1j, Σj x2j)

Dice measure
  simD(x1, x2) = 2 |x1 ∩ x2| / (|x1| + |x2|) = 2 (x1 · x2) / (Σj x1j + Σj x2j)

Jaccard measure
  simJ(x1, x2) = |x1 ∩ x2| / |x1 ∪ x2| = (x1 · x2) / (Σj x1j + Σj x2j − x1 · x2)

where
  x1 · x2 = Σj=1..m x1j · x2j
  ||x|| = √(x · x) = √(Σj=1..m xj²)
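A minimal sketch of the cosine and Jaccard measures on term-count vectors:

import math

def dot(x1, x2):
    return sum(a * b for a, b in zip(x1, x2))

def cosine(x1, x2):
    return dot(x1, x2) / (math.sqrt(dot(x1, x1)) * math.sqrt(dot(x2, x2)))

def jaccard(x1, x2):
    # for binary occurrence vectors this is |intersection| / |union|
    return dot(x1, x2) / (sum(x1) + sum(x2) - dot(x1, x2))

d1 = [1, 1, 0, 1]   # bag-of-words vectors over the same vocabulary
d2 = [1, 0, 1, 1]
print(cosine(d1, d2), jaccard(d1, d2))   # 0.666..., 0.5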


A) Information retrieval tasks

the document is understood as a whole

"Classic" information retrieval: find the documents that best fit a given query

1. boolean model = the query condition is composed from terms using the logical connectives AND, OR and NOT
   does not allow considering the importance of terms in the document
   does not allow considering the importance of terms in the query
   offers only a rough scale (document fits / does not fit)

2. fuzzy extension = offers more values than TRUE and FALSE
   e.g. for a query Q given using the weighted terms tj:vj and tk:vk, and a document D containing the same terms with weights tj:wj and tk:wk, the relevance R(D,Q) of document D w.r.t. query Q is:
     for the conjunction tj:vj AND tk:vk
       R(D,Q) = min(vj wj, vk wk)
     for the disjunction tj:vj OR tk:vk
       R(D,Q) = max(vj wj, vk wk)
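A minimal sketch of this fuzzy relevance computation (it assumes the adjacent weights vj wj are multiplied; the query and document weights are illustrative):

def relevance(query, doc, mode="AND"):
    """Fuzzy relevance of a document w.r.t. a weighted query.
    query, doc: dicts mapping a term to its weight (v resp. w);
    the contribution of a term is taken as the product v*w."""
    scores = [v * doc.get(t, 0.0) for t, v in query.items()]
    return min(scores) if mode == "AND" else max(scores)

Q = {"text": 0.8, "mining": 0.5}    # weighted query terms tj:vj
D = {"text": 0.6, "mining": 0.9}    # term weights in the document
print(relevance(Q, D, "AND"))       # min(0.8*0.6, 0.5*0.9) = 0.45
print(relevance(Q, D, "OR"))        # max(0.8*0.6, 0.5*0.9) = 0.48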


3. vector model = use the similarity measures mentioned above (both the query and the documents are vectors)

Results of retrieval are evaluated by precision and recall:

  Precision = TP / (TP + FP)        Recall = TP / (TP + FN)

Relation between precision and recall:
Narrow queries (AND) will return a relatively small number of retrieved documents, most of them relevant; broad queries (OR) will return a relatively large number of retrieved documents, most of them not relevant.
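A minimal sketch of computing precision and recall, assuming the retrieved set and the set of truly relevant documents are known:

def precision_recall(retrieved, relevant):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN) for sets of document ids."""
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}      # documents returned by the query
relevant = {2, 3, 5, 6, 7}    # documents that are actually relevant
print(precision_recall(retrieved, relevant))   # (0.5, 0.4)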


Text mining on the level of documents:

  text categorization – classification of documents into several classes
  document clustering – similarity-based grouping of documents
  document filtering – classification of documents into 2 classes (interesting vs. uninteresting, spam vs. ham)
  duplication detection – looking for similar documents (detecting plagiarism)

SAS Document duplication detection


  sentiment analysis – classification of documents according to the emotions expressed by the author (usually 3 classes: positive, negative and neutral)

SAS sentiment analysis


Systems and algorithms for information retrieval

  SMART algorithm (System for Manipulating And Retrieving Text) – vector representation, TFIDF, cosine measure and symmetric overlap (Salton, 1971)

  naive Bayes classifier for document classification – probabilistic estimate of P(term i in document | document from class c) (Lewis, 1991), (Mitchell, 1997), (Grobelnik, Mladenic, 1998); see the sketch after this list

  Kohonen neural network SOM – the geometric interpretation of the SOM is transformed into a conceptual interpretation; the closer two clusters are within the SOM, the closer the meaning of the corresponding documents
    WebSOM (Honkela, 1996), (Kohonen, 1998) – categorization of documents on the Internet

  genetic algorithms – documents are represented using bit strings (chromosomes) encoding the occurrence (1) or non-occurrence (0) of a term; the fitness function corresponds to a similarity measure (e.g. Jaccard) between the document and the query, which is also represented as a bit string (Gordon, 1988)
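A minimal sketch of a naive Bayes text classifier over bag-of-words counts with Laplace smoothing (a generic illustration, not the exact formulation of the cited papers):

import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, class). Returns per-class term counts and class counts."""
    term_counts = defaultdict(Counter)
    class_counts = Counter()
    for tokens, c in docs:
        term_counts[c].update(tokens)
        class_counts[c] += 1
    return term_counts, class_counts

def classify_nb(tokens, term_counts, class_counts, vocab_size):
    best, best_score = None, -math.inf
    total_docs = sum(class_counts.values())
    for c, n_c in class_counts.items():
        total_terms = sum(term_counts[c].values())
        # log P(c) + sum of log P(term | c) with Laplace smoothing
        score = math.log(n_c / total_docs)
        for t in tokens:
            score += math.log((term_counts[c][t] + 1) / (total_terms + vocab_size))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [(["free", "offer", "money"], "spam"), (["meeting", "tomorrow"], "ham")]
tc, cc = train_nb(docs)
print(classify_nb(["free", "money"], tc, cc, vocab_size=5))   # 'spam'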


SAS Text Miner


B) Information extraction tasks

analysis of unstructured text with the aim of finding a specific type of information

1. text summarization:
   e.g. SAS Text Summarization
   selects important sentences from the text – importance is given by user-defined concepts: the more concepts are identified in a sentence, the more important the sentence is; concepts are defined using regular expressions and grammar rules
   summarization options: whole document, paragraphs, sections
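A minimal sketch of this sentence-scoring idea, with concepts given as regular expressions (the concept patterns and the example text are illustrative):

import re

# user-defined concepts as regular expressions (illustrative patterns)
CONCEPTS = [re.compile(r"\btext mining\b", re.I),
            re.compile(r"\bclassif\w+", re.I),
            re.compile(r"\bdocuments?\b", re.I)]

def summarize(text, top_n=1):
    """Score each sentence by the number of concepts it matches, keep the best ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    scored = [(sum(1 for c in CONCEPTS if c.search(s)), s) for s in sentences]
    scored.sort(key=lambda p: p[0], reverse=True)
    return [s for score, s in scored[:top_n] if score > 0]

text = ("Text mining classifies documents automatically. "
        "The weather was nice. Document clustering groups similar documents.")
print(summarize(text, top_n=1))   # ['Text mining classifies documents automatically.']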

2. named entity recognition – identifying atomic elements like names of persons or organizations, place names, time information
   e.g. (Labský, Svátek, 2007) within the MedIEQ project


3. template mining: identifying sequences of words (usually defined using regular expressions)
   e.g. SAS Content Categorization:
     classification concept – defined using a list of words or "regular expressions"
     grammar concept – defined using linguistic rules

   Definition of a grammar concept
   Identification of a grammar concept
   Identification of adjectives: precision and recall are 13/17 ≈ 0.76
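A minimal sketch of template mining with a regular expression (the date template is an illustrative choice, not a SAS Content Categorization concept):

import re

# an illustrative template: a date written as "6 July 1415" or "15 March 2018"
DATE_TEMPLATE = re.compile(
    r"\b(\d{1,2})\s+"
    r"(January|February|March|April|May|June|July|August|September|October|November|December)"
    r"\s+(\d{4})\b")

text = "Jan Hus was burned at the stake on 6 July 1415 in Constance."
for day, month, year in DATE_TEMPLATE.findall(text):
    print(day, month, year)   # 6 July 1415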

4. finding associations between the occurrence of different phrases in a collection of documents, A ⇒ B: "if writing about A, then also writing about B"

   System FACT (Finding Associations in Collections of Text) – news about political events (Feldman, Hirsh, 1997)
     {Iran, USA} ⇒ Reagan

   System Document Explorer – analysis of business texts (Feldman et al., 1998)
     america online inc, bertelsmann ag ⇒ joint venture (13, 0.72)

Crucial for automated information extraction is a large body of domain knowledge: in the case of the FACT system, geopolitical knowledge and linguistic knowledge (synonyms of selected terms); in the case of the Document Explorer system, knowledge about companies and firms.
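A minimal sketch of finding such associations by counting co-occurrence support and confidence over a document collection (the thresholds and the toy documents are illustrative):

from itertools import combinations

def phrase_associations(docs, min_support=2, min_conf=0.7):
    """docs: list of sets of phrases found in each document.
    Yields rules left -> right with their support and confidence."""
    phrases = set().union(*docs)
    for a, b in combinations(sorted(phrases), 2):
        for left, right in ((a, b), (b, a)):
            n_left = sum(1 for d in docs if left in d)
            n_both = sum(1 for d in docs if left in d and right in d)
            if n_both >= min_support and n_left and n_both / n_left >= min_conf:
                yield left, right, n_both, n_both / n_left

docs = [{"Iran", "USA", "Reagan"}, {"Iran", "USA", "Reagan"},
        {"Iran", "oil"}, {"USA", "Reagan"}]
for rule in phrase_associations(docs):
    print(rule)   # ('Reagan', 'USA', 3, 1.0) and ('USA', 'Reagan', 3, 1.0)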


Systems for text mining

  Intelligent Miner for Text (IBM)
    http://www.software.ibm.com/

  Text Analyst (Megaputer Intelligence)
    http://www.megaputer.com

  Text Miner (SAS Institute Inc.)
    http://www.sas.com/technologies/analytics/datamining/textminer

After suitable text preprocessing (transforming documents into rows of a relational data table) we can also use "standard" KDD systems, e.g. Weka.