Text mining

Fernando Gama
Information Systems undergraduate student - UFPA
Undergraduate research (IC) fellow (Instituto Tecnológico Vale - ITV)

What will we cover?
- Introduction
- Approach for KDT
- Categorization
- Classification
- Mining Task
- Experiments
- A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques
- Discovering Co-Relations on Research Topics and Authors from the PubMed Database
- Conclusion
- Credits

Introduction: Concept-Based Knowledge Discovery in Texts Extracted from the Web

- Statistical techniques are applied to concepts.
- The aim is to find patterns in concepts.
- To identify concepts in texts, a categorization algorithm is applied.
- A classification task is associated with the categorization algorithm to produce the concept definitions.

Introduction

The Web scenario:
- a growing collection of texts;
- people want to extract useful information quickly and at low cost;
- problem: information overload!

KDT (Knowledge Discovery in Texts): keywords must be assigned to the texts beforehand, either
- manually by humans, or
- by software tools.

Keywords are used to categorize documents and find associations.

[Diagram: terms; the most frequent ones become keywords (attributes); vocabulary problem]

Introduction

Goals:
- to minimize the vocabulary problem;
- to minimize the effort necessary to extract useful information;
- to run the discovery process over concepts extracted from texts;
- to combine a categorization task and a mining task.

Categorization: identifies the concepts present in the texts.
Mining: discovers patterns.


Approach for KDT

What is a concept?

- Dictionary definition: idea, opinion, thought.
- Information Retrieval (IR): concepts are used to index and to retrieve documents.

“Concepts expressed by a language are determined by environment, culture of the people who speak the language”.

Approach for KDT

The goal is to build a simple structure that can represent real-world objects, events, thoughts, opinions and ideas easily and with a certain degree of quality for the discovery process.

Representing concepts internally:
- A concept is stored as a set or vector of terms.
- The vector is non-ordered, to simplify the classification and categorization tasks.

Approach for KDT

KDT approach against the KDD phases:
1. Understanding the application domain and the goals of the data mining process.
2. Selecting a target data set: texts must be gathered (with tools or manually).
3. Integrating and checking the data set: texts must be saved as plain text (*.txt).
4. Data cleaning, preprocessing and transformation: concepts must be described, and texts need to be analyzed and stored in the internal format.
5. Model development and hypothesis building: identifying concepts in the collection.
6. Choosing suitable data mining algorithms.
7. Result interpretation and visualization: humans must interpret the conclusions.
8. Result testing and verification.
9. Using and maintaining the discovered knowledge: done by humans.


Categorization

The goal: identify the concepts present in texts.
The approach: based on a simple technique.

[Diagram: terms, concepts and texts, related through fuzzy reasoning and IE]

Categorization

Rocchio Algorithm Categorization
Build a prototype-like vector to represent each class/category (concept).

Advantages:
- simple;
- easy to implement;
- relatively efficient.

Main disadvantage:
- the context of words doesn't influence the categorization.

Categorization

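Rocchio Algorithm Categorization
Operation:

1. Preconditions: the concepts have been defined and the texts are represented in the internal format.
2. Compare every text against each concept (fuzzy reasoning).
3. For the terms present in both the text and the concept, multiply their weights.
4. Sum the products; the overall sum is limited to 1.

[Diagram: degree of relation between a concept and a text, from 0 to 1]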
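The operation steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `concept_degree`, the example concepts and all weights are hypothetical, and capping the sum at 1 is an assumption about how the degree is kept within 0..1.

```python
def concept_degree(text_weights, concept_weights):
    """Return the degree (0..1) to which a concept is present in a text.

    Both arguments map term -> weight. Only terms present in both
    the text and the concept contribute to the degree.
    """
    common_terms = set(text_weights) & set(concept_weights)
    total = sum(text_weights[t] * concept_weights[t] for t in common_terms)
    # Assumption: the sum is capped so the degree stays within 0..1.
    return min(total, 1.0)


# Hypothetical concept definitions: unordered term -> weight mappings.
concepts = {
    "elections": {"vote": 0.6, "candidate": 0.5, "ballot": 0.4},
    "crimes": {"robbery": 0.7, "police": 0.5},
}

# Hypothetical text already converted to the internal format.
text = {"candidate": 0.5, "vote": 0.4, "police": 0.2}

# Compare the text against every concept.
degrees = {name: concept_degree(text, terms) for name, terms in concepts.items()}
for name, degree in degrees.items():
    print(f"{name}: {degree:.2f}")
```

Because concepts are stored as unordered term sets, the comparison reduces to a set intersection followed by a weighted sum.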

Categorization

Rocchio Algorithm Categorization
In this approach (term → concept): the strength of the relation is analyzed according to indicators.

Abductive reasoning:

“A → B”: if “A is true”, then we infer “B is true”.
Conclusion: if the words that describe a concept appear in a text, there is a high probability of that concept being present in that text.

Categorization

“A → B”: if “A is true”, then we infer “B is true”.

Conclusion: [Diagram: text, concept, set of words]

Categorization

How do we decide whether a concept is present or not?

Categorization

Decision: depends on a threshold plus context analysis.


Classification

The goal: generate concept definitions (choice and description of each concept).

It is possible to:
- use an existing controlled vocabulary (dictionaries, thesauri, ontologies), or
- automatically generate one.

Problems:
1. Thesauri: very domain-dependent, and they don't have sufficient vocabulary coverage.
2. Ontologies: fail to include proper nouns.
3. Dictionaries: sometimes don't include important semantic relations.

Preexisting vocabularies may not be appropriate to the user's needs.

Classification

Automatic generation of a controlled vocabulary:

Learning process:
- supervised process;
- unsupervised process.

Classification

Automatic Generation of a controlled vocabulary:

Supervised process: a set of input data (training data) is given. The data set is analyzed and the algorithms are validated against it.

Unsupervised process: no previous knowledge is available. This suggests the clustering technique: learning by observation.

Classification

Problem (supervised process): a high-quality sample of data must be available.

Problem (unsupervised process): classes are identified and created independently of the user's interest.

Classification

In this approach:

The goal is to use a method that is efficient and low-cost in terms of time and effort.

- Manual process + dictionaries and software tools.
- Technical dictionary + thesaurus.

[Diagram: examine a sample of the collection's words, looking at frequency and context]


Mining Task

The goal:
- Analyze concept distributions to discover interesting patterns.
- Probabilistic and statistical paradigm: distribution of variables in the collection.
- Assumption: it is only important to know whether a concept is present or not in a text.

Mining Task

The first technique:
- Key-concept listing: analyzes concept distributions over the collection.

Goal: to find which dominant themes exist in a collection or in a single text.

[Diagram: a text and its associated concepts (Concept 1, Concept 2, Concept 3), each with concept degree 1]

Measure: the number of texts to which the concept is assigned.
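A minimal sketch of key-concept listing, assuming each text has already been categorized into a set of concepts and that only presence or absence matters. The function name `key_concepts` and the four-text collection are hypothetical.

```python
from collections import Counter


def key_concepts(texts_concepts):
    """Rank concepts by the number of texts they are assigned to.

    texts_concepts: one collection of concept names per text.
    Returns (concept, number_of_texts, share_of_collection) tuples,
    most frequent concept first.
    """
    counts = Counter()
    for concepts in texts_concepts:
        # set() ensures a concept counts at most once per text.
        counts.update(set(concepts))
    n_texts = len(texts_concepts)
    return [(c, k, k / n_texts) for c, k in counts.most_common()]


# Hypothetical categorized collection.
collection = [
    {"politicians", "crimes"},
    {"politicians"},
    {"elections", "politicians"},
    {"crimes"},
]

listing = key_concepts(collection)
for concept, n, share in listing:
    print(f"{concept}: {n} texts ({share:.1%})")
```

The same function applies to a sub-collection (for instance, the texts of a single year), which is how distributions can be compared over time.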

Mining Task

The second technique:
- Association: discovers associations between concepts and expresses these findings as rules (x → y).

1. Support: proportion of texts that have x and y in relation to all texts in the collection.
2. Confidence: proportion of texts that have x and y in relation to the number of texts that have x.
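The two measures can be illustrated with a short sketch. The function names and the five-text collection are hypothetical, and confidence is computed over the texts that contain x, which is the usual definition.

```python
def support(texts, concepts):
    """Proportion of texts that contain all the given concepts."""
    wanted = set(concepts)
    return sum(1 for t in texts if wanted <= t) / len(texts)


def confidence(texts, antecedent, consequent):
    """Proportion of the texts containing the antecedent that also
    contain the consequent (rule: antecedent -> consequent)."""
    x = set(antecedent)
    with_x = [t for t in texts if x <= t]
    if not with_x:
        return 0.0
    both = x | set(consequent)
    return sum(1 for t in with_x if both <= t) / len(with_x)


# Hypothetical collection: each text is the set of concepts found in it.
texts = [
    {"loans", "politicians"},
    {"loans", "politicians", "education"},
    {"loans"},
    {"education", "politicians"},
    {"crimes"},
]

sup = support(texts, ["loans", "politicians"])        # 2 of 5 texts
conf = confidence(texts, ["loans"], ["politicians"])  # 2 of 3 texts with "loans"
print(f"support = {sup:.2f}, confidence = {conf:.2f}")
```

Passing several concepts as the antecedent covers combined rules such as "loans AND education → politicians".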


Experiments

Two experiments:
- political analysis;
- competitive intelligence (business intelligence).

Notes:
- The classification task was different in each experiment.
- The experiments are complementary.

Experiments

Political experiment:
- Exhaustive analysis of words.
- Each word examined could be classified into an existing concept or generate a new one.
- Stopwords and general terms were eliminated beforehand.

Competitive intelligence experiment:
- Interesting concepts were selected first.
- Each concept was defined and refined by examining the words present in the collection.

Experiments

In these experiments:
- support equal to 60%;
- confidence threshold of 80%.

Experiments

Political Experiment
- Goal of this experiment: extract knowledge about what the press is or was saying about the mayor of a big city in Brazil.

[Diagram: newspaper texts in Portuguese, split into two sub-collections, 1997 and 1999, with 180 texts and 178 texts]

Experiments

Association rules (association technique)

a) drug traffic → politicians (confidence = 93.3%, support = 14 documents)

b) loans → politicians (confidence = 82.1%, support = 14 documents)
- discovered in the 1997 sub-collection.

[Diagram: political sphere, importance degree]

Experiments

Association rules (association technique)

c) Combination of two patterns:

“education” and “loans” may be connected.

(1) loans → politicians (confidence = 82.1%, support = 23 documents)
(2) education → politicians (confidence = 64.2%, support = 27 documents)

(3) education → loans (confidence = 4.7%, support = 2 documents)
(4) loans → education (confidence = 7.1%, support = 2 documents)

Experiments

However...

Experiments

When analyzing these two concepts together:

(5) loans AND education → politicians (confidence = 83.3%, support = 5 documents)

(6) loans AND politicians → education (confidence = 17.2%)

Experiments

Concept distributions (key-concept listing technique)

Whole collection = 358 texts.

Comparing the distributions of concepts (1997 and 1999):
1. In 1997 the dominant focus was the presence of politicians associated with the mayor, while in 1999 the themes had a balanced distribution.
2. The weight of the “elections” concept: 1997 = 25% and 1999 = 33.7%.
3. The “debts” concept reduced its participation from 1997 to 1999.

And so on...

Concept       Texts   Share of collection
Politicians   140     39.1%
Crimes        117     32.6%
Elections     105     29.3%

Experiments

Competitive Intelligence Experiment
- Goal of this experiment: compare text mining tools, examining the techniques used and the benefits cited by the vendors of these tools. In addition, relate techniques and benefits in order to discover which techniques to use when a certain benefit is needed.


A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Information Retrieval
- It has been developing in parallel with database systems. However:
  - Database systems => processing of structured data.
  - Information Retrieval (IR) => organization and retrieval of information; handles different kinds of data.
- IR has found many applications.
- The IR problem: to locate relevant documents in a document collection based on a user's query.


Information Retrieval (Measures for Retrieval)
- Quality of text retrieval:

[Venn diagram: within the set of all documents, the relevant documents and the retrieved documents overlap in the "relevant and retrieved" documents]
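The sets in the diagram are what the usual retrieval quality measures are computed from. The sketch below assumes the standard definitions, which the slide text itself does not spell out: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The function name and document ids are hypothetical.

```python
def precision_recall(relevant, retrieved):
    """Compute precision and recall from two sets of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # the "relevant and retrieved" region
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d5"}

precision, recall = precision_recall(relevant_docs, retrieved_docs)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```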


Information Retrieval (Measures for Retrieval)- Measure the quality of a ranked list of documents:


Text Indexing Techniques
- Text retrieval indexing techniques.

Inverted index: an index structure that maintains two hash-indexed tables, the document table and the term table.
- Document table => a set of document records (doc id + posting list of the terms that occur in the document).
- Term table => a set of term records (term id + posting list of the documents in which the term appears).
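A minimal sketch of the two tables, assuming whitespace tokenization and Python dictionaries as the hash index. The function name `build_tables` and the three documents are hypothetical.

```python
from collections import defaultdict


def build_tables(docs):
    """Build the document table and the term table of an inverted index.

    docs: mapping doc_id -> raw text.
    Returns (document_table, term_table), both with sorted posting lists.
    """
    document_table = {}
    term_postings = defaultdict(set)
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        document_table[doc_id] = sorted(terms)  # terms occurring in the document
        for term in terms:
            term_postings[term].add(doc_id)     # documents containing the term
    term_table = {term: sorted(ids) for term, ids in term_postings.items()}
    return document_table, term_table


# Hypothetical document collection.
docs = {
    1: "text mining finds patterns",
    2: "mining association rules",
    3: "text retrieval",
}

document_table, term_table = build_tables(docs)

# Documents containing both "text" and "mining": intersect the posting lists.
both = sorted(set(term_table["text"]) & set(term_table["mining"]))
print(both)
```

A query for several terms becomes an intersection of posting lists, which is why the structure is simple to implement. It matches exact terms only, so synonyms and ambiguous words are not resolved.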


Text Indexing Techniques
Advantages (inverted index):
- Widely used in industry.
- Easy to implement.
Disadvantages:
- Posting lists do not handle synonymy and polysemy.
- Large storage requirement.

Signature file:
- Stores a signature record for each document in the database.
- Each record stored in the main file has a corresponding “signature” record.


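Text Indexing Techniques
Signature file:
Advantages:
- Little storage space.

Example:

Terms (bit positions 1 to 9): alta (tall), alegre (cheerful), elegante (elegant), espirituoso (witty), esperto (clever), forte (strong), gracioso (graceful), envolvente (engaging), ousado (bold)

Signatures:
João    1 0 1 1 0 0 0 1 1
Maria   0 1 1 0 1 0 1 0 1
Pedro   0 1 0 1 1 1 0 1 0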
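The example signatures can be queried with bitwise operations. This is a sketch under two assumptions: each bit position corresponds to one term, in the order the terms are listed on the slide, and the three bit rows belong to João, Maria and Pedro in that order. The helper names are hypothetical.

```python
# Terms in slide order; position i in a signature refers to TERMS[i].
TERMS = ["alta", "alegre", "elegante", "espirituoso", "esperto",
         "forte", "gracioso", "envolvente", "ousado"]

# Bit rows as they appear on the slide.
BIT_ROWS = {
    "João": "101100011",
    "Maria": "011010101",
    "Pedro": "010111010",
}


def to_mask(terms):
    """Turn a list of terms into an integer bit mask."""
    mask = 0
    for term in terms:
        mask |= 1 << TERMS.index(term)
    return mask


# Convert each bit row into an integer signature.
SIGNATURES = {
    name: sum(1 << i for i, bit in enumerate(bits) if bit == "1")
    for name, bits in BIT_ROWS.items()
}


def matches(query_terms):
    """Return the records whose signature has every query bit set."""
    query = to_mask(query_terms)
    return [name for name, sig in SIGNATURES.items() if sig & query == query]


print(matches(["elegante", "ousado"]))
print(matches(["alegre", "esperto"]))
```

With nine terms, each signature fits in nine bits, which is where the small storage footprint comes from. In a real signature file several terms usually share a bit, so a match is only a candidate and has to be checked against the document itself.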


Query Processing Techniques

[Diagram: DB, NoSQL]

Problems:
- Synonymy => different keywords with the same meaning, e.g. automobile and vehicle.
- Polysemy => the same keyword with different meanings.


Information Extraction
Information Extraction (IE) is the process of extracting from documents facts about types of events, entities or relationships. These facts are then usually entered automatically into a database or spreadsheet, which may be used to analyze the data for trends, to give a natural language summary, or for indexing purposes in Information Retrieval (IR) applications.

Information Retrieval: finds texts and presents them to the user.

Information Extraction: analyzes texts and presents the specific information extracted from them.



Information Extraction: Layer model of the Text Mining Application

Notice these aspects...


Information Extraction - Stemming
Identifying the root of a certain word.

Derivational: creates a new word from an existing word.

Inflectional: normalization is limited to regularizing grammatical variants (singular/plural or past/present).

E.g.
apply – applied – applies
print – printing – prints – printed

Porter stemming algorithm:
- minimizes the effects of inflection;
- reduces the morphological variations of words.
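To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem. The rule list and the minimum stem length of 3 are assumptions chosen to handle the example words from the slide.

```python
# (suffix, replacement) pairs, tried in order. Illustrative rules only.
SUFFIX_RULES = [
    ("ies", "y"),
    ("ied", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def naive_stem(word):
    """Strip one common inflectional suffix from an English word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least 3 characters of stem to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


for w in ["apply", "applied", "applies", "print", "printing", "prints", "printed"]:
    print(w, "->", naive_stem(w))
```

Rules this simple break quickly on other words ("running" would become "runn"), which is the kind of case a full stemmer such as Porter's is designed to handle.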


Information Extraction – Domain Dictionary
It is necessary to provide the tools with a knowledge base.

The structure of the domain dictionary is a three-level hierarchy: parent category + sub-category + word.
- Parent category: the main category; it is unique on its level.
- Sub-category: belongs to a certain parent category; all words are associated with it.
- Word: depends on the previous categories.


Information Extraction – Exclusion List
Many words in a text file can be treated as unwanted noise.

It is necessary to eliminate them: a separate file lists all such words.

Words such as: the, a, an, if, off, on, in, etc.
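A minimal sketch of applying an exclusion list. For brevity the list is an in-memory set holding only the words named on the slide, rather than a separate file, and tokenization is plain whitespace splitting. The function name `remove_excluded` is hypothetical.

```python
# Exclusion list limited to the words mentioned on the slide.
EXCLUSION_LIST = {"the", "a", "an", "if", "off", "on", "in"}


def remove_excluded(text):
    """Return the words of a text that are not in the exclusion list."""
    return [w for w in text.lower().split() if w not in EXCLUSION_LIST]


tokens = remove_excluded("The parser runs on a server in the lab")
print(tokens)
```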

Discovering Co-Relations on Research Topics and Authors from the PubMed Database

1. PubMed
PubMed is a free search engine that provides very full coverage of the related biomedical sciences, such as biochemistry and cell biology. It also offers access to the MEDLINE database, with citations and abstracts of biomedical research articles.

1.1. PubMed data structure
- More than 17 million citations with the same structure.
- Files are intended for automatic processing (XML).
- 30,000 PubMed citations = one XML instance defined by a DTD.


Set of information such as: PubMed identifier + publication year + MeSH terms + authors' names.

A parser has been developed:

1.2. Generating a keyword file

[Diagram: PubMed files 1 to 6 → parser → new text files (keywords + authors)]

One citation entry:
17191901, 2004, Erythrina, Plant Extracts, Plant Roots, chemistry, isolation purification, TANAKA_H, HIRATA_M, ETOH_H, SATO_M, MURATA_J, MURATA_H, DARNAEDI_D, FUKAI_T
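The parsing step could look roughly like the sketch below. It is not the parser used in the study. The XML fragment is a hypothetical, heavily trimmed citation, and the element names (`MedlineCitation`, `PMID`, `PubDate/Year`, `DescriptorName`, `Author/LastName`, `Author/Initials`) are assumptions about the PubMed DTD that should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PubMed-style fragment.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>17191901</PMID>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2004</Year></PubDate></JournalIssue></Journal>
        <AuthorList>
          <Author><LastName>Tanaka</LastName><Initials>H</Initials></Author>
          <Author><LastName>Hirata</LastName><Initials>M</Initials></Author>
        </AuthorList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Erythrina</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Plant Extracts</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def citation_lines(xml_text):
    """Produce one 'id, year, keywords..., authors...' line per citation."""
    root = ET.fromstring(xml_text)
    lines = []
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID", default="")
        year = citation.findtext(".//PubDate/Year", default="")
        keywords = [d.text for d in citation.iter("DescriptorName") if d.text]
        authors = [
            f"{a.findtext('LastName', default='').upper()}_{a.findtext('Initials', default='')}"
            for a in citation.iter("Author")
        ]
        lines.append(", ".join([pmid, year, *keywords, *authors]))
    return lines


lines = citation_lines(SAMPLE)
for line in lines:
    print(line)
```

Each output line has the same layout as the citation entry shown on the slide: identifier, year, MeSH keywords, then authors as SURNAME_INITIALS.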


1.2. Generating a keyword file

MeSH

[Diagram: NLM as creator, maintainer and provider of the data; keyword analysis by frequency]

- Primary concepts and alternative descriptions.
- Types of occurrences and medical terms.
- MeSH terms.


2. Pre-Processing the Data

- Datasets obtained for the years 2003, 2004 and 2005.
- SQL Server 2005 database.

Operations:
1. Removing noise from the data (irregular characters).
2. Organizing the data for more efficient access.


2. Pre-Processing the Data

DsPM: the input dataset, obtained by parsing the PubMed XML files.

Top-5-KW and Top-1-A: indicate the most frequent keywords (KW) and authors (A).


3. Mining the PubMed data

Association Rule (AR): A → C

Support: number of database entries where this rule appears.
Confidence: probability that an entry in the DB that contains A will also contain C.

Dependency Networks (DN): graphical models that represent joint distributions for a set of variables. DNs are useful to learn and describe probabilistic relationships in data.



Discovering Dependency Networks(DN)



High confidence value: high probability of co-occurrence of the author in the consequent.

High lift value: when the antecedent occurs, there is a high probability of co-occurrence of the author in the consequent.
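Lift can be illustrated next to confidence. The sketch assumes the standard definition, which the slide does not state: lift(A → C) = confidence(A → C) / P(C), so values above 1 mean that C co-occurs with A more often than expected if the two were independent. The entries, keyword and author below are hypothetical.

```python
def rule_measures(entries, antecedent, consequent):
    """Return (support_count, confidence, lift) for the rule A -> C.

    entries: list of sets (keywords and authors of one citation each).
    """
    n = len(entries)
    n_a = sum(1 for e in entries if antecedent in e)
    n_c = sum(1 for e in entries if consequent in e)
    n_ac = sum(1 for e in entries if antecedent in e and consequent in e)
    confidence = n_ac / n_a if n_a else 0.0
    # lift = P(A and C) / (P(A) * P(C)) = confidence / P(C)
    lift = confidence / (n_c / n) if n_c else 0.0
    return n_ac, confidence, lift


# Hypothetical citation entries.
entries = [
    {"Erythrina", "TANAKA_H"},
    {"Erythrina", "TANAKA_H"},
    {"Erythrina"},
    {"Plant Roots", "TANAKA_H"},
    {"Plant Roots"},
    {"Plant Extracts"},
]

count, conf, lift = rule_measures(entries, "Erythrina", "TANAKA_H")
print(f"support = {count} entries, confidence = {conf:.2f}, lift = {lift:.2f}")
```

Confidence alone can be high simply because the consequent author is prolific. Dividing by P(C) corrects for that, which is why the two measures are read together.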


Conclusion

We saw that:
- The chosen approach combines a manual task, automatic tools and existing vocabularies.
- Automatic methods can help the user find terms related to categories, lexical variations, local synonyms, frequencies, etc.
- Human intervention is important!

To evaluate the categorization method, formal experiments were carried out on texts gathered from the Web:

[Diagram: 5 tools, 13 concepts, 8 tasks, 5 methods]

Conclusion

Reported measures:
- Microaveraging precision = 0.59; macroaveraging precision = 0.54; microaveraging recall = 0.95; macroaveraging recall = 0.86; fallout = 0.62.
- Microaveraging precision = 0.65; macroaveraging precision = 0.69; microaveraging recall = 0.89; macroaveraging recall = 0.93; fallout = 0.28.
- Macroaveraging precision = 0.61; macroaveraging recall = 0.97.
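The difference between the two kinds of averaging is easiest to see in code. The sketch assumes the usual definitions: microaveraging pools the counts of all categories before computing the measure, while macroaveraging computes the measure per category and then takes the mean. The per-category counts are hypothetical and unrelated to the figures reported above.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def micro_macro(per_category):
    """Compute micro- and macro-averaged precision and recall.

    per_category: mapping category -> (true_pos, false_pos, false_neg).
    """
    tp = sum(c[0] for c in per_category.values())
    fp = sum(c[1] for c in per_category.values())
    fn = sum(c[2] for c in per_category.values())
    n = len(per_category)
    return {
        "micro_precision": ratio(tp, tp + fp),
        "micro_recall": ratio(tp, tp + fn),
        "macro_precision": sum(ratio(t, t + p) for t, p, _ in per_category.values()) / n,
        "macro_recall": sum(ratio(t, t + f) for t, _, f in per_category.values()) / n,
    }


# Hypothetical counts: one large, easy category and one small, hard one.
counts = {
    "politicians": (8, 2, 0),
    "debts": (1, 3, 1),
}

scores = micro_macro(counts)
for name, value in scores.items():
    print(f"{name} = {value:.3f}")
```

Microaveraging is dominated by the categories with many texts, while macroaveraging weights every category equally, so a gap between the two suggests that performance differs between frequent and rare concepts.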

Classification

Negative terms + ambiguous words

Credits

FERREIRA, P. G.; LIBRELOTTO, G.; ALVES, R. Discovering Co-Relations on Research Topics and Authors from the PubMed Database.

LOH, S.; WIVES, L. K.; OLIVEIRA, J. P. M. Concept-Based Knowledge Discovery in Texts Extracted from the Web.

SAGAYAM, R.; SRINIVASAN, S.; ROSHNI, S. A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques.

Physicsandcake. Teaching Artificial Intelligences using Quantum Computers. Available at: <http://dwave.wordpress.com/2011/05/27/teaching-artificial-intelligences-using-quantum-computers/>. Accessed on 06/10/2013.

Cat Casey. Predictive Analytics and Artificial Intelligence... Science Fiction or E-Discovery Truth?. Available at: <http://hudsonlegalblog.com/e-discovery/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth.html>. Accessed on 07/10/2013.

Traina, Ribeiro, Cordeiro, Romani, Sousa, Avila, Zullo, Traina, Rodrigues. How to Find Relevant Patterns in Climate Data: an Efficient and Effective Framework to Mine Climate Time Series and Remote Sensing Images. Available at: <http://www.gbdi.icmc.usp.br/agrodatamine/node/33>. Accessed on 07/10/2013.
