Text Mining
Fernando Gama
Information Systems undergraduate, UFPA
Undergraduate research (IC) fellow, Instituto Tecnológico Vale (ITV)


Presentation given at ITV (Instituto Tecnológico Vale)


Page 2: Text mining

What will we cover?

- Introduction
- Approach for KDT
- Categorization
- Classification
- Mining Task
- Experiments
- A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
- Discovery Co-relations on Research Topics and Authors from the PubMed Database
- Conclusion
- Credits

Page 3: Text mining

Introduction

Concept-Based Knowledge Discovery in Texts Extracted from the Web

- Statistical techniques are applied to concepts.
- The aim is to find patterns among concepts.
- A categorization algorithm identifies the concepts present in texts.
- A classification task, associated with the categorization algorithm, produces the concept definitions.

Page 4: Text mining

Introduction

The Web scenario:

- The collection of texts keeps growing.
- People want to extract useful information quickly and at low cost.
- Problem: information overload.

KDT (Knowledge Discovery in Texts): keywords must be assigned to the texts beforehand, either manually by humans or by software tools.

Keywords are used to categorize documents and to find associations. The most frequent terms become keywords (attributes), but working directly on terms runs into the vocabulary problem.

Page 5: Text mining

Introduction

Goals:

- Minimize the vocabulary problem.
- Minimize the effort necessary to extract useful information.
- Run the discovery process over concepts extracted from texts.
- Combine the categorization task with the mining task.

Categorization: identifies the concepts present in texts.
Mining: discovers patterns.


Page 7: Text mining

Approach for KDT

What is a concept?

- Dictionary: idea, opinion, thought.
- Information Retrieval (IR): concepts are used to index and to retrieve documents.

“Concepts expressed by a language are determined by environment, culture of the people who speak the language”.

Page 8: Text mining

Approach for KDT

The goal is to build a simple structure that can represent real-world objects, events, thoughts, opinions, and ideas easily and with a certain degree of quality for the discovery process.

Internal representation of concepts:

- A concept is stored as a set or vector of terms.
- The vector is non-ordered, which simplifies the classification and categorization tasks.

Page 9: Text mining

Approach for KDT

The KDT approach against the KDD phases:

1. Understanding the application domain and the goals of the data mining process.
2. Selecting a target data set: texts must be gathered (with tools or manually).
3. Integrating and checking the data set: texts must be saved as .txt files.
4. Data cleaning, preprocessing and transformation: concepts must be described, and texts need to be analyzed and stored in the internal format.
5. Model development and hypothesis building: identifying concepts in the collection.
6. Choosing suitable data mining algorithms.
7. Result interpretation and visualization: humans must interpret the conclusions.
8. Result testing and verification.
9. Using and maintaining the discovered knowledge: done by humans.


Page 11: Text mining

Categorization

Goal: identify the concepts present in texts.
Approach: based on a simple technique.

[Diagram: terms, concepts, and texts, related through fuzzy reasoning; IE]

Page 12: Text mining

Categorization

Rocchio algorithm: build a prototype-like vector to represent each class/category (concept).

Advantages:

- Simple.
- Easy to implement.
- Relatively efficient.

Main disadvantage: the context of words does not influence the categorization.

Page 13: Text mining

Categorization

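Rocchio algorithm, operation:

1. Check that the concepts have been defined and that the texts are represented in the internal format.
2. Compare every text against each concept (fuzzy comparison).
3. For the terms present in both, multiply their weights.
4. Sum the products: the overall sum, limited to 1, is the degree (from 0 to 1) to which the concept is present in the text.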
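The operation described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the example weights, and the `threshold` value are all assumptions made for the example, and the cap at 1 follows the 0...1 range shown on the slide.

```python
def concept_degree(concept, text):
    """Degree (0..1) to which a concept is present in a text.

    Both arguments are unordered term -> weight mappings.
    """
    common_terms = concept.keys() & text.keys()
    # Multiply the weights of the common terms and sum the products.
    total = sum(concept[term] * text[term] for term in common_terms)
    return min(total, 1.0)  # the overall sum is limited to 1


def categorize(concepts, text, threshold=0.2):
    """Return {concept_name: degree} for concepts at or above the threshold.

    The threshold value is a hypothetical choice for this sketch.
    """
    degrees = {name: concept_degree(vec, text) for name, vec in concepts.items()}
    return {name: degree for name, degree in degrees.items() if degree >= threshold}


# Hypothetical concept definitions and one text; every weight is invented.
concepts = {
    "elections": {"vote": 0.8, "candidate": 0.6, "campaign": 0.5},
    "loans": {"loan": 0.9, "bank": 0.5, "debt": 0.6},
}
text = {"vote": 0.5, "candidate": 0.5, "bank": 0.2, "mayor": 0.7}

print(categorize(concepts, text))
```

Because the vectors are unordered mappings, the comparison only needs the terms shared by both, which is what keeps the technique simple.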

Page 14: Text mining

Categorization

Rocchio algorithm in this approach (term → concept): the strength of the relation is analyzed according to indicators.

Abductive reasoning:

“A → B”: if “A is true”, then we infer that “B is true”.

Conclusion: when the words that describe a concept appear in a text, there is a high probability of that concept being present in that text.

Page 15: Text mining

Categorization

“A → B”: if “A is true”, then we infer that “B is true”.

Conclusion: [Diagram: text, concept, set of words]

Page 16: Text mining

Categorization

How do we decide whether a concept is present or not?

Page 17: Text mining

Categorization

Decision: it depends on a threshold plus context analysis.


Page 19: Text mining

Classification

Goal: generate concept definitions (the choice and description of each concept).

It is possible to use an existing controlled vocabulary (dictionaries, thesauri, ontologies) or to generate one automatically.

Problems:

1. Thesauri: very domain-dependent, without sufficient vocabulary coverage.
2. Ontologies: fail to include proper nouns.
3. Dictionaries: sometimes omit important semantic relations.

Pre-existing vocabularies may not be appropriate to the user's needs.

Page 20: Text mining

Classification

Automatic Generation of a controlled vocabulary:

Learning Process

Supervised process

Unsupervised process

Page 21: Text mining

Classification

Automatic Generation of a controlled vocabulary:

Supervised process: takes a set of input data (training data). The data set is analyzed and validation is applied to the algorithms.

Unsupervised process: no prior knowledge is available. It suggests the clustering technique: learning through observation.

Page 22: Text mining

Classification

Problem (supervised): a high-quality sample of data must be available.

Problem (unsupervised): classes are identified and created independently of the user's interest.

Page 23: Text mining

Classification

In this approach:

The goal is to use a method that is efficient and low-cost in terms of time and effort.

- Manual process + dictionaries and software tools.
- Technical dictionary + thesaurus.

Examine a sample of the collection's words by frequency and context.


Page 25: Text mining

Mining Task

Goal:

- Analyze concept distributions to discover interesting patterns.
- Probabilistic and statistical paradigm: the distribution of variables in the collection.
- Assumption: it only matters whether a concept is present in a text or not.

Page 26: Text mining

Mining Task

The first technique:

- Key-concept listing: analyzes concept distributions over the collection.

Goal: find which dominant themes exist in a collection or in a single text.

[Diagram: a text associated with Concept 1, Concept 2, and Concept 3, each with a concept degree of 1]

The listing reports the number of texts to which each concept is assigned.
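A key-concept listing amounts to counting, for each concept, how many texts it is assigned to. The sketch below assumes each text has already been categorized into a set of concepts; the function name and the sample collection are hypothetical.

```python
from collections import Counter


def key_concept_listing(texts_concepts):
    """Return (concept, number_of_texts, share_of_collection), most frequent first."""
    counts = Counter()
    for concepts in texts_concepts:
        # Presence only: a concept counts once per text.
        counts.update(set(concepts))
    total = len(texts_concepts)
    return [(concept, n, n / total) for concept, n in counts.most_common()]


# Hypothetical collection: each entry is the set of concepts assigned to one text.
collection = [
    {"politicians", "crimes"},
    {"politicians", "elections"},
    {"politicians"},
    {"crimes", "elections"},
]

for concept, n_texts, share in key_concept_listing(collection):
    print(f"{concept}: {n_texts} texts ({share:.1%})")
```

Running the same listing on two sub-collections and comparing the shares is one way to see how dominant themes shift over time.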

Page 27: Text mining

Mining Task

The second technique:

- Association: discovers associations between concepts and expresses these findings as rules.

1. Support: the proportion of texts that contain both x and y, relative to all texts in the collection.
2. Confidence: the proportion of texts that contain both x and y, relative to the number of texts that contain x.
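The two measures can be computed directly from the sets of concepts assigned to each text. This is a small sketch under that assumption; the function names and the sample texts are invented for illustration.

```python
def support(texts, x, y):
    """Proportion of all texts that contain both concepts x and y."""
    if not texts:
        return 0.0
    both = sum(1 for concepts in texts if x in concepts and y in concepts)
    return both / len(texts)


def confidence(texts, x, y):
    """Proportion of the texts containing x that also contain y (rule x -> y)."""
    with_x = [concepts for concepts in texts if x in concepts]
    if not with_x:
        return 0.0
    return sum(1 for concepts in with_x if y in concepts) / len(with_x)


# Hypothetical collection: one set of concepts per text.
texts = [
    {"loans", "politicians"},
    {"loans", "politicians", "education"},
    {"loans"},
    {"education", "politicians"},
    {"politicians"},
    {"crimes"},
]

print(f"support(loans, politicians)      = {support(texts, 'loans', 'politicians'):.2f}")
print(f"confidence(loans -> politicians) = {confidence(texts, 'loans', 'politicians'):.2f}")
print(f"confidence(politicians -> loans) = {confidence(texts, 'politicians', 'loans'):.2f}")
```

Support is symmetric in x and y, while confidence depends on the direction of the rule, which is why a rule and its reverse can have very different confidence values.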

Page 28: Text mining

Mining Task


Page 30: Text mining

Experiments

Two experiments:

- Political analysis.
- Competitive intelligence (business intelligence).

Notes:

- The classification task was different in each experiment.
- The experiments are complementary.

Page 31: Text mining

Experiments

Political experiment:

- Exhaustive analysis of words.
- Each word examined could be classified into an existing concept or generate a new one.
- Stopwords and general terms were eliminated beforehand.

Competitive intelligence experiment:

- Interesting concepts were selected first.
- Each concept was defined and refined by examining the words present in the collection.

Page 32: Text mining

Experiments

In these experiments:

- Support equal to 60%.
- Confidence threshold of 80%.

Page 33: Text mining

Experiments

Political experiment

- Goal: extract knowledge about what the press is or was saying about the mayor of a big city in Brazil.
- Data: newspaper texts in Portuguese, in two sub-collections (1997 and 1999) of 180 and 178 texts.

Page 34: Text mining

Experiments

Association rules (association technique)

a) drug traffic → politicians (confidence = 93.3%, support = 14 documents)

b) loans → politicians (confidence = 82.1%, support = 14 documents), discovered in the 1997 sub-collection.

[Diagram: political sphere, importance degree]

Page 35: Text mining

Experiments

Association rules (association technique)

c) Combination of two patterns: “education” and “loans” may be connected.

(1) loans → politicians (confidence = 82.1%, support = 23 documents)
(2) education → politicians (confidence = 64.2%, support = 27 documents)
(3) education → loans (confidence = 4.7%, support = 2 documents)
(4) loans → education (confidence = 7.1%, support = 2 documents)

Page 36: Text mining

Experiments

However...

Page 37: Text mining

Experiments

When analyzing these two concepts together:

(5) loans AND education → politicians (confidence = 83.3%, support = 5 documents)

(6) loans AND politicians → education (confidence = 17.2%)


Page 39: Text mining

Experiments

Concept distributions (key-concept listing technique)

Whole collection = 358 texts.

- Politicians: 140 texts (39.1%)
- Crimes: 117 texts (32.6%)
- Elections: 105 texts (29.3%)

Comparing the distributions of concepts (1997 and 1999):

1. In 1997 the dominant focus was the presence of politicians associated with the mayor, while in 1999 the themes had a balanced distribution.
2. The weight of the “elections” concept: 1997 = 25% and 1999 = 33.7%.
3. The “debts” concept reduced its participation from 1997 to 1999.

And so on...

Page 40: Text mining

Experiments

Competitive intelligence experiment

- Goal: compare text mining tools, examining the techniques used and the benefits cited by the vendors of these tools. In addition, relate techniques to benefits, in order to discover which techniques to use when a certain benefit is needed.


Page 42: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Information Retrieval

- It has been developing in parallel with database systems. However:
  - Databases: process structured data.
  - Information Retrieval (IR): organizes and retrieves information, handling different kinds of data.
- IR has found many applications.
- The IR problem: locate the relevant documents in a document collection based on a user's query.

Page 43: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Information Retrieval (Measures for Retrieval)

- Quality of text retrieval:

[Venn diagram: within the set of all documents, the relevant documents and the retrieved documents overlap in the "relevant and retrieved" set]
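The sets in the Venn diagram lead to the usual retrieval measures: precision (share of retrieved documents that are relevant), recall (share of relevant documents that were retrieved), and fallout (share of non-relevant documents that were retrieved). A minimal sketch, with a hypothetical function name and invented document ids:

```python
def retrieval_quality(relevant, retrieved, all_docs):
    """Return (precision, recall, fallout) from sets of document ids."""
    relevant_and_retrieved = relevant & retrieved
    non_relevant = all_docs - relevant
    precision = len(relevant_and_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_and_retrieved) / len(relevant) if relevant else 0.0
    fallout = len(retrieved - relevant) / len(non_relevant) if non_relevant else 0.0
    return precision, recall, fallout


# Hypothetical collection of ten documents.
all_docs = set(range(1, 11))
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5, 6, 7}

precision, recall, fallout = retrieval_quality(relevant, retrieved, all_docs)
print(f"precision={precision:.2f} recall={recall:.2f} fallout={fallout:.2f}")
```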

Page 44: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Information Retrieval (Measures for Retrieval)

- Measuring the quality of a ranked list of documents:

Page 45: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Text Indexing Techniques

- Text retrieval indexing techniques.

Inverted index: an index structure that maintains two hash-indexed tables, a document table and a term table.

- Document table: consists of a set of document records (doc id + posting list).
- Term table: consists of a set of term records (term id + posting list).
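The two tables of an inverted index can be sketched with plain Python dictionaries, which are hash-indexed. This is an illustrative sketch only: tokenization is a simple whitespace split, terms themselves stand in for term ids, and the function names and sample documents are assumptions.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Build (document_table, term_table) from a {doc_id: text} mapping."""
    document_table = {}             # doc_id -> posting list of terms in the document
    term_table = defaultdict(list)  # term   -> posting list of doc_ids containing it
    for doc_id in sorted(documents):
        terms = sorted(set(documents[doc_id].lower().split()))
        document_table[doc_id] = terms
        for term in terms:
            term_table[term].append(doc_id)
    return document_table, dict(term_table)


def search(term_table, *terms):
    """Return the ids of the documents that contain every query term."""
    postings = [set(term_table.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical documents.
documents = {
    1: "text mining finds patterns",
    2: "information retrieval finds documents",
    3: "text retrieval",
}
document_table, term_table = build_inverted_index(documents)

print(search(term_table, "text"))
print(search(term_table, "retrieval", "finds"))
```

A query is answered by intersecting posting lists, without scanning the documents themselves.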

Page 46: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Text Indexing Techniques

Advantages (inverted index):

- Widely used in industry.
- Easy to implement.

Disadvantages (inverted index):

- Posting lists do not handle synonymy or polysemy.
- Large storage requirement.

Signature file:

- Stores a signature record for each document in the database.
- Holds a “signature” record for each entry stored in the main file.

Page 47: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Text Indexing Techniques

Signature file:

Advantages:

- Little storage space.

Example: one signature per person, one bit per term, in the order tall, cheerful, elegant, witty, smart, strong, graceful, engaging, bold.

- João: 1 0 1 1 0 0 0 1 1
- Maria: 0 1 1 0 1 0 1 0 1
- Pedro: 0 1 0 1 1 1 0 1 0
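The bit table on this slide can be queried with bitwise operations. The sketch below uses the slide's bits with the terms translated to English, and assumes the three bit rows belong to João, Maria, and Pedro in that order; the function names are hypothetical. With one bit per term there are no false matches, whereas practical signature files usually hash many terms onto a fixed number of bits and must verify candidates afterwards.

```python
TERMS = ["tall", "cheerful", "elegant", "witty", "smart",
         "strong", "graceful", "engaging", "bold"]

# One signature per record; bit i corresponds to TERMS[i].
# The row-to-name pairing is an assumption based on the slide's order.
SIGNATURES = {
    "João": "101100011",
    "Maria": "011010101",
    "Pedro": "010111010",
}


def make_signature(terms):
    """Encode a collection of terms as a bit string, one bit per entry of TERMS."""
    return "".join("1" if term in terms else "0" for term in TERMS)


def matches(signature, query_terms):
    """True if every bit set in the query signature is also set in the record."""
    query = int(make_signature(query_terms), 2)
    return (int(signature, 2) & query) == query


def lookup(query_terms):
    """Return the names whose signature covers all query terms."""
    return [name for name, sig in SIGNATURES.items() if matches(sig, query_terms)]


print(lookup({"elegant", "bold"}))
print(lookup({"cheerful", "smart"}))
```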

Page 48: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Query Processing Techniques

[Diagram: DB, NoSQL]

Problems:

- Synonymy: different words with the same meaning, e.g. automobile and vehicle.
- Polysemy: the same keyword with different meanings.

Page 49: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Information Extraction

Information Extraction (IE) is the process of extracting from documents facts about types of events, entities, or relationships. These facts are then usually entered automatically into a database or spreadsheet, which may be used to analyze the data for trends, to give a natural language summary, or for indexing purposes in Information Retrieval (IR) applications.

- Information Retrieval: finds texts and presents them to the user.
- Information Extraction: analyzes texts and presents only the specific information the user is interested in.

Page 50: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Information Extraction

Information Retrieval

Page 51: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Information Extraction: Layer model of the Text Mining Application

Notice these aspects...

Page 52: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Information Extraction - Stemming

Stemming identifies the root of a word.

- Derivational: creates a new word from an existing word.
- Inflectional: normalization is limited to regularizing grammatical variants (singular/plural, past/present).

E.g.:

- apply, applied, applies
- print, printing, prints, printed

Porter stemming algorithm:

- minimizes the effects of inflection;
- handles morphological variations of words.
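To make inflectional stemming concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the stem; the rule list, the `min_stem` parameter, and the function name are assumptions chosen so that the slide's examples collapse to a common root.

```python
# (suffix, replacement) pairs, checked in order; longer suffixes come first.
SUFFIX_RULES = [("ies", "y"), ("ied", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def naive_stem(word, min_stem=3):
    """Strip one common English inflectional suffix, if the remaining stem is long enough."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem + replacement
    return word


for group in (["apply", "applied", "applies"],
              ["print", "printing", "prints", "printed"]):
    print({word: naive_stem(word) for word in group})
```

A real stemmer handles many cases this sketch gets wrong (for example, doubled consonants as in "stopped"), which is why an established algorithm such as Porter's is normally used.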

Page 53: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Information Extraction – Domain Dictionary

The extraction process needs to be provided with a knowledge base.

The domain dictionary has a three-level hierarchy: parent category, sub-category, word.

- Parent category: the main category; unique on its level.
- Sub-category: belongs to a certain parent category and holds all the words associated with it.
- Word: depends on the categories above it.
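The three-level hierarchy maps naturally onto nested dictionaries. The sketch below is only an illustration: the categories and words are invented, and the reverse index is one possible way to look up which categories a word belongs to.

```python
# Hypothetical domain dictionary: parent category -> sub-category -> words.
DOMAIN_DICTIONARY = {
    "politics": {
        "elections": ["vote", "candidate", "campaign"],
        "finance": ["loan", "debt", "budget"],
    },
    "crime": {
        "drug traffic": ["trafficking", "cartel"],
    },
}


def build_word_index(dictionary):
    """Map each word to the list of (parent category, sub-category) pairs it belongs to."""
    index = {}
    for parent, subcategories in dictionary.items():
        for subcategory, words in subcategories.items():
            for word in words:
                index.setdefault(word, []).append((parent, subcategory))
    return index


word_index = build_word_index(DOMAIN_DICTIONARY)
print(word_index["loan"])
print(word_index.get("unknown", []))
```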

Page 54: Text mining

A survey of Text Mining: Retrieval, Extraction and Indexing Techniques

Information Extraction – Exclusion List

Many words in a text file can be treated as unwanted noise.

They need to be eliminated: a separate file lists all such words.

Words such as: the, a, an, if, off, on, in, etc.
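Filtering with an exclusion list is a simple set lookup per token. In this sketch the list is built from the words named on the slide; the file loader assumes a hypothetical format of one word per line, and the tokenizer (lowercase ASCII letters only) is a simplification.

```python
import re

# The words named on the slide; a real list would be much longer.
EXCLUSION_LIST = {"the", "a", "an", "if", "off", "on", "in"}


def load_exclusion_list(path):
    """Read an exclusion list from a separate file, one word per line (assumed format)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def remove_noise_words(text, exclusion_list=EXCLUSION_LIST):
    """Tokenize the text and drop every token found in the exclusion list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in exclusion_list]


print(remove_noise_words("The mayor is on the news in an interview"))
```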


Page 56: Text mining

Discovery Co-relations on Research Topics and Authors from the PubMed Database

1. PubMed

PubMed is a free search engine that provides very full coverage of the biomedical and related sciences, such as biochemistry and cell biology. It also offers access to the MEDLINE database, with citations and abstracts of biomedical research articles.

1.1. PubMed data structure

- More than 17 million citations with the same structure.
- Files are intended for automatic processing (XML).
- A set of 30,000 PubMed citations forms an XML instance defined by a DTD.

Page 57: Text mining

Discovery Co-relations on Research Topics and Authors from the PubMed Database

1.2. Generating a keyword file

Extracted information: PubMed identifier, publication year, MeSH terms, and authors' names.

A parser has been developed:

[Diagram: PubMed files 1 to 6 → parser → new text files (keywords + authors)]

One citation entry:

17191901, 2004, Erythrina, Plant Extracts, Plant Roots, chemistry, isolation purification, TANAKA_H, HIRATA_M, ETOH_H, SATO_M, MURATA_J, MURATA_H, DARNAEDI_D, FUKAI_T
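A parser of this kind can be sketched with Python's standard `xml.etree.ElementTree`. The XML below is a heavily simplified, hand-written sample loosely modeled on PubMed citation XML and filled with a few values from the citation entry on this slide; real PubMed files follow the NLM DTD and contain many more fields, so the element paths should be checked against that DTD. The function name and output format are assumptions, and MeSH qualifiers are left out.

```python
import xml.etree.ElementTree as ET

# Simplified sample for illustration only; not a complete PubMed record.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>17191901</PMID>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2004</Year></PubDate></JournalIssue></Journal>
        <AuthorList>
          <Author><LastName>Tanaka</LastName><Initials>H</Initials></Author>
          <Author><LastName>Hirata</LastName><Initials>M</Initials></Author>
        </AuthorList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Erythrina</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Plant Extracts</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def citation_entries(xml_text):
    """Return one 'id, year, keywords..., authors...' line per citation."""
    root = ET.fromstring(xml_text)
    entries = []
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID", default="")
        year = citation.findtext(".//PubDate/Year", default="")
        keywords = [d.text or "" for d in citation.iter("DescriptorName")]
        authors = [
            f"{a.findtext('LastName', default='').upper()}_{a.findtext('Initials', default='')}"
            for a in citation.iter("Author")
        ]
        entries.append(", ".join([pmid, year, *keywords, *authors]))
    return entries


for entry in citation_entries(SAMPLE_XML):
    print(entry)
```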

Page 58: Text mining

Discovery Co-relations on Research Topics and Authors from the PubMed Database

MeSH

1.2. Generating a keyword file

[Diagram: NLM as creator, maintainer, and provider of the data]

Keyword analysis:

- Primary concepts and alternative descriptions.
- Frequency of occurrence types and medical terms.
- MeSH terms.

Page 59: Text mining

Discovery Co-relations on Research Topics and Authors from the PubMed Database

2. Pre-Processing the Data

- Datasets obtained for the years 2003, 2004, and 2005.
- SQL Server 2005 database.

Operations:

1. Remove noise from the data (irregular characters).
2. Organize the data for more efficient access.

Page 60: Text mining

Discovery Co-relations on Research Topics and Authors from the PubMed Database

2. Pre-Processing the Data

DsPM: input dataset obtained by parsing the PubMed XML files.

Top-5-KW and Top-1-A: indicate the most frequent keywords (KW) and authors (A).

Page 61: Text mining

Discovery Co-relations on Research Topics and Authors from the PubMed Database

3. Mining the PubMed data

Association Rule (AR): A → C

- Support: the number of database entries where this rule appears.
- Confidence: the probability that an entry in the database that contains A will also contain C.

Dependency Networks (DN): graphical models that represent joint distributions for a set of variables. DNs are useful for learning and describing probabilistic relationships in data.

Page 62: Text mining

Discovery Co-relations on Research Topics and Authors from the PubMed Database

3. Mining the PubMed data

Discovering Dependency Networks(DN)

Page 63: Text mining

Discovery Co-relations on Research Topics and Authors from the PubMed Database

3. Mining the PubMed data

Discovering Dependency Networks(DN)

High confidence value: high probability of co-occurrence with the author in the consequent.

High lift value: when the antecedent occurs, there is a high probability of co-occurrence with the author in the consequent.
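Support, confidence, and lift for a rule A → C can be computed by counting entries. The slides do not define lift; the sketch below uses the standard definition, confidence divided by the relative frequency of the consequent, so a value above 1 means the consequent is more likely when the antecedent is present. The entries and the function name are hypothetical.

```python
def rule_metrics(entries, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Support is a count of entries, as on the slides; lift uses the
    standard definition confidence / P(consequent).
    """
    n = len(entries)
    n_a = sum(1 for entry in entries if antecedent <= entry)
    n_c = sum(1 for entry in entries if consequent <= entry)
    n_ac = sum(1 for entry in entries if (antecedent | consequent) <= entry)
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return n_ac, confidence, lift


# Hypothetical entries mixing keywords and authors.
entries = [
    {"Erythrina", "TANAKA_H"},
    {"Erythrina", "TANAKA_H", "Plant Extracts"},
    {"Erythrina"},
    {"Plant Extracts", "SATO_M"},
    {"Plant Roots", "SATO_M"},
]

support, confidence, lift = rule_metrics(entries, {"Erythrina"}, {"TANAKA_H"})
print(f"support={support} confidence={confidence:.2f} lift={lift:.2f}")
```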


Page 65: Text mining

Conclusion

We saw that:

- The approach chosen combines a manual task, automatic tools, and existing vocabularies.
- Automatic methods can help the user find terms related to categories, lexical variations, local synonyms, frequencies, etc.
- Human intervention is important!

To evaluate the categorization method, formal experiments were carried out on texts gathered from the Web:

[Diagram: 5 tools, 13 concepts, 8 tasks, 5 methods]

Page 66: Text mining

Conclusion

- Microaveraging precision = 0.59; macroaveraging precision = 0.54; microaveraging recall = 0.95; macroaveraging recall = 0.86; fallout = 0.62.
- Microaveraging precision = 0.65; macroaveraging precision = 0.69; microaveraging recall = 0.89; macroaveraging recall = 0.93; fallout = 0.28.
- Macroaveraging precision = 0.61; macroaveraging recall = 0.97.

Classification: negative terms + ambiguous words.
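The difference between micro- and macroaveraging can be shown with a small sketch: macroaveraging computes the measure per concept and then averages, while microaveraging pools the counts of all concepts first. The per-concept counts below are invented and are unrelated to the results reported on this slide; the function name is hypothetical.

```python
def micro_macro(per_concept_counts):
    """per_concept_counts maps concept -> (true_pos, false_pos, false_neg)."""
    precisions, recalls = [], []
    tp_sum = fp_sum = fn_sum = 0
    for tp, fp, fn in per_concept_counts.values():
        # Per-concept measures, later averaged for the macro figures.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        # Pooled counts for the micro figures.
        tp_sum += tp
        fp_sum += fp
        fn_sum += fn
    n = len(per_concept_counts)
    return {
        "macro_precision": sum(precisions) / n if n else 0.0,
        "macro_recall": sum(recalls) / n if n else 0.0,
        "micro_precision": tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0,
        "micro_recall": tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0,
    }


# Hypothetical counts for two concepts.
counts = {"politicians": (8, 2, 0), "loans": (1, 3, 1)}

for name, value in micro_macro(counts).items():
    print(f"{name} = {value:.2f}")
```

Microaveraging is dominated by the concepts with many assignments, while macroaveraging weights every concept equally, which is why the two figures can diverge.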

Page 67: Text mining

Credits

FERREIRA, P. G.; LIBRELOTTO, Giovani; ALVES, Ronnie. Discovering Co-Relations on Research Topics and Authors from the PubMed Database.

LOH, S.; WIVES, L. K.; OLIVEIRA, J. P. M. Concept-Based Knowledge Discovery in Texts Extracted from the Web.

SAGAYAM, R.; SRINIVASAN, S.; ROSHNI, S. A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques.

Physicsandcake. Teaching Artificial Intelligences using Quantum Computers. Available at: <http://dwave.wordpress.com/2011/05/27/teaching-artificial-intelligences-using-quantum-computers/>. Accessed on 06/10/2013.

Cat Casey. Predictive Analytics and Artificial Intelligence... Science Fiction or E-Discovery Truth?. Available at: <http://hudsonlegalblog.com/e-discovery/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth.html>. Accessed on 07/10/2013.

Traina, Ribeiro, Cordeiro, Romani, Sousa, Avila, Zullo, Traina, Rodrigues. How to Find Relevant Patterns in Climate Data: an Efficient and Effective Framework to Mine Climate Time Series and Remote Sensing Images. Available at: <http://www.gbdi.icmc.usp.br/agrodatamine/node/33>. Accessed on 07/10/2013.