INFORMATION EXTRACTION

Text Data Mining

Page 1: Text Data Mining

INFORMATION EXTRACTION

Page 2: Text Data Mining

What is Information Extraction?

Goal:

Extract structured information from unstructured (or loosely formatted) text.

Typical description of the task:
- Identify named entities
- Identify relations between entities
- Populate a database

May also include:
- Event extraction
- Resolution of temporal expressions
- Wrapper induction (automatic construction of templates)

Applications: natural language understanding, question answering, summarization, etc.

Page 3: Text Data Mining

Information Extraction

IE extracts pieces of information that are salient to the user's needs:
- Find named entities such as persons and organizations
- Find attributes of those entities or events they participate in

Contrast with IR, which indicates which documents need to be read by a user.

Links between the extracted information and the original documents are maintained to allow the user to reference context.

Page 4: Text Data Mining

Schematic view of the Information Extraction Process

Page 5: Text Data Mining

Information Extraction

Page 6: Text Data Mining

Relevant IE Definitions

Entities:

Entities are the basic building blocks that can be found in text documents (an object of interest).

Examples: people, companies, locations, genes, and drugs.

Attributes:

Attributes are features of the extracted entities (a property of an entity such as its name, alias, descriptor, or type).

Examples: the title of a person, the age of a person, and the type of an organization.

Page 7: Text Data Mining

Relevant IE Definitions

Facts: Facts are the relations that exist between entities (a relationship held between two or more entities, such as the position of a person in a company).

Example: an employment relationship between a person and a company, or phosphorylation between two proteins.

Events: An event is an activity or occurrence of interest in which entities participate.

An activity involving several entities, such as a terrorist act, an airline crash, a management change, a new product introduction, a merger between two companies, a birthday, and so on.

Page 8: Text Data Mining
Page 9: Text Data Mining

IE - Method

- Extract raw text (HTML, PDF, PS, GIF, ...)
- Tokenize
- Detect term boundaries
  e.g., "We extracted alpha 1 type XIII collagen from …", "Their house council recommended…"
- Detect sentence boundaries
- Tag parts of speech (POS)
  e.g., John/noun saw/verb Mary/noun.
- Tag named entities
  e.g., person, place, organization, gene, chemical
- Parse
- Determine co-reference
- Extract knowledge
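For concreteness, here is a minimal sketch of the early steps of such a pipeline (tokenization, sentence boundary detection, POS tagging, named-entity tagging). The choice of spaCy and the model name "en_core_web_sm" are assumptions; the slides do not prescribe a toolkit, and the model must be downloaded separately.

```python
# Minimal sketch of the early IE pipeline steps using spaCy (an assumed
# toolkit choice; download the model with: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John saw Mary at the World Health Organisation on 15 January 1984.")

for sent in doc.sents:              # sentence boundary detection
    for tok in sent:                # tokens with POS tags and lemmas
        print(tok.text, tok.pos_, tok.lemma_)

for ent in doc.ents:                # named entities (person, org, date, ...)
    print(ent.text, ent.label_)
```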

Page 10: Text Data Mining

Architecture: Components of IE Systems

An IE system contains:
- Core linguistic components, adapted to or generally useful for NLP tasks
- IE-specific components, which address the core IE tasks

These split into domain-independent and domain-specific components. The following steps are performed in the domain-independent part:

Meta-data analysis: extraction of the title, body, structure of the body (identification of paragraphs), and the date of the document.

Tokenization: segmentation of the text into word-like units, called tokens, and classification of their type, e.g., identification of capitalized words, words written in lowercase letters, hyphenated words, punctuation signs, numbers, etc.

Page 11: Text Data Mining

Architecture: Components of IE Systems

Morphological analysis: extraction of morphological information from tokens which constitute potential word forms: the base form (or lemma), part of speech, and other morphological tags depending on the part of speech; e.g., verbs have features such as tense, mood, aspect, person, etc.

Words which are ambiguous with respect to certain morphological categories may undergo disambiguation. Typically part-of-speech disambiguation is performed.

Sentence/utterance boundary detection: segmentation of text into a sequence of sentences or utterances, each of which is represented as a sequence of lexical items together with their features.

Common named-entity extraction: detection of domain-independent named entities, such as temporal expressions, numbers and currency, geographical references, etc.

Page 12: Text Data Mining

Architecture: Components of IE Systems

Phrase recognition: recognition of small-scale, local structures such as noun phrases, verb groups, prepositional phrases, acronyms, and abbreviations.

Syntactic analysis: computation of a dependency structure (parse tree) of the sentence based on the sequence of lexical items and small-scale structures. Syntactic analysis may be deep or shallow:
- In the former case, compute all possible interpretations (parse trees) and grammatical relations within the sentence.
- In the latter case, the analysis is restricted to the identification of non-recursive structures, or structures with a limited amount of structural recursion, which can be identified with a high degree of certainty; linguistic phenomena which cause problems (ambiguities) are not handled and are represented with underspecified structures.

Page 13: Text Data Mining

Architecture: Components of IE Systems

The core IE tasks: NER, co-reference resolution, and detection of relations and events. These are typically domain-specific and are supported by domain-specific system components and resources.

Domain-specific processing is also supported on a lower level by detection of specialized terms in text.

Architecture: IE System

In the domain-specific core of the processing chain, a NER component is applied to identify the entities relevant in a given domain. Patterns may then be applied to:
- Identify text fragments which describe the target relations and events, and
- Extract the key attributes to fill the slots in the template representing the relation/event.

Page 14: Text Data Mining

IE System - Architecture

Page 15: Text Data Mining

Typical Architecture of an Information Extraction System

Page 16: Text Data Mining

Architecture: Components of IE Systems

A co-reference component identifies mentions that refer to the same entity.

Partially-filled templates are fused and validated using domain-specific inference rules in order to create full-fledged relation/event descriptions.

Several software packages provide tools that can be used in the process of developing an IE system, ranging from core linguistic processing modules (e.g., language detectors, sentence splitters) to general IE-oriented NLP frameworks.

Page 17: Text Data Mining

IE Task Types

Named Entity Recognition (NER)

Co-reference Resolution (CO)

Relation Extraction (RE)

Event Extraction (EE)

Page 18: Text Data Mining

Named Entity Recognition

Named Entity Recognition (NER) addresses the problem of the identification (detection) and classification of predefined types of named entities, such as organizations (e.g., 'World Health Organisation'), persons (e.g., 'Mohamad Gouse'), place names (e.g., 'the Baltic Sea'), temporal expressions (e.g., '15 January 1984'), numerical and currency expressions (e.g., '20 million euros'), etc.

The NER task may include extracting descriptive information from the text about the detected entities through filling of a small-scale template. For example, in the case of persons, it may include extracting the title, position, nationality, gender, and other attributes of the person.

NER also involves lemmatization (normalization) of the named entities, which is particularly crucial in highly inflective languages. For example, in Polish the name 'Mohamad Gouse' takes different inflected forms depending on grammatical case: 'Mohamad Gouse' (nominative), 'Mohamad Gouseego' (genitive), 'Mohamad Gouseemu' (dative), 'Mohamad Gouseiego' (accusative), 'Mohamad Gousem' (instrumental), 'Mohamad Gousem' (locative), 'Mohamad Gouse' (vocative).

Page 19: Text Data Mining

Co-Reference

Co-reference Resolution (CO) requires the identification of multiple (coreferring) mentions of the same entity in the text.

Entity mentions can be: (a) Named, in case an entity is referred to by name

e.g., ‘General Electric’ and ‘GE’ may refer to the same real-world entity. (b) Pronominal, in case an entity is referred to with a pronoun

e.g., in ‘John bought food. But he forgot to buy drinks.’, the pronoun he refers to John. (c) Nominal, in case an entity is referred to with a nominal phrase

e.g., in ‘Microsoft revealed its earnings. The company also unveiled future plans.’ the definite noun phrase The company refers to Microsoft.

(d) Implicit, as in the case of zero anaphora
e.g., in the Italian text fragment 'Berlusconi_i ha visitato il luogo del disastro. [∅_i] Ha sorvolato con l'elicottero.' (Berlusconi has visited the place of the disaster. [He] flew over with a helicopter.), the second sentence does not have an explicit realization of the reference to Berlusconi.

Page 20: Text Data Mining

Relation Extraction

Relation Extraction (RE) is the task of detecting and classifying predefined

relationships between entities identified in text.

For example:

EmployeeOf(Steve Jobs,Apple): a relation between a person and an

organisation, extracted from ‘Steve Jobs works for Apple’

LocatedIn(Smith,New York): a relation between a person and location,

extracted from ‘Mr. Smith gave a talk at the conference in New York’,

SubsidiaryOf(TVN, ITI Holding): a relation between two companies, extracted from 'Listed broadcaster TVN said its parent company, ITI Holdings, is considering various options for the potential sale.'

The set of relations that may be of interest is unlimited, but the set of relations within a given task is predefined and fixed, as part of the specification of the task.

Page 21: Text Data Mining

Event Extraction

Event Extraction (EE) refers to the task of identifying events in free text and deriving detailed and structured information about them, ideally identifying who did what to whom, when, where, through what methods (instruments), and why.

Usually, event extraction involves extraction of several entities and relationships between them.

For instance, extraction of information on terrorist attacks from the text fragment 'Masked gunmen armed with assault rifles and grenades attacked a wedding party in mainly Kurdish southeast Turkey, killing at least 44 people.' involves identification of the perpetrators (masked gunmen), victims (people), number of killed/injured (at least 44), weapons and means used (rifles and grenades), and location (southeast Turkey).

Another example is the extraction of information on new joint ventures, where the aim is to identify the partners, products, profits and capitalization of the joint venture.

EE is considered to be the hardest of the four IE tasks.

Page 22: Text Data Mining

IE Subtask: Named Entity Recognition

Detect and classify all proper names mentioned in text

What is a proper name? Depends on application.

People, places, organizations, times, amounts, etc.

Names of genes and proteins

Names of college courses

Page 23: Text Data Mining

NER Example

- Find the extent of each mention
- Classify each mention
- Sources of ambiguity:
  - Different strings that map to the same entity
  - Equivalent strings that map to different entities (e.g., U.S. Grant)

Page 24: Text Data Mining

Approaches to NER

Early systems: hand-written rules

Statistical systems

Supervised learning (HMMs, Decision Trees, MaxEnt, SVMs, CRFs)

Semi-supervised learning (bootstrapping)

Unsupervised learning (rely on lexical resources, lexical patterns, and

corpus statistics)

Page 25: Text Data Mining

A Sequence-Labeling Approach using CRFs

Input: a sequence of observations (tokens/words/text)
Output: a sequence of states (labels/classes)

B: Begin, I: Inside, O: Outside. There is some evidence that also including L (Last) and U (Unit length) is advantageous (Ratinov and Roth, 2009).

A CRF defines a conditional probability p(Y|X) over label sequences Y given an observation sequence X:
- No effort is wasted modeling the observations (in contrast to joint models like HMMs)
- Arbitrary features of the observations may be captured by the model
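As a toy illustration (not taken from the slides), the snippet below shows how BIO/BILOU labels encode entity spans over a token sequence, together with a small helper that recovers the spans back from the tags; the sentence and labels are made up.

```python
# Toy illustration of BIO labels for NER; BILOU (Ratinov & Roth, 2009) adds
# L (Last) and U (Unit length). The sentence and its tags are made up.
tokens = ["John", "Smith", "works", "for", "World", "Health", "Organisation", "."]
bio   = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
bilou = ["B-PER", "L-PER", "O", "O", "B-ORG", "I-ORG", "L-ORG", "O"]

def spans(tags):
    """Recover (start, end, type) entity spans from a BIO-style tag sequence."""
    out, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # trailing "O" flushes the last span
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                out.append((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return out

print(spans(bio))    # [(0, 2, 'PER'), (4, 7, 'ORG')]
print(spans(bilou))  # same spans; the extra L/U tags only refine the labels
```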

Page 26: Text Data Mining

Linear Chain CRFs

Simplest and most common graph structure, used for sequence modeling

Inference can be done efficiently using dynamic programming, in O(|X|·|Y|²) time.
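A minimal sketch of the dynamic program (Viterbi decoding) behind that complexity bound: for each of the |X| positions and each of the |Y| labels, the best predecessor is chosen among |Y| previous labels. The scores below are toy numbers, not a trained CRF.

```python
# Viterbi decoding for a linear-chain model with toy scores. Each position
# considers |Y| labels, and each label considers |Y| predecessors, giving
# the O(|X| * |Y|^2) cost quoted above.
labels = ["B", "I", "O"]

def viterbi(emit, trans):
    """emit[t][y]: score of label y at position t; trans[yp][y]: transition score."""
    n = len(emit)
    best = [{y: emit[0][y] for y in labels}]
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda yp: best[t - 1][yp] + trans[yp][y])
            best[t][y] = best[t - 1][prev] + trans[prev][y] + emit[t][y]
            back[t][y] = prev
    y = max(labels, key=lambda lbl: best[n - 1][lbl])
    path = [y]
    for t in range(n - 1, 0, -1):   # follow back-pointers to recover the sequence
        y = back[t][y]
        path.append(y)
    return list(reversed(path))

# Toy example: 3 tokens; the transition O -> I is strongly discouraged.
emit = [{"B": 2.0, "I": 0.0, "O": 0.5},
        {"B": 0.0, "I": 1.5, "O": 0.5},
        {"B": 0.0, "I": 0.2, "O": 1.0}]
trans = {yp: {y: 0.0 for y in labels} for yp in labels}
trans["O"]["I"] = -5.0
print(viterbi(emit, trans))   # ['B', 'I', 'O']
```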

Page 27: Text Data Mining

Linear Chain CRFs

Page 28: Text Data Mining

NER Features

Several feature families used, all time-shifted by -2, -1, 0, 1, 2:

The word itself

Capitalization and digit patterns (shape patterns)

8 lexicons entered by hand (e.g., honorifics, days, months)

15 lexicons obtained from web sites (e.g., countries, publicly-traded

companies, surnames, stopwords, universities)

25 lexicons automatically induced from the web (people names,

organizations, NGOs, nationalities)

Page 29: Text Data Mining

Limitations of Conventional NER (and IE)

Supervised learning

Expensive

Inconsistent

Worse for relations and events!

Fixed, narrow, pre-specified sets of entity types

Small, homogeneous corpora (newswire, seminar announcements)

Page 30: Text Data Mining

Evaluating Named Entity Recognition

Recall that recall is the ratio of the number of correctly labeled responses to the total that should have been labeled.

Precision is the ratio of the number of correctly labeled responses to the total labeled.

The F-measure provides a way to combine these two measures into a single metric.

recall = N_correct / N_key

precision = N_correct / (N_correct + N_incorrect)

F = ((β² + 1) · precision · recall) / (β² · precision + recall)

where N_correct is the number of correctly labeled responses, N_incorrect the number of incorrectly labeled responses, and N_key the number of entities in the answer key.
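A direct transcription of these formulas into code, with toy counts; the function and argument names are assumptions for illustration, not a standard API.

```python
# Precision, recall, and F-measure from labeled-response counts.
# beta > 1 weights recall more heavily; beta = 1 gives the usual F1.
def ner_scores(n_correct, n_incorrect, n_key, beta=1.0):
    recall = n_correct / n_key
    precision = n_correct / (n_correct + n_incorrect)
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# Toy counts: 8 correct labels, 2 spurious ones, 12 entities in the key.
print(ner_scores(n_correct=8, n_incorrect=2, n_key=12))
# (0.8, 0.666..., 0.727...)
```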

Page 31: Text Data Mining

What is Relation Extraction?

Typically defined as identifying relations between two entities

Relations      Subtypes         Examples
Affiliations   Personal         married to, mother of
               Organizational   spokesman for, president of
               Artifactual      owns, invented, produces
Geospatial     Proximity        near, on outskirts
               Directional      southeast of
Part-of        Organizational   a unit of, parent of
               Political        annexed, acquired

Page 32: Text Data Mining

Typical (Supervised) Approach

FindEntities(): named entity recognizer
Related?(): binary classifier that says whether two entities are involved in a relation
ClassifyRelation(): classifier that labels relations discovered by Related?()
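A skeleton of how these three functions compose into a pipeline. This is an assumed structure for illustration only: the entity finder and both classifiers are stubs, whereas a real system would use a trained NER model and trained classifiers over features of each entity pair.

```python
# Skeleton of the FindEntities / Related? / ClassifyRelation pipeline.
# All three components are stubs standing in for trained models.
from itertools import combinations

def find_entities(text):
    # stand-in NER: pretend every capitalized token is an entity
    return [tok for tok in text.split() if tok[0].isupper()]

def related(e1, e2, text):
    # stub binary classifier: are e1 and e2 involved in some relation?
    return True

def classify_relation(e1, e2, text):
    # stub relation labeler, e.g. EmployeeOf / LocatedIn / SubsidiaryOf
    return "EmployeeOf"

def extract_relations(text):
    entities = find_entities(text)
    out = []
    for e1, e2 in combinations(entities, 2):
        if related(e1, e2, text):
            out.append((classify_relation(e1, e2, text), e1, e2))
    return out

print(extract_relations("Steve Jobs works for Apple"))
```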

Page 33: Text Data Mining

Typical (Semi-Supervised) Approach

Page 34: Text Data Mining

NELL: Never-Ending Language Learner

NELL: Can computers learn to read?

Goal: create a system that learns to read the web

Reading task: Extract facts from text found on the web

Learning task: Iteratively improve reading competence.

http://rtw.ml.cmu.edu/rtw/

Page 35: Text Data Mining

Approach

Inputs

Ontology with target categories and relations (i.e., predicates)

Small number of seed examples for each

Set of constraints that couple the predicates

Large corpus of unlabeled documents

Output: new predicate instances

Semi-supervised bootstrap learning methods

Couple the learning of functions to constrain the problem

Exploit redundancy of information on the web.

Page 36: Text Data Mining

Coupled Semi-Supervised Learning

Page 37: Text Data Mining

Types of Coupling

1. Mutual exclusion (output constraint): mutually exclusive predicates can't both be satisfied by the same input x. E.g., x cannot be both a Person and a Sport.

2. Relation argument type-checking (compositional constraint): arguments of relations are declared to be of certain categories. E.g., CompanyIsInEconomicSector(Company, EconomicSector).

3. Unstructured and semi-structured text features (multi-view-agreement constraint): look at different views (as in co-training) and require that the classifiers agree. E.g., freeform textual contexts and semi-structured contexts.

Page 38: Text Data Mining

System Architecture

Page 39: Text Data Mining

Coupled Pattern Learner (CPL)

Free-text extractor that learns contextual patterns to extract predicate instances

Use mutual exclusion and type-checking constraints to filter candidate instances

Rank instances and patterns by leveraging redundancy: if an instance or pattern occurs more frequently, it's ranked higher

Page 40: Text Data Mining

Coupled SEAL (CSEAL)

SEAL (Set Expander for Any Language) is a wrapper induction algorithm

Operates over semi-structured text such as web pages

Constructs page-specific extraction rules (wrappers) that are human- and

markup-language independent

CSEAL adds mutual-exclusion and type-checking constraints

Page 41: Text Data Mining

CSEAL Wrappers

Seeds: Ford, Nissan, Toyota
arg1 is a placeholder for extracting instances

Page 42: Text Data Mining

Open IE and TextRunner

Motivations:
- Web corpora are massive, introducing scalability concerns
- Relations of interest are unanticipated, diverse, and abundant
- Use of "heavy" linguistic technology (NERs and parsers) doesn't work well

Input: a large, heterogeneous Web corpus
- 9M web pages, 133M sentences
- No pre-specified set of relations

Output: a huge set of extracted relations
- 60.5M tuples, 11.3M high-probability tuples
- Tuples are indexed for searching

Page 43: Text Data Mining

TextRunner Architecture

- The Learner outputs a classifier that labels trustworthy extractions
- The Extractor finds and outputs trustworthy extractions
- The Assessor normalizes and scores the extractions

Page 44: Text Data Mining

Architecture: Self-Supervised Learner

1. Automatically labels training data
- Uses a parser to induce dependency structures
- Parses a small corpus of several thousand sentences
- Identifies and labels a set of positive and negative extractions using relation-independent heuristics
- An extraction is a tuple t = (e_i, r_{i,j}, e_j)
- Entities are base noun phrases
- Uses the parse to identify potential relations

2. Trains a classifier
- Domain-independent, simple non-parse features
- E.g., POS tags, phrase chunks, regexes, stopwords, etc.

Page 45: Text Data Mining

Architecture: Single-Pass Extractor

1. POS tag each word

2. Identify entities using lightweight NP chunker

3. Identify relations

4. Classify them

Page 46: Text Data Mining

Architecture: Redundancy-Based Assessor

Takes the tuples and performs:
- Normalization, deduplication, synonym resolution
- Assessment: the number of distinct sentences from which each extraction was found serves as a measure of confidence

Entities and relations are indexed using Lucene.

Page 47: Text Data Mining

Template Filling

The task of template filling is to find documents that evoke such situations and then fill the slots in templates with appropriate material.

These slot fillers may consist of:
- Text segments extracted directly from the text, or
- Concepts that have been inferred from text elements via some additional processing (times, amounts, entities from an ontology, etc.).

Page 48: Text Data Mining

Applications of IE

Infrastructure for IR and for Categorization

Information Routing

Event Based Summarization

Automatic Creation of Databases

Company acquisitions

Sports scores

Terrorist activities

Job listings

Corporate titles and addresses

Page 49: Text Data Mining

Inductive Algorithms for IE

Rule Induction algorithms produce symbolic IE rules based on a corpus of

annotated documents.

WHISK

BWI

The (LP)2 Algorithm

The inductive algorithms are suitable for semi-structured domains, where

the rules are fairly simple, whereas when dealing with free text documents

(such as news articles) the probabilistic algorithms perform much better.

Page 50: Text Data Mining

WHISK

WHISK is a supervised learning algorithm that uses hand-tagged examples for learning information extraction rules.

- Works for structured, semi-structured, and free text.
- Extracts both single-slot and multi-slot information.
- Doesn't require syntactic preprocessing for structured and semi-structured text; a syntactic analyzer and semantic tagger are recommended for free text.
- The extraction pattern learned by WHISK is in the form of a limited regular expression, balancing the tradeoff between expressiveness and efficiency.

Example: the IE task of extracting the neighborhood, number of bedrooms, and price from the text of a rental ad.

Page 51: Text Data Mining

WHISK

An Example from the Rental Ads domain

An example extraction pattern which can be learned by WHISK is,

*(Neighborhood) *(Bedroom) * ‘$’(Number)

Neighborhood, Bedroom, and Number – Semantic classes specified by domain experts.

WHISK learns the extraction rules using a top-down covering algorithm. The algorithm begins learning a single rule by starting with an empty rule; it then adds one term at a time until either no negative examples are covered by the rule or the pre-pruning criterion has been satisfied.

Page 52: Text Data Mining

We add terms to specialize the rule in order to reduce its Laplacian expected error, defined as

Laplacian = (e + 1) / (n + 1)

where e is the number of negative extractions and n is the number of positive extractions made by the rule on the training instances.

Example:

For instance, from the text "3 BR, upper flr of turn of ctry. Incl gar, grt N. Hill loc 995$. (206)-999-9999," the rule would extract the frame Bedrooms – 3, Price – 995.

The "*" character in the pattern will match any number of characters (an unlimited jump).

Patterns enclosed in parentheses become numbered elements in the output pattern; hence (Digit) is $1 and (Number) is $2.
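As a rough illustration only (not WHISK's own rule syntax or learning procedure), a limited-regular-expression rule like the one above can be approximated with a Python regex; the pattern below follows the ad text as quoted on the slide (number of bedrooms first, price written as "995$").

```python
# Sketch: a WHISK-style rule rendered as a Python regex. Each slot is a
# capture group; the non-greedy gaps play the role of WHISK's '*' jumps.
import re

rule = re.compile(r"(?P<bedrooms>\d+)\s*BR.*?(?P<price>\d+)\s*\$", re.DOTALL)

ad = "3 BR, upper flr of turn of ctry. Incl gar, grt N. Hill loc 995$. (206)-999-9999"
m = rule.search(ad)
if m:
    print({"Bedrooms": m.group("bedrooms"), "Price": m.group("price")})
    # {'Bedrooms': '3', 'Price': '995'}
```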

Page 53: Text Data Mining

Boosted Wrapper Induction (BWI)

The BWI is a system that utilizes wrapper induction techniques for

traditional Information Extraction.

IE is treated as a classification problem that entails trying to approximate

two boundary functions Xbegin(i ) and Xend(i ).

Xbegin(i ) is equal to 1 if the ith token starts a field that is part of the frame to

be extracted and 0 otherwise. Xend(i ) is defined in a similar way for tokens that end a field.

The learning algorithm approximates each X function by taking a set of pairs of the form (i, X(i)) as training data.

Page 54: Text Data Mining

Each field is extracted by a wrapper W = <F, A, H> where:
- F is a set of begin boundary detectors
- A is a set of end boundary detectors
- H(k) is the probability that the field has length k

A boundary detector is just a sequence of tokens with wildcards (a kind of regular expression).

W(i, j) is a naive Bayesian approximation of the probability that a field starts at token i and ends at token j:

W(i, j) = F(i) · A(j) · H(j − i + 1), and 0 otherwise,

with F(i) = Σ_k C_{F_k} F_k(i) and A(i) = Σ_k C_{A_k} A_k(i).

The BWI algorithm learns the detectors by using a greedy algorithm that extends the prefix and suffix patterns while there is an improvement in accuracy. The sets F(i) and A(i) are generated from the detectors by using the AdaBoost algorithm.

The detector pattern can include specific words and regular expressions that work on a set of wildcards such as <num>, <Cap>, <LowerCase>, <Punctuation>, and <Alpha>.

Page 55: Text Data Mining

(LP)2 Algorithm

The (LP)2 algorithm learns from an annotated corpus and induces two sets of rules:
- Tagging rules, generated by a bottom-up generalization process
- Correction rules, which correct mistakes and omissions made by the tagging rules

A tagging rule is a pattern that contains conditions on the words preceding the place where a tag is to be inserted and conditions on the words that follow the tag.

Conditions can be words, lemmas, lexical categories (such as digit, noun, verb, etc.), case (lower or upper), or semantic categories (such as time-id, cities, etc.).

The (LP)2 algorithm is a covering algorithm that tries to cover all training examples.

The initial tagging rules are generalized by dropping conditions.

Page 56: Text Data Mining

IE and Text Summarization

From the user's perspective:
- IE can be glossed as "I know what specific pieces of information I want – just find them for me!"
- Summarization can be glossed as "What's in the text that is interesting?"

Technically, from the system builder's perspective, the two applications blend into each other. The most pertinent technical aspects are:
- Are the criteria of interestingness specified at run-time or by the system builder?
- Is the input a single document or multiple documents?
- Is the extracted information manipulated, either by simple content delineation routines or by complex inferences, or just delivered verbatim?
- What is the grain size of the extracted units of information – individual entities and events, or blocks of text?
- Is the output formulated in language, or in a computer-internal knowledge representation?

Page 57: Text Data Mining

Text Summarization

An information access technology that, given a document or a set of related documents, extracts the most important content from the source(s), taking into account the user or task at hand, and presents this content as a well-formed and concise text.

Page 58: Text Data Mining

Text Summarization Techniques

Topic Representation

Influence of Context

Indicator Representations

Pattern Extraction

Page 59: Text Data Mining

Text Summarization

Input: one or more text documents
Output: a paragraph-length summary

Sentence extraction is the standard method:
- Using features such as keywords, sentence position in the document, and cue phrases
- Identify sentences within documents that are salient
- Extract and string sentences together

Machine learning for extraction:
- Corpus of document/summary pairs
- Learn the features that best determine important sentences
- Summarization of scientific articles
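A minimal sketch of frequency-based sentence extraction along these lines; the stopword list and scoring are simplified assumptions, and a real system would add the positional and cue-phrase features mentioned above.

```python
# Score each sentence by the frequency of its content words in the document,
# then return the top-scoring sentences in their original order.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "for", "on"}

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    def score(s):
        return sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS)
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])        # restore document order
    return " ".join(sentences[i] for i in keep)

doc = ("Information extraction pulls structured facts from text. "
       "Summarization selects the most important content. "
       "Both rely on identifying salient sentences and entities in text.")
print(summarize(doc, 2))
```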

Page 60: Text Data Mining

A Summarization Machine

[Figure: a summarization machine takes a document (or multiple documents) plus a query and produces extracts and abstracts. Dimensions shown: extract vs. abstract, indicative vs. informative, generic vs. query-oriented, background vs. "just the news", length (headline, very brief, brief, long), and compression rate (10%, 50%, 100%). Intermediate representations include case frames, templates, core concepts, core events, relationships, clause fragments, and index terms.]

Page 61: Text Data Mining

The Modules of the Summarization Machine

[Figure: the modules of the summarization machine: extraction, interpretation, generation, and filtering. They connect document extracts, multi-document extracts, and abstracts via intermediate representations such as case frames, templates, core concepts, core events, relationships, clause fragments, and index terms.]

Page 62: Text Data Mining

What is Summarization?

Data as input (database, software trace, expert system), text summary as output

Text as input (one or more articles), paragraph summary as output

Multimedia in input or output

Summaries must convey maximal information in minimal space

Involves three stages (typically):
- Content identification: find/extract the most important material
- Conceptual organization
- Realization

Page 63: Text Data Mining

Types of summaries

Purpose: indicative, informative, and critical summaries

Form:
- Extracts (representative paragraphs/sentences/phrases)
- Abstracts: "a concise summary of the central subject matter of a document"

Dimensions: single-document vs. multi-document

Context: query-specific vs. query-independent

Generic vs. query-oriented: provides the author's view vs. reflects the user's interest.

Page 64: Text Data Mining

Genres

Headlines

Outlines

Minutes

Biographies

Abridgments

Sound bites

Movie summaries

Chronologies, etc.

Page 65: Text Data Mining

Aspects that Describe Summaries

Input:
- subject type: domain
- genre: newspaper articles, editorials, letters, reports, ...
- form: regular text structure vs. free-form
- source size: single doc vs. multiple docs (few, many)

Purpose:
- situation: embedded in a larger system (MT, IR) or not?
- audience: focused or general
- usage: IR, sorting, skimming, ...

Output:
- completeness: include all aspects, or focus on some?
- format: paragraph, table, etc.
- style: informative, indicative, aggregative, critical, ...

Page 66: Text Data Mining

Single-Document Summarization: System Architecture

[Figure: the input document is decomposed and passed through extraction (producing extracted sentences), sentence reduction, sentence combination, and generation to yield the output summary; supporting resources include a corpus, a lexicon, a parser, and a co-reference module.]

Page 67: Text Data Mining

Multi-Document Summarization

Monitor a variety of online information sources (news, multilingual, email)

Gather information on events across sources and time (same day, multiple sources; across time)

Summarize: highlighting similarities, new information, different perspectives, and user-specified interests, in real time

Page 68: Text Data Mining

Example System: SUMMARIST

Three stages:

1. Topic Identification Modules: Positional Importance, Cue Phrases (under

construction), Word Counts, Discourse Structure (under construction), ...

2. Topic Interpretation Modules: Concept Counting /Wavefront, Concept

Signatures (being extended)

3. Summary Generation Modules (not yet built): Keywords, Template Gen, Sent.

Planner & Realizer

SUMMARY = TOPIC ID + INTERPRETATION + GENERATION

Page 69: Text Data Mining

From extract to abstract:

topic interpretation or concept fusion.

Experiment (Marcu, 98): took 10 newspaper texts with human abstracts; asked 14 judges to extract the corresponding clauses from the texts, to cover the same content.

Comparing the word lengths of extracts to abstracts: extract_length ≈ 2.76 × abstract_length!

[Figure: blocks of source text are mapped, via Topic Interpretation, to a much shorter abstract.]

Page 70: Text Data Mining

Some Types of Interpretation

Concept generalization:
Sue ate apples, pears, and bananas → Sue ate fruit

Meronymy replacement:
Both wheels, the pedals, saddle, chain… → the bike

Script identification:
He sat down, read the menu, ordered, ate, paid, and left → He ate at the restaurant

Metonymy:
A spokesperson for the US Government announced that… → Washington announced that…

Page 71: Text Data Mining

General Aspects of Interpretation

Interpretation occurs at the conceptual level…

…words alone are polysemous (e.g., bat: animal or sports instrument) and combine for meaning (an alleged murderer is not a murderer).

For interpretation, you need world knowledge…

…the fusion inferences are not in the text!

Page 72: Text Data Mining

Pattern Extraction

Extract a pattern for each event in the training data, using part-of-speech and mention tags.

Example: "Japanese political leaders"

Text:     Japanese   political   leaders
Ents:     GPE        —           PER
POS:      NN         JJ          NN
Pattern:  GPE        JJ          PER

Page 73: Text Data Mining

Summarization - Scope

Data preparation:
- Collect large sets of texts with abstracts, all genres.
- Build large corpora of <Text, Abstract, Extract> tuples.
- Investigate relationships between extracts and abstracts (using <Extract, Abstract> tuples).

Types of summary:
- Determine the characteristics of each type.

Topic Identification:
- Develop new identification methods (discourse, etc.).
- Develop heuristics for method combination (train heuristics on <Text, Extract> tuples).

Page 74: Text Data Mining

Summarization - Scope

Concept Interpretation (Fusion):
- Investigate types of fusion (semantic, evaluative, …).
- Create large collections of fusion knowledge/rules (e.g., signature libraries, generalization and partonymic hierarchies, metonymy rules, …).
- Study incorporation of the user's knowledge in interpretation.

Generation:
- Develop Sentence Planner rules for dense packing of content into sentences (using <Extract, Abstract> pairs).

Evaluation:
- Develop better evaluation metrics for each type of summary.

Page 75: Text Data Mining

Apriori Algorithm

In computer science and data mining, Apriori is a classic algorithm for learning association rules.

Apriori is designed to operate on databases containing transactions, for example, collections of items bought by customers, or details of website visits.

The algorithm attempts to find itemsets which are common to at least a minimum number C of the transactions in the database (the minimum support threshold).

Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.

The algorithm terminates when no further successful extensions are found.

Apriori uses breadth-first search and a hash tree structure to count candidate itemsets efficiently.

Page 76: Text Data Mining

Find rules in two stages

Agrawal et al. divided the problem of finding good rules into two phases:

1. Find all itemsets with a specified minimal support (coverage). An itemset is just a specific set of items, e.g. {apples, cheese}. The Apriori algorithm can efficiently find all itemsets whose coverage is above a given minimum.

2. Use these itemsets to help generate interesting rules. Having done stage 1, we have considerably narrowed down the possibilities, and can do reasonably fast processing of the large itemsets to generate candidate rules.
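In stage 2, a candidate rule A → B formed from a frequent itemset is typically scored by its confidence, support(A ∪ B) / support(A); rules below a chosen confidence threshold are discarded. A tiny sketch with made-up support values:

```python
# Confidence of association rules from (made-up) itemset supports.
support = {frozenset({"apples"}): 0.40,
           frozenset({"cheese"}): 0.30,
           frozenset({"apples", "cheese"}): 0.20}

def confidence(antecedent, consequent):
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    return support[ab] / support[a]

print(confidence({"apples"}, {"cheese"}))   # 0.5
print(confidence({"cheese"}, {"apples"}))   # 0.666...
```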

Page 77: Text Data Mining

Terminology

k-itemset : a set of k items. E.g.

{beer, cheese, eggs} is a 3-itemset

{cheese} is a 1-itemset

{honey, ice-cream} is a 2-itemset

support: an itemset has support s% if s% of the records in the DB contain that

itemset.

minimum support: the Apriori algorithm starts with the specification of a

minimum level of support, and will focus on itemsets with this level or

above.

Page 78: Text Data Mining

Terminology

large itemset: doesn’t mean an itemset with many items. It means one

whose support is at least minimum support.

Lk : the set of all large k-itemsets in the DB.

Ck : a set of candidate large k-itemsets. In the algorithm we will look at, it

generates this set, which contains all the k-itemsets that might be large,

and then eventually generates the set above.

Page 79: Text Data Mining

Terminology

sets: Let A be a set (A = {cat, dog}) and

let B be a set (B = {dog, eel, rat}) and

let C = {eel, rat}

I use ‘A + B’ to mean A union B.

So A + B = {cat, dog, eel, rat}

When X is a subset of Y, I use Y – X to mean the set of things in Y which are not in X.

E.g. B – C = {dog}

Page 80: Text Data Mining

Apriori Algorithm

Find all large 1-itemsets
For (k = 2; while L_{k-1} is non-empty; k++)
{
    C_k = apriori-gen(L_{k-1})
    For each c in C_k, initialise c.count to zero
    For all records r in the DB
        { C_r = subset(C_k, r); For each c in C_r, c.count++ }
    Set L_k := all c in C_k whose count >= minsup
} /* end -- return all of the L_k sets */

The algorithm returns all of the (non-empty) L_k sets, which gives us an excellent start in finding interesting rules (although the large itemsets themselves will usually be very interesting and useful).
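A compact sketch of this level-wise loop in Python, using an absolute minimum-support count; the toy items echo the ones used on the terminology slides, and apriori-gen is approximated here by a join plus subset-pruning step.

```python
# Sketch of Apriori: L1 from single items, then candidate generation and a
# support-counting pass per level, until no level produces frequent itemsets.
from itertools import combinations

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                       # count 1-itemsets
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s for s, c in counts.items() if c >= minsup}
    frequent = {s: counts[s] for s in L}
    k = 2
    while L:
        # join L(k-1) with itself; keep size-k unions whose (k-1)-subsets
        # are all frequent (the Apriori property)
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in L for sub in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {c for c, n in counts.items() if n >= minsup}
        frequent.update({c: counts[c] for c in L})
        k += 1
    return frequent

db = [{"beer", "cheese", "eggs"},
      {"beer", "cheese"},
      {"beer", "eggs"},
      {"cheese", "eggs", "honey"},
      {"beer", "cheese", "eggs", "honey"}]
for itemset, count in sorted(apriori(db, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```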

Page 81: Text Data Mining

Example: Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2.

Page 82: Text Data Mining

Apriori Merits/Demerits

Merits

Uses large itemset property

Easily parallelized

Easy to implement

Demerits

Assumes transaction database is memory resident.

Requires many database scans.

Page 83: Text Data Mining

Summary

Association Rules form a widely applied data mining approach.

Association Rules are derived from frequent itemsets.

The Apriori algorithm is an efficient algorithm for finding all frequent

itemsets.

The Apriori algorithm implements level-wise search using the frequent itemset property.

The Apriori algorithm can be additionally optimized.

There are many measures for association rules.

Page 84: Text Data Mining

FP-Growth Algorithm

Page 85: Text Data Mining

Frequent Pattern Mining: An Example

Given a transaction database DB and a minimum support threshold ξ, find all frequent patterns (itemsets) with support no less than ξ.

Input (DB):
TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}

Minimum support: ξ = 3

Output: all frequent patterns, i.e., f, a, …, fa, fac, fam, fm, am, …

Problem statement: how can we efficiently find all frequent patterns?

Page 86: Text Data Mining

Overview of FP-Growth: Ideas

Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:
- highly compacted, but complete for frequent pattern mining
- avoids costly repeated database scans

Develop an efficient, FP-tree-based frequent pattern mining method (FP-growth):
- a divide-and-conquer methodology: decompose mining tasks into smaller ones
- avoid candidate generation: sub-database test only.

Page 87: Text Data Mining

FP-tree: Construction and Design

Page 88: Text Data Mining

Construct FP-tree

Two Steps:

1. Scan the transaction DB for the first time, find frequent items (single item

patterns) and order them into a list L in frequency descending order.

e.g., L={f:4, c:4, a:3, b:3, m:3, p:3}

In the format of (item-name, support)

2. For each transaction, order its frequent items according to the order in L;

Scan DB the second time, construct FP-tree by putting each frequency

ordered transaction onto it.
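A sketch of this two-pass construction in Python, using the transaction DB and ξ = 3 from the running example. The header table and node-links are omitted for brevity, and ties in the frequency ordering may come out in a different order than on the slides.

```python
# Two-pass FP-tree construction: pass 1 builds the frequency-ordered list L,
# pass 2 inserts each transaction's frequent items (in L-order) into the tree.
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, minsup):
    freq = Counter(item for t in db for item in t)            # pass 1
    L = [i for i, c in freq.most_common() if c >= minsup]
    order = {item: rank for rank, item in enumerate(L)}
    root = Node(None, None)
    for t in db:                                              # pass 2
        path = sorted((i for i in t if i in order), key=order.get)
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, L

db = [{"f", "a", "c", "d", "g", "i", "m", "p"},
      {"a", "b", "c", "f", "l", "m", "o"},
      {"b", "f", "h", "j", "o"},
      {"b", "c", "k", "s", "p"},
      {"a", "f", "c", "e", "l", "p", "m", "n"}]
root, L = build_fp_tree(db, 3)
print(L)                                                   # e.g. ['f', 'c', 'a', 'b', 'm', 'p']
print({i: n.count for i, n in root.children.items()})      # e.g. {'f': 4, 'c': 1}
```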

Page 89: Text Data Mining

FP-tree Example: Step 1

Step 1: Scan the DB for the first time to generate L (a by-product of the first scan of the database).

Item frequency: f:4, c:4, a:3, b:3, m:3, p:3

TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}

Page 90: Text Data Mining

FP-tree Example: Step 2

Step 2: Scan the DB for the second time, and order the frequent items in each transaction:

TID   Items bought                 (ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o}              {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

Page 91: Text Data Mining

FP-tree Example: Step 2

Step 2: construct the FP-tree.

[Figure: inserting {f, c, a, m, p} into an empty tree creates the path f:1 → c:1 → a:1 → m:1 → p:1; inserting {f, c, a, b, m} shares the prefix f, c, a (counts become f:2, c:2, a:2) and adds the branch b:1 → m:1.]

NOTE: Each transaction corresponds to one path in the FP-tree.

Page 92: Text Data Mining

FP-tree Example: Step 2

Step 2: construct the FP-tree (continued).

[Figure: inserting {f, b} adds b:1 under f (f becomes f:3); inserting {c, b, p} adds a new branch c:1 → b:1 → p:1 under the root; inserting the final {f, c, a, m, p} yields the complete tree with root children f:4 and c:1, and paths f:4 → c:3 → a:3 → m:2 → p:2, f → c → a → b:1 → m:1, f → b:1, and c:1 → b:1 → p:1. Node-links connect nodes carrying the same item.]

Page 93: Text Data Mining

Construction Example

Final FP-tree

[Figure: the final FP-tree together with its header table listing the items f, c, a, b, m, p and the head of each item's node-link chain. Root children are f:4 and c:1; branches are f:4 → c:3 → a:3 → (m:2 → p:2 and b:1 → m:1), f:4 → b:1, and c:1 → b:1 → p:1.]

Page 94: Text Data Mining

FP-Tree Definition

An FP-tree is a frequent pattern tree. Formally, an FP-tree is a tree structure defined as follows:

1. One root labeled as "null", a set of item-prefix subtrees as the children of the root, and a frequent-item header table.

2. Each node in the item-prefix subtrees has three fields:
- item-name: registers which item this node represents,
- count: the number of transactions represented by the portion of the path reaching this node,
- node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.

3. Each entry in the frequent-item header table has two fields: item-name, and head of node-link, which points to the first node in the FP-tree carrying that item-name.

Page 95: Text Data Mining

Advantages of the FP-tree Structure

The most significant advantage of the FP-tree

Scan the DB only twice and twice only.

Completeness:

The FP-tree contains all the information related to mining frequent patterns

(given the min-support threshold).

Compactness:

The size of the tree is bounded by the occurrences of frequent items

The height of the tree is bounded by the maximum number of items in a

transaction

Page 96: Text Data Mining

FP-growth: Mining Frequent Patterns Using the FP-tree

Page 97: Text Data Mining

Mining Frequent Patterns Using FP-tree

General idea (divide-and-conquer)

Recursively grow frequent patterns using the FP-tree: looking for shorter

ones recursively and then concatenating the suffix:

For each frequent item, construct its conditional pattern base, and then its

conditional FP-tree;

Repeat the process on each newly created conditional FP-tree until the

resulting FP-tree is empty, or it contains only one path (single path will

generate all the combinations of its sub-paths, each of which is a frequent

pattern)

Page 98: Text Data Mining

3 Major Steps

Starting the processing from the end of list L:

Step 1:

Construct conditional pattern base for each item in the header table

Step 2:

Construct conditional FP-tree from each conditional pattern base

Step 3:

Recursively mine conditional FP-trees and grow frequent patterns

obtained so far. If the conditional FP-tree contains a single path, simply

enumerate all the patterns

Page 99: Text Data Mining

Step 1: Construct the Conditional Pattern Base

- Start at the bottom of the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item
- Accumulate all transformed prefix paths of that item to form its conditional pattern base

Conditional pattern bases:

Item   Conditional pattern base
p      fcam:2, cb:1
m      fca:2, fcab:1
b      fca:1, f:1, c:1
a      fc:3
c      f:3
f      { }

Page 100: Text Data Mining

Properties of FP-Tree

Node-link property

For any frequent item ai, all the possible frequent patterns that contain ai

can be obtained by following ai's node-links, starting from ai's head in

the FP-tree header.

Prefix path property

To calculate the frequent patterns for a node a_i in a path P, only the prefix sub-path of a_i in P needs to be accumulated, and its frequency count should carry the same count as node a_i.

Page 101: Text Data Mining

Step 2: Construct the Conditional FP-tree

For each pattern base:
- Accumulate the count for each item in the base
- Construct the conditional FP-tree for the frequent items of the pattern base

Example: the m-conditional pattern base is fca:2, fcab:1. Accumulating counts gives f:3, c:3, a:3 (b:1 falls below the threshold), so the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3.

[Figure: the paths containing m in the global FP-tree, and the resulting m-conditional FP-tree.]

Page 102: Text Data Mining

Step 3: Recursively Mine the Conditional FP-trees

[Figure: recursive mining starting from the m-conditional FP-tree (fca:3). Adding "a" gives the conditional FP-tree of "am": (fc:3); adding "c" gives that of "cm": (f:3); adding "f" gives that of "fm": 3. Continuing, the conditional FP-trees of "cam" and "fcm" are (f:3) and 3, and that of "fam" is 3. Each step emits a frequent pattern, and the process finally yields the frequent pattern fcam.]

Page 103: Text Data Mining

Principles of FP-Growth

Pattern growth property

Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.

Is "fcabm" a frequent pattern?
- "fcab" is a branch of m's conditional pattern base
- "b" is NOT frequent in transactions containing "fcab"
- Therefore "bm" is NOT a frequent itemset.

Page 104: Text Data Mining

Conditional Pattern Bases and Conditional FP-trees

Item   Conditional pattern base        Conditional FP-tree
p      {(fcam:2), (cb:1)}              {(c:3)} | p
m      {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)} | m
b      {(fca:1), (f:1), (c:1)}         Empty
a      {(fc:3)}                        {(f:3, c:3)} | a
c      {(f:3)}                         {(f:3)} | c
f      Empty                           Empty

(Items listed in the order of L.)

Page 105: Text Data Mining

Single FP-tree Path Generation

Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.

Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3. All frequent patterns concerning m are combinations of {f, c, a} with m: m, fm, cm, am, fcm, fam, cam, fcam.
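The single-path case in a few lines of Python: enumerating every non-empty combination of the items on the path and appending the suffix m reproduces the eight patterns listed above.

```python
# Enumerate all combinations of sub-paths of a single-path conditional FP-tree.
from itertools import combinations

path, suffix = ["f", "c", "a"], "m"
patterns = [suffix] + ["".join(combo) + suffix
                       for r in range(1, len(path) + 1)
                       for combo in combinations(path, r)]
print(patterns)   # ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']
```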

Page 106: Text Data Mining

Summary of FP-Growth Algorithm

Mining frequent patterns can be viewed as first mining 1-itemset and

progressively growing each 1-itemset by mining on its conditional pattern base

recursively

Transform a frequent k-itemset mining problem into a sequence of k frequent 1-

itemset mining problems via a set of conditional pattern bases

Page 107: Text Data Mining

Efficiency Analysis

Facts: usually

1. FP-tree is much smaller than the size of the DB

2. Pattern base is smaller than original FP-tree

3. Conditional FP-tree is smaller than pattern base

Mining process works on a set of usually much smaller pattern

bases and conditional FP-trees

Divide-and-conquer and dramatic scale of shrinking

Page 108: Text Data Mining

Performance Improvement

- Projected DBs: partition the DB into a set of projected DBs, then construct an FP-tree and mine it in each projected DB.
- Disk-resident FP-tree: store the FP-tree on hard disk using a B+-tree structure to reduce I/O cost.
- FP-tree materialization: a low ξ may usually satisfy most of the mining queries in the FP-tree construction.
- FP-tree incremental update: how should an FP-tree be updated when new data arrive? Either reconstruct the FP-tree, or do not update the FP-tree.