
Page 1: Web-based Information Architectures

Web-based Information Architectures

Jian Zhang

Page 2: Web-based Information Architectures

Today’s Topics

• Term Weighting Scheme

• Vector Space Model & GVSM

• Evaluation of IR

• Rocchio Feedback

• Web Spider Algorithm

• Text Mining: Named Entity Identification

• Data Mining

• Text Categorization (kNN)

Page 3: Web-based Information Architectures

Term Weighting Scheme

• TW = TF * IDF
  – TF part = f1(tf(term, doc))
  – IDF part = f2(idf(term)) = f2(N/df(term))
  – E.g., f1(tf) = normalized_tf = tf/max_tf; f2(idf) = log2(idf) (sketched below)
  – E.g., f1(tf) = tf; f2(idf) = 1

NOTE: definition of DF! df(term) is the number of documents in the collection that contain the term, not the term's total number of occurrences.
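A minimal sketch of the first weighting variant in Python, assuming precomputed document frequencies (the function and argument names are illustrative, not from the lecture):

```python
import math
from collections import Counter

def term_weights(doc_tokens, doc_freq, num_docs):
    """TW = TF * IDF with f1(tf) = tf/max_tf and f2(idf) = log2(N/df)."""
    tf = Counter(doc_tokens)                          # raw term frequencies
    max_tf = max(tf.values())
    weights = {}
    for term, count in tf.items():
        norm_tf = count / max_tf                      # f1: normalized TF
        idf = math.log2(num_docs / doc_freq[term])    # f2: log2 of N/df
        weights[term] = norm_tf * idf
    return weights
```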

Page 4: Web-based Information Architectures

Document & Query Representation

• Bag of words, Vector Space Model (VSM)

• Word normalization (a sketch follows this list)
  – Stopword removal
  – Stemming

• Proximity phrases

• Each element of the vector is the Term Weight of that term w.r.t. the document/query.
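An illustrative normalization pipeline; the stopword list is a tiny stand-in and the suffix-stripping step is a crude placeholder for a real stemmer such as Porter's:

```python
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # tiny stand-in list

def normalize(text):
    """Bag of words: lowercase, drop stopwords, then 'stem'."""
    tokens = [w.lower() for w in text.split()]
    tokens = [w for w in tokens if w not in STOPWORDS]
    # crude placeholder for real stemming (e.g. the Porter stemmer)
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in tokens]
```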

Page 5: Web-based Information Architectures

Similarity Measure

• Dot Product:

  u = [u1, u2, …, un];  v = [v1, v2, …, vn]

  u · v = Σ(i=1..n) ui * vi

Page 6: Web-based Information Architectures

Similarity Measure

• Cosine Similarity:

  cos(u, v) = (u · v) / (|u| |v|)

  where u · v = Σ(i=1..n) ui * vi,  |u| = sqrt(Σ(i=1..n) ui^2),  |v| = sqrt(Σ(i=1..n) vi^2)
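Both measures in Python, with vectors as equal-length lists of term weights (a sketch; assumes non-zero vectors):

```python
import math

def dot(u, v):
    """u . v = sum over i of u_i * v_i"""
    return sum(ui * vi for ui, vi in zip(u, v))

def cosine(u, v):
    """cos(u, v) = u . v / (|u| |v|); assumes neither vector is all zeros."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
```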

Page 7: Web-based Information Architectures

Information Retrieval

• Basic assumption: shared words between the query and a document signal relevance

• Similarity measures
  – Dot product
  – Cosine similarity (normalized)

Page 8: Web-based Information Architectures

Evaluation

• Recall = a/(a+c)

• Precision = a/(a+b)

• F1 = 2.0 * recall * precision / (recall + precision)

  (a = relevant documents retrieved, b = non-relevant documents retrieved, c = relevant documents not retrieved)

• Accuracy: bad for IR, because relevant documents are rare; a system that retrieves nothing still scores near-perfect accuracy.
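The three measures as a direct transcription (a, b, c are the contingency counts defined above):

```python
def evaluate(a, b, c):
    """a: relevant retrieved, b: non-relevant retrieved, c: relevant missed."""
    recall = a / (a + c)
    precision = a / (a + b)
    f1 = 2.0 * recall * precision / (recall + precision)
    return recall, precision, f1
```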

Page 9: Web-based Information Architectures

Refinement of VSM

• Query expansion

• Relevance Feedback
  – Rocchio Formula: … (standard form sketched below)
  – Alpha, beta, gamma and their meanings: α weights the original query, β the centroid of the relevant documents, γ the centroid of the non-relevant documents.
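The slide leaves the formula elided; the textbook form is Q' = α·Q + β·centroid(relevant docs) - γ·centroid(non-relevant docs). A sketch with dense vectors (the default constants are common choices, not values from the lecture):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant docs and away from non-relevant ones.
    Assumes all vectors share the same dimensionality and both doc lists
    are non-empty; the constants are common defaults, not from the slide."""
    n = len(query)
    rel = [sum(d[i] for d in relevant) / len(relevant) for i in range(n)]
    non = [sum(d[i] for d in nonrelevant) / len(nonrelevant) for i in range(n)]
    return [alpha * query[i] + beta * rel[i] - gamma * non[i] for i in range(n)]
```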

Page 10: Web-based Information Architectures

Generalized Vector Space Model

• Given a collection of training data, represent each term as an n-dimensional vector:

        D1    D2    …    Dj    …    Dn
  T1    w11   w12   …    w1j   …    w1n
  T2    w21   w22   …    w2j   …    w2n
  …     …     …     …    …     …    …
  Ti    wi1   wi2   …    wij   …    win
  …     …     …     …    …     …    …
  Tm    wm1   wm2   …    wmj   …    wmn

Page 11: Web-based Information Architectures

GVSM (2)

• Define the similarity between terms ti and tj:

  Sim(ti, tj) = cos(ti, tj)

• Similarity between query and document is based on term-term similarity:

  – For each query term qi, find the term tD in the document D that is most similar to qi. This value, viD, can be considered the similarity between the single-word query qi and the document D.

  – Sum up the similarities between each query term and the document D. This sum is considered the similarity between the query and the document D.

Page 12: Web-based Information Architectures

GVSM (3)

Sim(Q, D) = Σi [Maxj sim(qi, dj)]

or, normalizing for document & query length:

Simnorm(Q, D) = Σi [Maxj sim(qi, dj)] / (|Q| |D|)
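A sketch of this similarity, reusing the cosine helper from the earlier sketch; term_vectors maps each term to its row of the term-document matrix from the previous slide (the names are illustrative):

```python
def gvsm_sim(query_terms, doc_terms, term_vectors):
    """Sim(Q, D): for each query term, take its best cosine match among the
    document's terms, then sum over the query terms."""
    return sum(
        max(cosine(term_vectors[q], term_vectors[t]) for t in doc_terms)
        for q in query_terms
    )
```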

Page 13: Web-based Information Architectures

Maximal Marginal Relevance

• Redundancy reduction

• Favoring more novel content

• Formula:

  MMR(Q, C, R) = Argmax(di in C) [ λ S(Q, di) - (1-λ) Max(dj in R) S(di, dj) ]

  where Q is the query, C the candidate set, R the already-selected set, and λ trades off relevance against novelty.
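One MMR selection step as a sketch; sim is any similarity function (e.g. the cosine above), and the function and parameter names are illustrative:

```python
def mmr_select(query, candidates, selected, sim, lam=0.7):
    """Return the candidate balancing relevance to the query against
    redundancy with the already-selected set (lambda as in the example)."""
    def mmr_score(d):
        redundancy = max((sim(d, s) for s in selected), default=0.0)
        return lam * sim(query, d) - (1 - lam) * redundancy
    return max(candidates, key=mmr_score)
```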

Page 14: Web-based Information Architectures

MMR Example (Summarization)

[Figure: a full text of sentences S1–S6 and a query; MMR selects the summary {S1, S3, S4}.]

Page 15: Web-based Information Architectures

MMR Example (Summarization): select first sentence, λ = 0.7

Sim(Q, S) = Q · S / (|Q| |S|)

[Figure: query-sentence similarities for S1–S6 (0.4, 0.3, 0.6, 0.2, 0.2, 0.3); S3, with the highest similarity (0.6), is selected first.]

Page 16: Web-based Information Architectures

MMR Example (Summarization): select second sentence

[Figure: with S3 in the summary, the remaining sentences are rescored (values shown: 0.15, 0.1, 0.2, 0.5, 0.5); S1 is selected next.]

Page 17: Web-based Information Architectures

MMR Example (Summarization): select third sentence

[Figure: with S3 and S1 in the summary, the remaining sentences are rescored (values shown: 0.2, 0.1, 0.4, 0.6); S4 is selected, completing the summary {S1, S3, S4}.]

Page 18: Web-based Information Architectures

Text Categorization

Task

• You want to classify a document into some categories automatically. For example, the categories are "weather" and "sport".

• To do that, you can use the kNN algorithm.

• To use kNN, you need a collection of documents, each of them labeled with categories by a human.

Page 19: Web-based Information Architectures

Text Categorization

Procedure

• Using VSM, represent each document in the training data.

• Using VSM, represent the document to be categorized (the new document).

• Use cosine similarity (or some other measure, but cosine works well here; why?) to find the top k documents (the k nearest neighbors) in the training data that are most similar to the new document.

• Decide from the k nearest neighbors what the categories for the new document are (a sketch follows).
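A sketch of the procedure, reusing the cosine helper from the similarity-measure sketch; training pairs each document vector with its human-assigned categories (k = 3 is an arbitrary illustrative choice):

```python
from collections import Counter

def knn_categorize(new_vec, training, k=3):
    """training: list of (doc_vector, categories) pairs labeled by humans.
    Vote among the k training documents most cosine-similar to new_vec."""
    neighbors = sorted(training, key=lambda pair: cosine(new_vec, pair[0]),
                       reverse=True)[:k]
    votes = Counter(c for _, cats in neighbors for c in cats)
    return [c for c, _ in votes.most_common()]  # categories ranked by votes
```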

Page 20: Web-based Information Architectures

Web Spider

• The web graph at any instant of time contains k-connected subgraphs

• The spider algorithm given in class is a depth-first search through a web subgraph

• Avoid re-spidering the same page

• Completeness is not guaranteed. A partial solution is to pick seed URLs that are as diverse as possible.

Page 21: Web-based Information Architectures

Web Spider

PROCEDURE SPIDER4(G, {SEEDS})
    Initialize COLLECTION <big file of URL-page pairs>
    Initialize VISITED <big hash-table>
    For every ROOT in SEEDS
        Initialize STACK <stack data structure>
        Let STACK := push(ROOT, STACK)
        While STACK is not empty,
            Do URLcurr := pop(STACK)
            Until URLcurr is not in VISITED
            insert-hash(URLcurr, VISITED)
            PAGE := look-up(URLcurr)
            STORE(<URLcurr, PAGE>, COLLECTION)
            For every URLi in PAGE,
                push(URLi, STACK)
    Return COLLECTION
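A runnable Python rendering of SPIDER4 under simplifying assumptions: fetch_page and extract_urls are hypothetical stand-ins for look-up and link extraction, and the visited check doubles as the pop-until-unvisited loop:

```python
def spider4(seeds, fetch_page, extract_urls):
    """Depth-first crawl from each seed, never re-spidering a visited URL."""
    collection = {}   # URL -> page (the big file of URL-page pairs)
    visited = set()   # the big hash-table
    for root in seeds:
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:          # pop until an unvisited URL
                continue
            visited.add(url)            # insert-hash(URLcurr, VISITED)
            page = fetch_page(url)      # look-up(URLcurr)
            collection[url] = page      # STORE(<URLcurr, PAGE>, COLLECTION)
            stack.extend(extract_urls(page))  # push every URLi in PAGE
    return collection
```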

Page 22: Web-based Information Architectures

Text Mining

Components of Text Mining

• Categorization by topic or genre

• Fact extraction from text

• Data Mining from DBs or extracted facts

Page 23: Web-based Information Architectures

Fact extraction from text

• Named Entity Identification

FSA/FST, HMM

• Role-Situated Named Entities

Apply context information

• Information Extraction

Template matching

Page 24: Web-based Information Architectures

Named Entity Identification

Definition of a Finite State Acceptor (FSA)
• Takes an input source (e.g. a string of words)
• Outputs "YES" or "NO"

Definition of a Finite State Transducer (FST)
• An FSA with variable binding
• Outputs "NO", or "YES" + variable bindings
• Variable bindings encode the recognized entity,
  e.g. "YES <firstname Hideto> <lastname Suzuki>"

Page 25: Web-based Information Architectures

Named Entity Identification

Example. Identify numbers:

1, 2.0, -3.22, +3e2, 4e-5

D = {0,1,2,3,4,5,6,7,8,9}

[Figure: FSA state diagram: from Start, an optional + or - sign, one or more digits D, an optional "." followed by digits, and an optional "e" with its own optional sign and digits.]
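One way to realize this acceptor in Python is a regular expression equivalent to the diagram (a sketch; it covers exactly the example forms above):

```python
import re

# optional sign, digits, optional ".digits", optional exponent with optional sign
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?([eE][+-]?\d+)?$")

for s in ["1", "2.0", "-3.22", "+3e2", "4e-5", "abc"]:
    print(s, "YES" if NUMBER.match(s) else "NO")  # the FSA's YES/NO output
```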

Page 26: Web-based Information Architectures

Data Mining

• Learning by caching
  – What/when to cache
  – When to use/invalidate/update the cache

• Learning from Examples (a.k.a. "supervised" learning)
  – Labeled examples for training
  – Learn the mapping from examples to labels
  – E.g.: Naive Bayes, Decision Trees, ...
  – Text Categorization (using kNN or other means) is a learning-from-examples task

Page 27: Web-based Information Architectures

Data Mining

• "Speedup" Learning
  – Tuning search heuristics from experience
  – Inducing explicit control knowledge
  – Analogical learning (generalized instances)

• Optimization ("policy" learning)
  – Predicting a continuous objective function
  – E.g. regression, reinforcement learning, ...

• New Pattern Discovery (a.k.a. "unsupervised" learning)
  – Finding meaningful correlations in data
  – E.g. association rules, clustering, ...

Page 28: Web-based Information Architectures

Generalize vs. Specialize

• Generalize:

  First, each record in your database is a RULE.

  Then, generalize (how? when to stop?)

• Specialize:

  First, give a very general rule (almost useless).

  Then, specialize (how? when to stop?)

Page 29: Web-based Information Architectures

Methods for Supervised DM

Classifiers
• Linear Separators (regression)
• Naive Bayes (NB)
• Decision Trees (DTs)
• k-Nearest Neighbor (kNN)
• Decision rule induction
• Support Vector Machines (SVMs)
• Neural Networks (NNs)
• ...