TDM: Information Retrieval
- 1. Text Data Mining, Part I: Information Retrieval (IR)
- 2. Text Mining Applications. Information Retrieval: query-based search of large text archives, e.g., the Web. Text Classification: automated assignment of topics to Web pages (e.g., Yahoo, Google); automated classification of email into spam and non-spam. Text Clustering: automated organization of search results in real time into categories; discovering clusters and trends in technical literature (e.g., CiteSeer). Information Extraction: extracting standard fields from free text; extracting names and places from reports and newspapers.
- 3. Information Retrieval: Definition. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). IR deals with the representation, storage, and organization of, and access to, information items. The general objective of modern IR: minimize the overhead of a user locating needed information.
- 4. Information Retrieval Is Not Database. Information retrieval: processes stored documents; searches for documents relevant to user queries; no standard for how queries should be formed; query results are permissive of errors or inaccurate items. Database: normally no processing of data; searches for records matching queries; a standard query language (SQL); query results should have 100% accuracy, with zero tolerance for errors.
- 5. Information Retrieval Is Not Data Mining. Information retrieval: the user's target is existing relevant data entries. Data mining: the user's target is knowledge (rules, etc.) implied by the data, not the individual data entries themselves. Still, many techniques and models are shared and related, e.g., classification of documents.
- 6. Is Information Retrieval a Form of Text Mining? What is the principal computer specialty for processing documents and text? Information Retrieval (IR). The task of IR is to retrieve relevant documents in response to a query. The fundamental technique of IR is measuring similarity: a query is examined and transformed into a vector of values to be compared with stored documents.
- 7. Is Information Retrieval a Form of Text Mining? In the prediction problem, similar documents are retrieved and their properties measured, i.e., the occurrences of each class label are counted to see which label should be assigned to a new document. The objectives of prediction can thus be posed in the form of an IR model in which documents relevant to a query are retrieved, where the query is the new document.
- 8. Key steps in information retrieval: specify query; search document collection; return subset of relevant documents. Key steps in predictive text mining: examine document collection; learn classification criteria; apply criteria to new documents.
- 9. Key steps in IR when predicting from retrieved documents: specify query vector; match against document collection; get subset of relevant documents; examine document properties, using simple criteria such as document labels.
- 10. Information Retrieval (IR). Conceptually, IR is the study of finding needed information, i.e., IR helps users find information that matches their information needs, expressed as queries. Historically, IR is about document retrieval, emphasizing the document as the basic unit: finding documents relevant to user queries. Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
- 11. Information Retrieval Cycle. [Diagram: source selection (resource) → query formulation (query) → search (results) → selection (documents) → examination (information) → delivery, with feedback loops for source reselection and for system, vocabulary, concept, and document discovery.]
- 12. Abstract IR Architecture. [Diagram: offline, documents pass through a representation function into document representations stored in an index; online, a query passes through a representation function into a query representation; a comparison function matches the two against the index and returns hits.]
- 13. IR Architecture
- 14. IR Queries: keyword queries; Boolean queries (using AND, OR, NOT); phrase queries; proximity queries; full document queries; natural language questions.
- 15. Information Retrieval Models. An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined. Main models: Boolean model, vector space model, statistical language model, etc.
- 16. Elements in Information Retrieval: processing of documents; acceptance and processing of queries from users; modelling, searching, and ranking of documents; presenting the search result.
- 17. Process of Retrieving Information
- 18. Document Processing. Removing stopwords: words that appear frequently but carry little meaning, e.g., the, of. Stemming: recognizing different words with the same grammatical root. Noun groups: common combinations of words. Indexing: for fast locating of documents.
- 19. Processing Queries. Define a language for queries: syntax, operators, etc. Modify the queries for better search: ignore meaningless parts (punctuation, conjunctions, etc.); append synonyms, e.g., e-business / e-commerce. Emerging technology: natural language queries.
- 20. Modelling/Ranking of Documents. Model the relevance (usefulness) of documents against the user query Q. The model represents a function Rel(Q, D), where D is a document, Q is a user query, and Rel(Q, D) is the relevance of document D to query Q. There are many models available: algebraic models, probabilistic models, set-theoretic models.
- 21. Basic Vector Space Model. Define a set of words and phrases as terms; text is represented by a vector of terms; the user query is converted to a vector, too; measure the vector distance between a document vector and the query vector. Example term set: (business, computer, PowerPoint, presentation, user, web). The document "We are doing an e-business presentation in PowerPoint." becomes (1,0,1,1,0,0); the query "computer presentation" becomes (0,1,0,1,0,0). Distance = sqrt((1-0)^2 + (0-1)^2 + (1-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2) = sqrt(3).
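The slide's distance computation, written out directly (reusing the same document and query vectors over the six-term set):

```python
import math

# Term set: (business, computer, PowerPoint, presentation, user, web)
doc   = (1, 0, 1, 1, 0, 0)   # "We are doing an e-business presentation in PowerPoint."
query = (0, 1, 0, 1, 0, 0)   # "computer presentation"

# Euclidean distance between the document vector and the query vector.
distance = math.sqrt(sum((d - q) ** 2 for d, q in zip(doc, query)))
print(distance)  # sqrt(3) ~ 1.732
```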
- 22. Probabilistic Models Overview. Ranking: the probability that a document is relevant to a query, often denoted Pr(R|D,Q). In the actual measure, the log-odds transformation is used: log [ Pr(R|D,Q) / Pr(not-R|D,Q) ]. Probability values are estimated in applications.
- 23. Information Retrieval. Given: a source of textual documents and a well-defined, limited query (text based). Find: sentences with relevant information; extract the relevant information and ignore non-relevant information (important!); link related information and output it in a predetermined format. Examples: news stories, e-mails, web pages, photographs, music, statistical data, biomedical data, etc. Information items can be in the form of text, image, video, audio, numbers, etc.
- 24. Information Retrieval. Two basic information retrieval (IR) processes. Browsing or navigation system: the user skims the document collection by jumping from one document to another via hypertext or hypermedia links until a relevant document is found. Classical IR system, e.g., a question answering system: the query is a question in natural language; the answer is directly extracted from the text of the document collection. Text-based information retrieval: the information item (document) is in text format (written/spoken) or has a textual description; the information need is the query.
- 25. Classical IR System Process
- 26. General Concepts in IR. Representation language: typically a vector of d attribute values, e.g., a set of color, intensity, and texture features characterizing images, or word counts for text documents. Data set D of N objects: typically represented as an N x d matrix. Query Q: the user poses a query to search D; the query is typically expressed in the same representation language as the data, e.g., each text document is a set of words that occur in the document.
- 27. Query by Content. Traditional DB query: exact matches, e.g., the query Q = [level = MANAGER] AND [age < 30], or a Boolean match on text, e.g., the query Irvine AND fun returns all docs containing Irvine and fun. Not useful when there are many matches: e.g., "data mining" in Google returns 60 million documents. Query-by-content query: more general / less precise, e.g., what record is most similar to a query Q? For text data this is often called information retrieval (IR); it can also be used for images, sequences, video, etc. Q can itself be an object (e.g., a document) or a shorter version (e.g., one word). Goal: match query Q to the N objects in the database and return a ranked list (typically) of the most similar/relevant objects in the data set D given Q.
- 28. Issues in Query by Content: what representation language to use; how to measure similarity between Q and each object in D; how to compute the results in real time (for interactive querying); how to rank the results for the user; allowing user feedback (query modification); how to evaluate and compare different IR algorithms/systems.
- 29. The Standard Approach. Fixed-length (d-dimensional) vector representation for the query (1-by-d Q) and database (N-by-d X) objects. Use domain-specific higher-level features (vs. raw data): for images, a bag of features such as color (e.g., RGB) and texture (e.g., Gabor or Fourier coefficients); for text, a bag of words, i.e., a frequency count for each word in each document, also known as the vector-space model. Compute distances between the vectorized representations, and use k-NN to find the k vectors in X closest to Q.
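The standard approach can be sketched end to end; the vocabulary, documents, and choice of Euclidean distance here are illustrative assumptions:

```python
# Vectorize documents and query into fixed-length bag-of-words counts,
# then use k-NN to find the k documents closest to the query.
vocab = ["data", "mining", "text", "retrieval"]

def vectorize(words):
    return [words.count(t) for t in vocab]

X = [vectorize(d.split()) for d in
     ["data mining", "text mining", "information retrieval retrieval"]]
Q = vectorize("data mining mining".split())

def dist(u, v):
    # Euclidean distance between two count vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

k = 2
nearest = sorted(range(len(X)), key=lambda i: dist(X[i], Q))[:k]
print(nearest)  # indices of the 2 most similar documents: [0, 1]
```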
- 30. Text Retrieval. Document: book, paper, WWW page, ... Term: word, word pair, phrase (often 50,000+). Query Q = set of terms, e.g., data + mining. NLP (natural language processing) is too hard, so we want a (vector) representation for text that retains maximum useful semantics and supports efficient distance computation between docs and Q. Term weights: Boolean (e.g., term in document or not; bag of words) or real-valued (e.g., frequency of term in doc, possibly relative to all docs), ...
- 31. Practical Issues. Tokenization: convert each document to word counts; a word token is any nonempty sequence of characters; for HTML (etc.) formatting needs to be removed. Canonical forms, stopwords, stemming: remove capitalization; remove very frequent stopwords (a, the, and; a standard list can be used); very rare words can also be removed; stemming (next slide). Data representation: e.g., a 3-column list, or an inverted index (faster), a list of sorted pairs useful for finding docs containing certain terms, equivalent to a sparse representation of the term x doc matrix.
- 32. Intelligent Information Retrieval. Meaning of words: synonyms (buy / purchase); ambiguity (bat: baseball vs. mammal). Order of words in the query: "hot dog stand in the amusement park" vs. "hot amusement stand in the dog park".
- 33. Keyword Search. The technical goal of prediction is to classify new, unseen documents. Prediction and IR are unified by the computation of document similarity. IR is based on traditional keyword search through a search engine, so we should recognize that using a search engine is a special instance of the prediction concept.
- 34. Keyword Search. We enter keywords into a search engine and expect relevant documents to be returned. These keywords are words in a dictionary created from the document collection and can be viewed as a small document. So, we want to measure how similar the new document (the query) is to the documents in the collection.
- 35. Keyword Search. So, the notion of similarity is reduced to finding documents with the same keywords as those posed to the search engine. But the objective of the search engine is to rank the documents, not to assign a label, so we need additional techniques to break the expected ties (all retrieved documents match the search criteria).
- 36. Keyword Search. In full-text retrieval, all the words in each document are considered to be keywords; we use the word term to refer to the words in a document. Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not; ands are implicit, even if not explicitly specified. Ranking of documents on the basis of estimated relevance to a query is critical. Relevance ranking is based on factors such as: term frequency (frequency of occurrence of the query keyword in the document); inverse document frequency (how many documents the query keyword occurs in; fewer documents give more importance to the keyword); hyperlinks to documents.
- 37. Relevance Ranking Using Terms. TF-IDF (term frequency / inverse document frequency) ranking. Let n(d) = number of terms in document d, n(d, t) = number of occurrences of term t in document d, and n(t) = number of documents containing term t. Relevance of a document d to a term t: TF(d, t) = log(1 + n(d, t) / n(d)). The log factor is to avoid excessive weight for frequent terms. Relevance of document d to query Q: r(d, Q) = sum over t in Q of TF(d, t) / n(t).
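The ranking formula above translates directly into code; the three toy documents and the query are invented for the example:

```python
import math

# Slide's formulas: TF(d, t) = log(1 + n(d, t) / n(d))
#                   r(d, Q)  = sum over t in Q of TF(d, t) / n(t)
docs = [
    "data mining finds patterns in data".split(),
    "text mining handles text".split(),
    "databases store data".split(),
]

def n_t(term):
    # n(t): number of documents containing the term.
    return sum(1 for d in docs if term in d)

def tf(d, term):
    # TF(d, t) with n(d, t) = count of term in d, n(d) = length of d.
    return math.log(1 + d.count(term) / len(d))

def relevance(d, query):
    return sum(tf(d, t) / n_t(t) for t in query if n_t(t) > 0)

query = ["data", "mining"]
scores = [relevance(d, query) for d in docs]
print(scores)  # the first document, matching both terms, scores highest
```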
- 38. Relevance Ranking Using Terms. Most systems add refinements to the above model: words that occur in the title, author list, section headings, etc. are given greater importance; words whose first occurrence is late in the document are given lower importance; very common words such as a, an, the, it, etc. (called stop words) are eliminated. Proximity: if the query keywords occur close together in the document, the document has higher importance than if they occur far apart. Documents are returned in decreasing order of relevance score; usually only the top few documents are returned, not all.
- 39. Similarity-Based Retrieval. Similarity-based retrieval: retrieve documents similar to a given document. Similarity may be defined on the basis of common words, e.g., find the k terms in A with the highest TF(d, t) / n(t) and use these terms to find the relevance of other documents. Relevance feedback: similarity can be used to refine the answer set to a keyword query; the user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these. Vector space model: define an n-dimensional space, where n is the number of words in the document set; the vector for document d goes from the origin to a point whose i-th coordinate is TF(d, t) / n(t). The cosine of the angle between the vectors of two documents is used as a measure of their similarity.
- 40. Relevance Using Hyperlinks. The number of documents relevant to a query can be enormous if only term frequencies are taken into account. Using term frequencies also makes spamming easy: e.g., a travel agency can add many occurrences of the word travel to its page to make its rank very high. Most of the time people are looking for pages from popular sites. Idea: use the popularity of a Web site (e.g., how many people visit it) to rank site pages that match the given keywords.
- 41. Relevance Using Hyperlinks. Solution: use the number of hyperlinks to a site as a measure of the popularity or prestige of the site, counting only one hyperlink from each site. The popularity measure is for the site, not for individual pages; but most hyperlinks are to the root of a site, and the concept of a site is difficult to define, since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity. Refinement: when computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige. The definition is circular: set up and solve a system of simultaneous linear equations.
- 42. Relevance Using Hyperlinks. There are connections to social networking theories that rank the prestige of people: e.g., the president of the U.S.A. has high prestige since many people know him, and someone known by multiple prestigious people has high prestige. Hub- and authority-based ranking: a hub is a page that stores links to many pages (on a topic); an authority is a page that contains actual information on a topic. Each page gets a hub prestige based on the prestige of the authorities it points to, and each page gets an authority prestige based on the prestige of the hubs that point to it. Again, the prestige definitions are cyclic and can be obtained by solving a system of simultaneous linear equations.
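The cyclic hub/authority definitions above can be resolved iteratively instead of by solving the linear system directly; this is a sketch in the style of Kleinberg's HITS algorithm, on an invented four-page link graph:

```python
# Iterate: authority(p) = sum of hub scores of pages linking to p,
#          hub(p)       = sum of authority scores of pages p links to,
# normalizing after each round until the scores settle.
links = {                      # page -> pages it links to (toy graph)
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    "auth1": [],
    "auth2": [],
}

pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    norm = sum(auth.values()) or 1.0
    auth = {p: v / norm for p, v in auth.items()}
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    norm = sum(hub.values()) or 1.0
    hub = {p: v / norm for p, v in hub.items()}

print(auth)  # the linked-to pages accumulate the authority mass
print(hub)   # the linking pages accumulate the hub mass
```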
- 43. Nearest-Neighbor Methods. A method that compares vectors and measures similarity. In prediction, the nearest-neighbor methods collect the k most similar documents and then look at their labels. In IR, they determine whether a satisfactory response to the search query has been found.
- 44. Measuring Similarity. These measures are used to examine how similar documents are; the output is a numerical measure of similarity. Three increasingly complex measures: shared word count; word count and bonus; cosine similarity.
- 45. Shared Word Count. Counts the words shared between documents. The words: in IR we have a global dictionary where all potential words are included, with the exception of stopwords; in prediction it is better to preselect the dictionary relative to the label.
- 46. Computing Similarity by Shared Words. Look at all words in the new document; for each document in the collection, count how many of these words appear. No weighting is used, just a simple count. The dictionary contains true keywords (weak words removed). The results of this measure are clearly intuitive: no one will question why a document was retrieved.
- 47. Computing Similarity by Shared Words. Each document is represented as a vector of keywords (zeros and ones). The similarity of two documents is the dot product of the two vectors: if both documents contain the same keyword, that word is counted (1*1). The performance of this measure depends mainly on the dictionary used.
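The shared-word count as a dot product of binary vectors can be shown in a few lines; the five-term vocabulary echoes the worked example on a later slide, and the document and query word sets are invented:

```python
# Shared-word similarity: a word contributes 1 only when present in both
# binary keyword vectors (1*1).
vocab = ["hardware", "software", "user", "information", "index"]

def to_vector(words):
    return [1 if term in words else 0 for term in vocab]

doc = to_vector({"hardware", "user", "index"})
query = to_vector({"hardware", "user", "information"})

shared = sum(d * q for d, q in zip(doc, query))
print(shared)  # 2 shared keywords: hardware and user
```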
- 48. Computing Similarity by Shared Words. Shared words is an exact search: a document is either retrieved or not. No weighting can be placed on terms: in the query "A and B", you can't specify that A is more important than B, and every retrieved document is treated equally.
- 49. Word Count and Bonus. TF (term frequency): number of times a term occurs in a document. DF (document frequency): number of documents that contain the term. IDF (inverse document frequency) = log(N/df), where N is the total number of documents. A vector is a numerical representation for a point in a multi-dimensional space (x1, x2, ..., xn); the dimensions of the space need to be defined, and a measure on the space needs to be defined.
- 50. Word Count and Bonus. Each indexing term is a dimension, and each document is a vector Di = (ti1, ti2, ti3, ti4, ..., tik). Document similarity is defined as the sum for j = 1 to K of w(j), where w(j) = 1 + 1/df(j) if word j occurs in both documents and 0 otherwise, and K is the number of words.
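A sketch of the word-count-and-bonus measure as defined above; the three-document collection and the query are invented for illustration:

```python
# Each word shared by both documents contributes 1 + 1/df(j),
# where df(j) is the number of documents in the collection containing word j.
collection = [
    {"hardware", "software", "index"},
    {"hardware", "user"},
    {"information"},
]

def df(term):
    return sum(1 for d in collection if term in d)

def bonus_similarity(doc, query):
    return sum(1 + 1 / df(t) for t in doc & query if df(t) > 0)

score = bonus_similarity(collection[1], {"hardware", "user", "information"})
print(score)  # hardware (df=2) and user (df=1) shared: (1+0.5) + (1+1.0) = 3.5
```

A word found in many documents gets a small bonus, so rarer shared words dominate the score, unlike the plain shared-word count.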
- 51. Word Count and Bonus. The bonus 1/df(j) is a variant of IDF: if the word occurs in many documents, the bonus is small. This measure is better than the shared word count because it discriminates between weakly and strongly predictive words.
- 52. Word Count and Bonus: Computing Similarity Scores with Bonus. A document space is defined by five terms: hardware, software, user, information, index; the query is hardware, user, information. [Table from slide: labeled spreadsheet of document vectors 10101, 11000, 00010, 10001, 00100, 01010, 11001, a new-document vector 1101, and similarity scores with bonus 2.83, 1.33, 0, 1.33, 1.5, 1.33, 2.67.]
- 53. Cosine Similarity: The Vector Space. A document is represented as a vector (W1, W2, ..., Wn). Binary: Wi = 1 if the corresponding term is in the document, Wi = 0 if it is not. TF (term frequency): Wi = tfi, where tfi is the number of times the term occurred in the document. TF*IDF (inverse document frequency): Wi = tfi * idfi = tfi * (1 + log(N/dfi)), where dfi is the number of documents containing term i and N is the total number of documents in the collection.
- 54. Cosine Similarity: The Vector Space. vec(D) = (w1, w2, ..., wt). Sim(d1, d2) = cos(theta) = [vec(d1) . vec(d2)] / (|d1| * |d2|) = [sum over j of wd1(j) * wd2(j)] / (|d1| * |d2|). Since w(j) > 0 whenever j is in di, we have 0 <= Sim(d1, d2) <= 1.
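The cosine formula above in code; the two four-dimensional weight vectors are invented for the example:

```python
import math

# Sim(d1, d2) = (vec(d1) . vec(d2)) / (|d1| * |d2|),
# which lies in [0, 1] for nonnegative term weights.
def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

d1 = [1, 2, 0, 1]  # e.g. TF weights over a 4-term vocabulary
d2 = [1, 0, 0, 1]
print(cosine(d1, d2))  # 2 / (sqrt(6) * sqrt(2)) ~ 0.577
```

Because the angle, not the length, of the vectors is compared, a long document and a short document about the same topic can still score as highly similar.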