©2012 Paula Matuszek
CSC 9010: Text Mining Applications:
Information Retrieval
Dr. Paula Matuszek
(610) 647-9789
Knowledge
Knowledge is captured in large quantities and many forms
Much of the knowledge is in unstructured text
– books, journals, papers
– letters
– web pages, blogs, tweets
A very old process
– accelerated greatly with the invention of the printing press
– and again with the invention of computers
– and again with the advent of the web
Thus the increasing importance of text mining!
A question!
People interact with all that information because they want to KNOW something; there is a question they are trying to answer or a piece of information they want. They have an information need.
Hopefully there is some information somewhere that will satisfy that need
At its most general, information retrieval is the process of finding the information that meets that need.
Basic Information Retrieval
Simplest approach:
– Knowledge is organized into chunks (pages)
– Goal is to return appropriate chunks
Not a new problem
But some new solutions!
– Web search engines
– Text mining includes this process also
– still dealing with lots of unstructured text
– finding the appropriate “chunk” can be viewed as a classification problem
Search Engines
Goal of a search engine is to return appropriate chunks
Steps involved include
– asking a question
– finding answers
– evaluating answers
– presenting answers
Value of a search engine depends on how well it does on all of these.
Asking a question
A query reflects some information need
Query syntax needs to allow the information need to be expressed
– Keywords
– Combining terms
– Simple: “required”, NOT (+ and -)
– Boolean expressions with and/or/not and nested parentheses (a small sketch follows this slide)
– Variations: strings, NEAR, capitalization
– Simplest syntax that works
– Typically more acceptable if predictable
Another set of problems when information isn’t text: graphics, music
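To make the Boolean option concrete, here is a minimal sketch in Python of how and/or/not keyword queries can be answered from an inverted index; the three-document corpus is an invented example, and index construction itself is covered later in these slides.

# Minimal sketch: answering Boolean keyword queries from a toy
# inverted index. The documents below are an invented example.
docs = {
    1: "text mining finds patterns in unstructured text",
    2: "information retrieval returns relevant documents",
    3: "search engines index web pages for retrieval",
}

index = {}                                   # word -> set of doc IDs
for doc_id, text in docs.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)

def AND(a, b): return index.get(a, set()) & index.get(b, set())
def OR(a, b):  return index.get(a, set()) | index.get(b, set())
def NOT(a):    return set(docs) - index.get(a, set())

print(AND("retrieval", "documents"))   # {2}
print(OR("text", "web"))               # {1, 3}
print(NOT("retrieval"))                # {1}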
Finding the Information
Goal is to retrieve all relevant chunks. Too time-consuming to do in real time, so search engines index pages.
Two basic approaches
– Index and classify by hand
– Automate
For BOTH approaches deciding what to index on (e.g., what is a keyword) is a significant issue.
– stemming?
– stopwords?
– capitalization?
Indexing by Hand
Indexing by hand involves having a person look at information items and assign them to categories.
– Assumes taxonomy of categories exists
– Each document can go into multiple categories
– Creates high quality indices
– Expensive to create
– Supports hierarchical browsing for retrieval as well as search
Inter-rater reliability is an issue; requires training and checking to get consistent category assignment
Indexing By Hand
For focused collections, even very large ones, feasible
– Medline
– ACM papers
– NY Times hand-indexes all abstracts
For the web as a whole, not yet feasible
Evolving solutions
– social bookmarking: delicious, reddit, digg
– Hash tags: twitter, Google+
– Sometimes pre-structured. More often completely freeform
– In some domains becomes a folksonomy.
Automated Indexing
Automated indexing involves parsing documents to pull out key words and creating a table which links keywords to documents
– Doesn’t have any predefined categories or keywords
– Can cover a much higher proportion of the information available
– Can update more quickly
– Much lower quality, therefore important to have some kind of relevance ranking
Or IE-Based
Good information extraction tools can be used to extract the important terms
– Using gazetteers and ontologies to identify terms
– Using named entity and other rules to assign categories
I2E OnDemand is a good example
Automating Search
Always involves balancing factors:
– Recall, Precision
– Which is more important varies with query and with coverage
– Speed, storage, completeness, timeliness
– Query response needs to be fast
– Documents searched need to be current
– Ease of use vs power of queries
– Full Boolean queries very rich, very confusing.
– Simplest is “and”ing together keywords; fast, straightforward.
Search Engine Basics
A spider or crawler starts at a web page, identifies all links on it, and follows them to new web pages.
A parser processes each web page and extracts individual words.
An indexer creates/updates a hash table which connects words with documents
A searcher uses the hash table to retrieve documents based on words
A ranking system decides the order in which to present the documents: their relevance
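As a concrete illustration of the parser step, here is a minimal sketch using only the Python standard library; the class name and example page are my own invention, not from the slides. The links it collects would feed the spider, and the words would feed the indexer.

# Minimal sketch of the parsing step, standard library only.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkAndWordExtractor(HTMLParser):
    """Collects outgoing links (for the spider) and words (for the indexer)."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []   # URLs the crawler should visit next
        self.words = []   # tokens the indexer should record for this page

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        self.words.extend(data.lower().split())

page = '<html><body><p>Text mining and retrieval.</p><a href="/next">more</a></body></html>'
parser = LinkAndWordExtractor("http://example.com/")
parser.feed(page)
print(parser.links)   # ['http://example.com/next']
print(parser.words)   # ['text', 'mining', 'and', 'retrieval.', 'more']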
Selecting Relevant Documents
Assume:
– we already have a corpus of documents defined.
– goal is to return a subset of those documents.
– Individual documents have been separated into individual files
Remaining components must parse, index, find, and rank documents.
Traditional approach is based on the words in the documents (predates the web)
Extracting Lexical Features
Process a string of characters
– assemble characters into tokens (tokenizer)
– choose tokens to index
In place (problem for www)
Standard lexical analysis problem
Lexical Analyser Generator, such as lex
Tokenizers such as the NLTK and GATE tokenizers
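A minimal tokenizing sketch with NLTK (GATE's tokenizer is a Java component, so it is not shown); this assumes NLTK is installed and the 'punkt' tokenizer models have been downloaded.

# Minimal tokenizing sketch with NLTK. Assumes nltk is installed and
# the 'punkt' models are available (nltk.download('punkt')).
from nltk.tokenize import word_tokenize

text = "Text mining applications: Information Retrieval."
tokens = word_tokenize(text)
print(tokens)
# ['Text', 'mining', 'applications', ':', 'Information', 'Retrieval', '.']

# Choosing which tokens to index, e.g. only alphabetic tokens, lowercased:
index_terms = [t.lower() for t in tokens if t.isalpha()]
print(index_terms)
# ['text', 'mining', 'applications', 'information', 'retrieval']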
Lexical Analyser
Basic idea is a finite state machine
Triples of input state, transition token, output state
Must be very efficient; gets used a LOT
[Diagram: a three-state tokenizing machine with states 0, 1, and 2, and transitions labeled “blank”, “A-Z”, “A-Z”, and “blank, EOF”]
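A minimal Python sketch of that machine, under the assumption (suggested by the transition labels) that state 0 skips blanks, a letter starts a token and moves to state 1, and a blank or end of input ends the token.

# Minimal sketch of the three-state tokenizer suggested by the diagram:
# state 0 skips blanks, state 1 accumulates letters, blank/EOF ends a token.
def tokenize(text):
    tokens, current, state = [], [], 0
    for ch in text + " ":            # trailing blank stands in for EOF
        if state == 0:
            if ch.isalpha():         # "A-Z" transition: start a token
                current.append(ch)
                state = 1
            # else stay in state 0 on blanks/other characters
        elif state == 1:
            if ch.isalpha():         # keep accumulating letters
                current.append(ch)
            else:                    # blank/EOF: emit token, back to state 0
                tokens.append("".join(current))
                current, state = [], 0
    return tokens

print(tokenize("Text  mining applications"))
# ['Text', 'mining', 'applications']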
Design Issues for Lexical Analyser
Punctuation
– treat as whitespace?
– treat as characters?
– treat specially?
Case
– fold?
Digits
– assemble into numbers?
– treat as characters?
– treat as punctuation?
Lexical Analyser
Output of lexical analyser is a string of tokens
Remaining operations are all on these tokens
We have already thrown away some information; this makes search more efficient, but limits its power
– can’t distinguish “VITA” from “Vita”
– Can be somewhat remedied at the “relevance” step
Stemming
Additional processing at the token level
Turn words into a canonical form:
– “cars” into “car”
– “children” into “child”
– “walked” into “walk”
Decreases the total number of different tokens to be processed
Decreases the precision of a search, but increases its recall
NLTK, GATE, other stemmers
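A minimal stemming sketch with NLTK's Porter stemmer. Note that a rule-based stemmer handles regular forms such as "cars" and "walked"; an irregular form like "children" is left unchanged and would need a lemmatizer instead.

# Minimal stemming sketch with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cars", "walked", "walking", "children"]:
    print(word, "->", stemmer.stem(word))
# cars -> car
# walked -> walk
# walking -> walk
# children -> children   (irregular plural: needs a lemmatizer, not a stemmer)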
Noise Words (Stop Words)
Function words that contribute little or nothing to meaning
Very frequent words
– If a word occurs in every document, it is not useful in choosing among documents
– However, need to be careful, because this is corpus-dependent
Often implemented as a discrete list
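A minimal stop-word filtering sketch using NLTK's built-in English list (one example of such a discrete list); assumes the 'stopwords' corpus has been downloaded.

# Minimal stop word filtering sketch. Assumes nltk.download('stopwords')
# has been run once to fetch the word lists.
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
tokens = ["the", "index", "links", "words", "to", "the", "documents"]
content = [t for t in tokens if t not in stops]
print(content)   # ['index', 'links', 'words', 'documents']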
Example Corpora
We are assuming a fixed corpus. Some sample corpora:
– Medline Abstracts
– Email. Anyone’s email.
– Reuters corpus
– Brown corpus
Textual fields, structured attributes
– Textual: free, unformatted, no meta-information
– Structured: additional information beyond the content
Textual Fields for Medline
Abstract
– Reasonably complete standard academic English
– Capturing the basic meaning of document
Title
– Short, formalized
– Captures most critical part of meaning
– Proxy for abstract
Structured Fields for Email
To, From, Cc, Bcc
Dates
Content type
Status
Content length
Subject (partially)
Text fields for Email
Subject
– Format is structured, content is arbitrary.
– Captures most critical part of content.
– Proxy for content -- but may be inaccurate.
Body of email
– Highly irregular, informal English.
– Entire document, not summary.
– Spelling and grammar irregularities.
– Structure and length vary.
Indexing
We have a tokenized, stemmed sequence of words
Next step is to parse document, extracting index terms
– Assume that each token is a word and we don’t want to recognize any more complex structures than single words.
When all documents are processed, create index
Basic Indexing Algorithm
For each document in the corpus
– get the next token
– save the posting in a list
– doc ID, frequency
For each token found in the corpus
– calculate #docs, total frequency
– sort by frequency
– This is the inverted index (see the sketch below)
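A minimal Python sketch of this algorithm over an invented toy corpus; a production index would also record positions and live on disk rather than in an in-memory dictionary.

# Minimal inverted-index sketch: maps each token to postings of
# (doc ID, frequency). The toy corpus is an invented example.
from collections import Counter, defaultdict

corpus = {
    1: "text mining text applications",
    2: "information retrieval and text search",
    3: "search engines index pages",
}

inverted = defaultdict(list)          # token -> [(doc_id, freq), ...]
for doc_id, text in corpus.items():
    for token, freq in Counter(text.split()).items():
        inverted[token].append((doc_id, freq))

# Postings for each token, sorted by frequency (most frequent first)
for token, postings in sorted(inverted.items()):
    postings.sort(key=lambda p: p[1], reverse=True)
    print(token, postings)
# e.g. text [(1, 2), (2, 1)]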
Fine Points
Dynamic corpora: require incremental algorithms
Higher-resolution data (e.g., char position)
Giving extra weight to proxy text (typically by doubling or tripling frequency count)
Document-type-specific processing
– In HTML, want to ignore tags
– In email, maybe want to ignore quoted material
Choosing Keywords
Don’t necessarily want to index on every word
– Takes more space for index
– Takes more processing time
– May not improve our resolving power
How do we choose keywords?
– Manually
– Statistically
Exhaustivity vs specificity
Manually Choosing Keywords
Unconstrained vocabulary: allow creator of document to choose whatever he/she wants
– “best” match
– captures new terms easily
– easiest for person choosing keywords
Constrained vocabulary: hand-crafted ontologies
– can include hierarchical and other relations
– more consistent
– easier for searching; possible “magic bullet” search
Examples of Constrained Vocabularies
ACM Computing Classification System (www.acm.org/class/1998)
– H: Information Systems
  H3: Information Storage and Retrieval
  H3.3: Information Search and Retrieval
  » Clustering
  » Information Filtering
  » Query formulation
  » Relevance feedback
  » etc.
Medline Headings (www.nlm.nih.gov/mesh/MBrowser.html)
– L: Information Science
  L01: Information Science
  L01.700: Medical Informatics
  L01.700.508: Medical Informatics Applications
  L01.700.508.280: Information Storage and Retrieval
  » MedlinePlus [L01.700.508.280.730]
Automated Vocabulary Selection
Frequency: Zipf’s Law.
– In a natural language corpus, the frequency of a word is inversely proportional to its rank in the frequency table (a quick check appears below).
Within one corpus, words with middle frequencies are typically “best”
– We have used this in NLTK classification, ignoring the most frequent terms in creating the BOW.
Document-oriented representation bias: lots of keywords/document
Query-oriented representation bias: only the “most typical” words. Assumes that we are comparing across documents.
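A minimal sketch of checking Zipf's Law on the Brown corpus (mentioned earlier as a sample corpus); if the law holds, rank times frequency is roughly constant. Assumes the 'brown' corpus has been downloaded via nltk.download('brown').

# Minimal Zipf's Law sketch on the Brown corpus. Under Zipf's Law,
# rank * frequency stays roughly constant across the table.
from collections import Counter
from nltk.corpus import brown

freqs = Counter(w.lower() for w in brown.words())
ranked = freqs.most_common()               # [(word, freq), ...] by frequency

for rank in (1, 10, 100, 1000):
    word, freq = ranked[rank - 1]
    print(rank, word, freq, rank * freq)   # last column roughly constant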
Choosing Keywords
“Best” depends on actual use; if a word only occurs in one document, may be very good for retrieving that document; not, however, very effective overall.
Words which have no resolving power within a corpus may be best choices across corpora
Keyword Choice for WWW
We don’t have a fixed corpus of documents
New terms appear fairly regularly, and are likely to be common search terms
Queries that people want to make are wide-ranging and unpredictable
Therefore: can’t limit keywords, except possibly to eliminate stop words.
Even stop words are language-dependent. So determine language first.
Comparing and Ranking Documents
Once our search engine has retrieved a set of documents, we may want to
Rank them by relevance
– Which are the best fit to my query?
– This involves determining what the query is about and how well the document answers it
Compare them
– Show me more like this.
– This involves determining what the document is about.
Determining Relevance by Keyword
The typical web query consists entirely of keywords.
Retrieval can be binary: present or absent
More sophisticated is to look for degree of relatedness: how much does this document reflect what the query is about?
Simple strategies (a scoring sketch follows this slide):
– How many times does the word occur in the document?
– How close to the head of the document?
– If multiple keywords, how close together?
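A minimal sketch combining the first two strategies, raw keyword count plus a bonus for keywords near the head of the document; the particular weights and head size are invented for illustration, not a standard formula.

# Minimal relevance-scoring sketch: keyword count plus a bonus for
# keywords near the head of the document. Weights are invented.
def score(document, keywords, head_size=50, head_bonus=2.0):
    tokens = [t.strip(".,;:!?") for t in document.lower().split()]
    head = set(tokens[:head_size])
    total = 0.0
    for kw in keywords:
        kw = kw.lower()
        total += tokens.count(kw)        # strategy 1: how many occurrences?
        if kw in head:
            total += head_bonus          # strategy 2: close to the head?
    return total

doc = "Information retrieval returns relevant documents. Retrieval uses an index."
print(score(doc, ["retrieval", "index"]))   # 7.0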
Keywords for Relevance Ranking
Count: repetition is an indication of emphasis
– Very fast (usually in the index)
– Reasonable heuristic
– Unduly influenced by document length
– Can be "stuffed" by web designers
Position: lead paragraphs summarize content
– Requires more computation
– Also a reasonable heuristic
– Less influenced by document length
– Harder to "stuff"; can only have a few keywords near the beginning
Keywords for Relevance Ranking
Proximity for multiple keywords
– Requires even more computation
– Obviously relevant only if we have multiple keywords
– Effectiveness of heuristic varies with information need; typically either excellent or not very helpful at all
– Very hard to "stuff"
All keyword methods
– Are computationally simple and adequately fast
– Are effective heuristics
– Typically perform as well as in-depth natural language methods for standard search
Comparing Documents
"Find me more like this one" really means that we are using the document as a query.
This requires that we have some conception of what a document is about overall.
Depends on context of query. We need to
– Characterize the entire content of this document
– Discriminate between this document and others in the corpus
This is basically a document classification problem.
Describing an Entire Document
So what is a document about?
TF*IDF: can simply list keywords in order of their TF*IDF values
Document is about all of them to some degree: it is at some point in some vector space of meaning
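A minimal TF*IDF sketch over an invented toy corpus, using the common tf × log(N/df) weighting (the slides do not fix a particular formula); it lists a document's keywords by descending TF*IDF value.

# Minimal TF*IDF sketch over a toy corpus (invented example). Uses the
# common variant tf * log(N / df); the slides don't fix a formula.
import math
from collections import Counter

corpus = {
    1: "text mining finds patterns in text",
    2: "information retrieval returns relevant documents",
    3: "text retrieval uses an inverted index",
}

N = len(corpus)
df = Counter()                                 # document frequency per term
for text in corpus.values():
    df.update(set(text.split()))

def tfidf(doc_id):
    tf = Counter(corpus[doc_id].split())
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

# Keywords of document 1, most characteristic first
for term, weight in sorted(tfidf(1).items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.2f}")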
Vector Space
Any corpus has a defined set of terms (the index)
These terms define a knowledge space
Every document is somewhere in that knowledge space -- it is or is not about each of those terms.
Consider each term as a vector. Then
– We have an n-dimensional vector space
– Where n is the number of terms (very large!)
– Each document is a point in that vector space
The document’s position in this vector space can be treated as what the document is about.
Similarity Between Documents
How similar are two documents?
– Measures of association
– How much do the feature sets overlap?
– Simple matching coefficient: takes exclusions into account
– Cosine similarity
– similarity of angle of the two document vectors
– not sensitive to vector length
Same basic similarity ideas as classification and clustering
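A minimal cosine-similarity sketch for documents represented as sparse term-to-weight dictionaries (for example the TF*IDF vectors above); missing terms count as zero, and the measure depends only on the angle between vectors, not their length.

# Minimal cosine-similarity sketch for sparse term->weight vectors.
import math

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

d1 = {"text": 2, "mining": 1}
d2 = {"text": 1, "retrieval": 1}
d3 = {"image": 1, "search": 1}
print(cosine(d1, d2))   # shares "text": about 0.63
print(cosine(d1, d3))   # no overlap: 0.0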
Additional Search Engine Issues
Freshness: how often to revisit documents
Eliminate duplicate documents
Eliminate multiple documents from one site
Provide good context
Non-content-based features: citation graphs (basis of PageRank)
Search Engine Optimization
Beyond Simple Search
Information Extraction on queries to recognize some common patterns
– Airline flights
– tracking #s
Rich IE systems like I2E
Taxonomy browsing
Beyond Unstructured Text
“Improved” search
Specific types of text
Non-text search constraints
Faceted Search
Searching non-text information assets
Personalizing search
Improved Search
Modern search engines tweak your query a lot. Google, for instance, says it will normally
– suggest spelling corrections and alternative spellings
– personalize your search by using information such as sites you’ve visited before
– include synonyms of your search terms
– find results that match similar terms to those in your query
– stem your search terms
(http://support.google.com/websearch/bin/answer.py?hl=en&p=g_verb&answer=1734130)
Domain of Documents
May be desirable to limit search to specific types of document
Google gives you, among other things
– news
– books
– blogs
– recipes
– patents
– discussions
May be based on the source
Or may be text mining (classification) at work :-)
Non-Text Constraints
We may know some things about documents that are not captured in the tokens:
Meta-information included in the document
– date created and modified
– author, department
– keywords or tags
Information that can be determined by examining the document
– language
– images included?
– reading level
Faceted Search
Faceted Search: constrain search along several attributes
Research topic in information retrieval especially for the last ten years
– Flamenco project at Berkeley (flamenco.berkeley.edu)
– CiteSeer project at Penn State (citeseer.ist.psu.edu)
Labeling with facets has same issues as hand-indexing
– except when you already have the information in a database somewhere
Has become popular primarily for online commerce.
Faceted Search Examples
Search Amazon for “rug”
– Generally applicable facets: Department, Amazon Prime Eligible, Average Customer Review, Price, Discount, Availability
– Object-specific facets: size, material, pattern, style, theme, color
Flamenco Fine Arts demo
– http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/famuseum/Flamenco
Searching non-Text Assets
Our information chunks might not be text
– images, sounds, videos, maps
Images, sounds, videos often based on proxy information: captions, tags
Information Extraction useful for maps
Still an active research area, typically at the intersection of information retrieval and machine learning. http://www.gwap.com/gwap/
Personalized Search
What searches you have done in the past tells a lot about your information needs
If a search engine has and makes use of that information, it can improve your searches
– web page history
– cookies
– explicit sign-in
Relevance can be influenced by pages you’ve visited, click-throughs
May go beyond to use information such as your blog posts, friends’ links, etc.
Can be a privacy issue
Summary
Information Retrieval is the process of finding and providing to the user chunks of information which are relevant to some information need
Where the chunks are text, free text search tools are the common approach
Text mining tools such as document classification and information extraction can improve the relevance of search results
This is not new with the web, but the web has had a massive impact on the area
It continues to evolve rapidly