Searching through the Internet
Dr. Eslam Al Maghayreh
Computer Science Department
Yarmouk University
Introduction
The Internet has an enormous quantity of information:
- billions of web pages
- thousands of newsgroups
Two questions face any information seeker:
(1) How can I find what I want?
(2) How can I know that what I find is any good?
Information Retrieval
Goal = find documents relevant to an information need from a large document set
[Figure: IR system diagram with the labels Info. need, Query, Document collection, Retrieval, and Answer list]
Search Engine
Consists of:
- the interface you use to type in a query
- an index of Web sites that the query is matched with
- a software program (called a spider or bot) that goes out on the Web and gets new sites for the index
IR problem
First applications: in libraries (1950s)
  ISBN: 0-201-12227-8
  Author: Salton, Gerard
  Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
  Publisher: Addison-Wesley
  Date: 1989
  Content: <Text>
External attributes and internal attribute (content)
- Search by external attributes = search in a DB
- IR: search by content
Possible approaches
1. String matching (linear search in documents)
   - Slow
2. Indexing
   - Fast
   - Flexible to further improvement
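The two approaches can be contrasted with a minimal sketch. The document ids, the texts, and the function names below are made up for illustration; a real system would also normalize and stem the words.

```python
from collections import defaultdict

# Toy collection; ids and texts are hypothetical.
DOCS = {
    "d1": "computer architecture and computer networks",
    "d2": "network security basics",
    "d3": "architecture of modern buildings",
}

def linear_search(docs, word):
    """String matching: scan every document for the word on each query."""
    return sorted(doc_id for doc_id, text in docs.items() if word in text.split())

def build_index(docs):
    """Indexing: map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def index_search(index, word):
    """Look the word up directly instead of scanning the documents."""
    return sorted(index.get(word, set()))

INDEX = build_index(DOCS)
print(linear_search(DOCS, "architecture"))
print(index_search(INDEX, "architecture"))
```

Both calls return the same documents. The difference is cost: the linear search reads the whole collection on every query, while the index is built once and then answers each query with a single lookup.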
[Figure: Documents and Query each pass through Indexing, giving a Document Representation (the Index) and a Query Representation; a Comparison Function matches the two and produces the Results]
Main problems in IR
Query evaluation (or retrieval process)
- To what extent does a document correspond to a query?
System evaluation
- How good is a system?
- Are the retrieved documents relevant? (precision)
- Are all the relevant documents retrieved? (recall)
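As a small worked example of the two evaluation measures, the sketch below computes precision and recall for one query. The document ids and the relevance judgments are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved documents / all retrieved documents
    recall    = relevant retrieved documents / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: the system returns four documents, three are truly relevant.
RETRIEVED = ["d1", "d2", "d3", "d4"]
RELEVANT = ["d1", "d3", "d5"]
P, R = precision_recall(RETRIEVED, RELEVANT)
print(f"precision={P:.2f} recall={R:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).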
Document indexing
Goal = find the important meanings and create an internal representation
Factors to consider:
- Accuracy to represent meanings (semantics)
- Exhaustiveness (cover all the contents)
- Facility for computer to manipulate
What is the best representation of contents?
- Word: good coverage, not precise
- Phrase: poor coverage, more precise
- Concept: poor coverage, precise
[Figure: Word, Phrase, and Concept placed on axes Coverage (Recall) and Accuracy (Precision)]
Keyword selection and weighting
How to select important keywords?
- Simple method: using middle-frequency words
- Search engines usually disregard minor words such as "the", "and", "to", etc.
[Figure: frequency and informativity (vertical axis, Min. to Max.) plotted against word rank (1, 2, 3, …)]
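The simple selection method can be sketched as follows. The stop-word list and the frequency thresholds (`min_count`, `max_count`) are assumptions chosen for this toy text, not standard values.

```python
from collections import Counter

# Illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "and", "to", "a", "of", "in", "is"}

def select_keywords(text, min_count=2, max_count=3):
    """Drop minor words, then keep only the middle-frequency words."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    return {w: c for w, c in counts.items() if min_count <= c <= max_count}

TEXT = (
    "the query is matched to the index and the index points to documents "
    "documents answer the query query query query"
)
KEYWORDS = select_keywords(TEXT)
print(KEYWORDS)
```

In this text "query" occurs five times and is rejected as too frequent, words occurring once are rejected as too rare, and "index" and "documents" (two occurrences each) are kept.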
Result of indexing
Each document is represented by a set of weighted keywords (terms):
  D1 → {(t1, w1), (t2, w2), …}
e.g.
  D1 → {(comput, 0.2), (architect, 0.3), …}
  D2 → {(comput, 0.1), (network, 0.5), …}
Retrieval
The problems underlying retrieval
Retrieval model
- How is a document represented with the selected keywords?
- How are document and query representations compared to calculate a score?
Vector space model
Vector space = all the keywords encountered
  <t1, t2, t3, …, tn>
Document
  D = <a1, a2, a3, …, an>
  ai = weight of ti in D
Query
  Q = <b1, b2, b3, …, bn>
  bi = weight of ti in Q
R(D,Q) = Sim(D,Q)
Matrix representation
       t1    t2    t3    …    tn
D1     a11   a12   a13   …    a1n
D2     a21   a22   a23   …    a2n
D3     a31   a32   a33   …    a3n
…
Dm     am1   am2   am3   …    amn
Q      b1    b2    b3    …    bn
The columns t1 … tn form the term vector space; the rows D1 … Dm form the document space.
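A minimal sketch of how such a matrix could be built from weighted keyword sets. It uses only the D1 and D2 terms shown in the indexing example (the elided terms are left out), and the query weights are an assumption.

```python
# Weighted keywords per document, limited to the terms shown in the example.
DOCS = {
    "D1": {"comput": 0.2, "architect": 0.3},
    "D2": {"comput": 0.1, "network": 0.5},
}

def to_vector(weights, terms):
    """Lay a {term: weight} mapping out along the fixed term order."""
    return [weights.get(t, 0.0) for t in terms]

def build_matrix(docs):
    """Return the ordered term list and one weight row per document."""
    terms = sorted({t for weights in docs.values() for t in weights})
    matrix = {doc_id: to_vector(weights, terms) for doc_id, weights in docs.items()}
    return terms, matrix

TERMS, MATRIX = build_matrix(DOCS)
QUERY = to_vector({"network": 1.0}, TERMS)  # hypothetical query weights

print(TERMS)
for doc_id, row in MATRIX.items():
    print(doc_id, row)
print("Q ", QUERY)
```

A term that does not occur in a document gets weight 0, so every row and the query have the same length n and can be compared position by position.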
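Some formulas for Sim
Dot product:
  $Sim(D,Q) = \sum_i (a_i \cdot b_i)$
Cosine:
  $Sim(D,Q) = \dfrac{\sum_i (a_i \cdot b_i)}{\sqrt{\sum_i a_i^2 \cdot \sum_i b_i^2}}$
Dice:
  $Sim(D,Q) = \dfrac{2 \sum_i (a_i \cdot b_i)}{\sum_i a_i^2 + \sum_i b_i^2}$
Jaccard:
  $Sim(D,Q) = \dfrac{\sum_i (a_i \cdot b_i)}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i (a_i \cdot b_i)}$
[Figure: vectors D and Q drawn in the plane spanned by the terms t1 and t2]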
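The four measures (dot product, cosine, Dice, Jaccard) can be written directly over the weight vectors D = <a1, …, an> and Q = <b1, …, bn>. The sketch below is one straightforward reading of them; the example vectors are made up, and returning 0 for an all-zero vector is a choice made here, not part of the definitions.

```python
import math

def dot(d, q):
    """Dot product: sum of a_i * b_i."""
    return sum(a * b for a, b in zip(d, q))

def cosine(d, q):
    """Dot product divided by sqrt(sum a_i^2 * sum b_i^2)."""
    denom = math.sqrt(dot(d, d) * dot(q, q))
    return dot(d, q) / denom if denom else 0.0

def dice(d, q):
    """Twice the dot product divided by (sum a_i^2 + sum b_i^2)."""
    denom = dot(d, d) + dot(q, q)
    return 2 * dot(d, q) / denom if denom else 0.0

def jaccard(d, q):
    """Dot product divided by (sum a_i^2 + sum b_i^2 - dot product)."""
    denom = dot(d, d) + dot(q, q) - dot(d, q)
    return dot(d, q) / denom if denom else 0.0

# Hypothetical weight vectors over three terms.
D = [1.0, 2.0, 0.0]
Q = [2.0, 1.0, 2.0]
for name, sim in [("dot", dot), ("cosine", cosine), ("dice", dice), ("jaccard", jaccard)]:
    print(f"{name}: {sim(D, Q):.4f}")
```

For these vectors the dot product is 4, the sum of squares is 5 for D and 9 for Q, so cosine is 4/√45, Dice is 8/14, and Jaccard is 4/10.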
(Classic) Presentation of results
Query evaluation result is a list of documents, sorted by their similarity to the query.
E.g.
  doc1  0.67
  doc2  0.65
  doc3  0.54
  …
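Producing such a list amounts to sorting the documents by score in descending order. A minimal sketch, using the three scores from the example; the `top_k` cut-off is an added assumption.

```python
def rank(scores, top_k=None):
    """Sort (document, score) pairs from most to least similar."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

# Similarity scores from the example, deliberately given out of order.
SCORES = {"doc3": 0.54, "doc1": 0.67, "doc2": 0.65}
for doc_id, score in rank(SCORES):
    print(f"{doc_id}  {score:.2f}")
```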
IR on the Web
- No stable document collection (spider, crawler)
- Duplication
- Huge number of documents
- Multimedia documents
- Multilingual problem
- …
Tips for smarter Internet searching
- Use unique, specific terms
- Use the minus operator (-) to narrow the search
    yarmouk -university
- Use quotation marks to find the consecutive words of a phrase, such as "flower arrangement"
- Enter a short question, such as "what time is it in amman?", "3.55*4.5-11 =", "who is the king of england?", "what is the distance between the sun and earth"
Smarter Internet Searching
inurl:test results
- Only "test" must be found in the web address (URL).
allinurl:test results
- Both "test" AND "results" must be found in the web address.
define:
- Provides definitions of the words, gathered from various online sources.
- Example: define: search engine
Smarter Internet Searching
allintext:
- Sometimes you get pages that do not have your search term/phrase in them.
- Why? Because Google also matches search terms in the text of links that point to a page, not only in the page itself.
- Use allintext: to get only those pages that have your search terms in their text.
Smarter Internet Searching
allinanchor:
- Returns only pages whose incoming links contain your search terms in the link text; the terms need not appear in the pages themselves.
- This is the opposite of allintext.
site:
- Limit your search to a specific web site.
- Examples:
    students site:yu.edu.jo
    students site:yu.edu.jo filetype:pdf
Smarter Internet Searching
- Don't use common words and punctuation
  - Use common words and punctuation marks only when searching for a specific phrase inside quotes
- Most search engines do not distinguish between uppercase and lowercase
- Make the most of AutoComplete
Smarter Internet Searching
The wildcard operator (*)
- Google calls it the fill-in-the-blank operator.
- For example, amusement * will return pages with "amusement" and any other term(s) the Google search engine deems relevant.
- Using a wildcard (*) for a character does not work in Google: cat* returns the same results as cat.
Smarter Internet Searching
Related sites:
- For example, related:www.yu.edu.jo can be used to find sites similar to the Yarmouk University site.
Specific file type:
- For example: information retrieval filetype:ppt
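The operators from these tips can also be combined programmatically. The helper below is a hypothetical sketch (`build_query` is not part of any library) that assembles a query string and URL-encodes it with the standard `urllib.parse.quote_plus`. The Google search URL prefix in the last line is an assumption about where such a query would be sent.

```python
from urllib.parse import quote_plus

def build_query(terms="", phrase=None, exclude=(), site=None, filetype=None):
    """Assemble a search string from plain terms and the operators above."""
    parts = []
    if terms:
        parts.append(terms)
    if phrase:
        parts.append(f'"{phrase}"')             # exact phrase in quotation marks
    parts.extend(f"-{word}" for word in exclude)  # minus operator
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)

QUERY = build_query("students", site="yu.edu.jo", filetype="pdf")
print(QUERY)

# Assumed URL prefix; quote_plus turns spaces into '+' and escapes ':'.
print("https://www.google.com/search?q=" + quote_plus(QUERY))
```

The first call reproduces the example query students site:yu.edu.jo filetype:pdf; `build_query("yarmouk", exclude=["university"])` gives yarmouk -university.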
Examples
Searching for papers
- YU library
- Google Scholar
Searching for instructor resources
- Morgan Kaufmann
- Pearson
Examples
Searching for books to buy
- Amazon.com
- Ebay.com
Searching for items to buy
- Electronics: bestbuy.com
Searching for hotels
- Expedia.com
- Priceline.com
- Booking.com
Examples
Regional search
- Google jo
Searching for images
- Google Images
Searching for a job
- Jobsinacademia.net
- Academickeys.com