Searching through the Internet. Dr. Eslam Al Maghayreh, Computer Science Department, Yarmouk University


1

Searching through the Internet

Dr. Eslam Al Maghayreh

Computer Science Department

Yarmouk University

2

Outline

- Introduction
- Information Retrieval
- Indexing
- Smarter Internet Searching
- Examples

Introduction

- The Internet holds an enormous quantity of information:
  - billions of web pages
  - thousands of newsgroups
- Two questions face any information seeker:
  - (1) How can I find what I want?
  - (2) How can I know that what I find is any good?

3

4

Information Retrieval

- Goal = find documents relevant to an information need from a large document set

[Figure: an information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list.]

5

Example

[Figure: the same retrieval diagram, with Google as the IR system and the Web as the document collection.]

Search Engine

Consists of:

- the interface you use to type in a query
- an index of Web sites that the query is matched with
- a software program (called a spider or bot) that goes out on the Web and gets new sites for the index
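The three parts of a search engine can be sketched in a few lines. This is only a toy illustration: the "web" is a hypothetical in-memory dictionary (the `*.example` addresses and the names `TOY_WEB`, `crawl`, and `search` are assumptions made for the example), and a real spider would fetch pages over the network.

```python
from collections import deque

# Hypothetical miniature "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("yarmouk university home", ["b.example", "c.example"]),
    "b.example": ("computer science department", ["a.example"]),
    "c.example": ("university library search", []),
}

def crawl(seed):
    """Spider: follow links breadth-first from the seed and index each page."""
    index = {}                          # word -> set of URLs containing it
    seen, queue = {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        text, links = TOY_WEB[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link in TOY_WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

def search(index, query):
    """Interface: return the URLs that contain every word of the query."""
    hits = [index.get(word, set()) for word in query.lower().split()]
    return sorted(set.intersection(*hits)) if hits else []

index = crawl("a.example")
print(search(index, "university"))      # ['a.example', 'c.example']
```

The index is built once by the spider; each query then only looks words up in the index instead of visiting pages again.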

6

7

IR problem

- First applications: in libraries (1950s)
    ISBN: 0-201-12227-8
    Author: Salton, Gerard
    Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
    Publisher: Addison-Wesley
    Date: 1989
    Content: <Text>
- External attributes and internal attribute (content)
- Search by external attributes = search in DB
- IR: search by content

8

Possible approaches

1. String matching (linear search in documents)
   - Slow

2. Indexing
   - Fast
   - Flexible for further improvement
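The difference between the two approaches can be seen in a minimal sketch. The three documents and the function names are assumptions made for illustration: string matching rescans every document for every query, while indexing scans the documents once and then answers each query with a lookup.

```python
# Hypothetical document collection used only for this illustration.
docs = {
    "D1": "computer architecture basics",
    "D2": "computer network protocols",
    "D3": "network security",
}

def linear_search(term):
    """String matching: scan the text of every document at query time."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())

def build_index():
    """Indexing: scan the documents once and record where each term occurs."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index()

def indexed_search(term):
    """Look the term up directly instead of rescanning the documents."""
    return sorted(index.get(term, set()))

print(linear_search("network"))    # ['D2', 'D3']
print(indexed_search("network"))   # ['D2', 'D3']
```

Both functions return the same answer; the index only changes how much work is done per query.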

9

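[Figure: the query is indexed into a query representation and the documents are indexed into a document representation (the index); a comparison function matches the two representations and produces the results.]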

10

Main problems in IR

- Query evaluation (or retrieval process)
  - To what extent does a document correspond to a query?
- System evaluation
  - How good is a system?
  - Are the retrieved documents relevant? (precision)
  - Are all the relevant documents retrieved? (recall)
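Precision and recall are usually computed from two sets: the documents the system retrieved and the documents that are actually relevant. A minimal sketch, with hypothetical document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents retrieved, 3 documents truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p)             # 0.5  (2 of the 4 retrieved documents are relevant)
print(round(r, 2))   # 0.67 (2 of the 3 relevant documents were retrieved)
```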

11

Document indexing

- Goal = find the important meanings and create an internal representation
- Factors to consider:
  - Accuracy in representing meanings (semantics)
  - Exhaustiveness (cover all the contents)
  - Ease of manipulation by computer
- What is the best representation of contents?
  - Word: good coverage, not precise
  - Phrase: poor coverage, more precise
  - Concept: poor coverage, precise

[Figure: Word, Phrase, and Concept placed on axes of coverage (recall) against accuracy (precision).]

12

Keyword selection and weighting

- How to select important keywords?
  - Simple method: use middle-frequency words
  - Search engines usually disregard minor words such as "the", "and", "to", etc.

[Figure: frequency and informativity curves plotted against word rank (1, 2, 3, …), with Max. and Min. marked on the frequency/informativity axis.]
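One possible reading of the middle-frequency method is sketched below: drop the minor words, count the rest, and keep only the words whose counts fall between two thresholds. The stop-word list and the thresholds `min_count` and `max_count` are assumptions chosen for this tiny example, not values given in the slides.

```python
from collections import Counter

# Assumed, illustrative stop-word list; real systems use much longer lists.
STOP_WORDS = {"the", "and", "to", "a", "of", "in", "is"}

def select_keywords(text, min_count=2, max_count=3):
    """Keep words that are neither too rare nor too frequent."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    return {w: c for w, c in counts.items() if min_count <= c <= max_count}

text = ("the network and the computer to the network "
        "computer network network protocol")
print(select_keywords(text))   # {'computer': 2}
```

Here "network" (4 occurrences) is treated as too frequent and "protocol" (1 occurrence) as too rare, so only "computer" is kept.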

13

Result of indexing

- Each document is represented by a set of weighted keywords (terms):
    D1 → {(t1, w1), (t2, w2), …}
- e.g.
    D1 → {(comput, 0.2), (architect, 0.3), …}
    D2 → {(comput, 0.1), (network, 0.5), …}
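The slides do not say how the weights are computed. As one simple assumption, the sketch below uses relative term frequency (occurrences of the term divided by the number of kept terms). It also skips stemming, so it produces "computer" rather than the stem "comput" shown above.

```python
from collections import Counter

def index_document(text, stop_words=frozenset({"the", "and", "to", "of", "a"})):
    """Return {term: weight}, where weight is the term's relative frequency.

    Relative frequency is an assumed weighting scheme for illustration only.
    """
    terms = [w for w in text.lower().split() if w not in stop_words]
    counts = Counter(terms)
    total = len(terms)
    return {term: count / total for term, count in counts.items()}

d1 = index_document("the computer and the architecture of the computer")
for term, weight in d1.items():
    print(term, round(weight, 2))
# computer 0.67
# architecture 0.33
```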

14

Retrieval

- The problems underlying retrieval
  - Retrieval model
    - How is a document represented with the selected keywords?
    - How are document and query representations compared to calculate a score?

15

Vector space model

- Vector space = all the keywords encountered
    <t1, t2, t3, …, tn>
- Document
    D = <a1, a2, a3, …, an>
    ai = weight of ti in D
- Query
    Q = <b1, b2, b3, …, bn>
    bi = weight of ti in Q
- R(D,Q) = Sim(D,Q)
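A minimal sketch of how weighted keyword sets can be turned into vectors over one shared vocabulary. The document weights reuse the D1 and D2 examples from the indexing slide; the query weights and the helper name `to_vector` are assumptions made for illustration.

```python
def to_vector(weights, vocabulary):
    """Place each term weight at its vocabulary position; absent terms get 0."""
    return [weights.get(term, 0.0) for term in vocabulary]

d1 = {"comput": 0.2, "architect": 0.3}
d2 = {"comput": 0.1, "network": 0.5}
q = {"comput": 1.0, "network": 1.0}      # hypothetical query weights

# The vector space is every keyword encountered, in a fixed order.
vocabulary = sorted(set(d1) | set(d2) | set(q))
print(vocabulary)                  # ['architect', 'comput', 'network']
print(to_vector(d1, vocabulary))   # [0.3, 0.2, 0.0]
print(to_vector(d2, vocabulary))   # [0.0, 0.1, 0.5]
print(to_vector(q, vocabulary))    # [0.0, 1.0, 1.0]
```

Because every vector uses the same term order, position i always refers to the same term ti, which is what allows documents and queries to be compared component by component.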

16

Matrix representation

        t1    t2    t3    …    tn
D1      a11   a12   a13   …    a1n
D2      a21   a22   a23   …    a2n
D3      a31   a32   a33   …    a3n
…
Dm      am1   am2   am3   …    amn

Q       b1    b2    b3    …    bn

Labels: term vector space (columns t1 … tn); document space (rows D1 … Dm).

17

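Some formulas for Sim

Dot product:
$$\mathrm{Sim}(D,Q) = \sum_i (a_i * b_i)$$

Cosine:
$$\mathrm{Sim}(D,Q) = \frac{\sum_i (a_i * b_i)}{\sqrt{\sum_i a_i^2 * \sum_i b_i^2}}$$

Dice:
$$\mathrm{Sim}(D,Q) = \frac{2 \sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2}$$

Jaccard:
$$\mathrm{Sim}(D,Q) = \frac{\sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i (a_i * b_i)}$$

[Figure: vectors D and Q drawn in the plane spanned by terms t1 and t2.]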
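The four measures named on this slide (dot product, cosine, Dice, Jaccard) can be written directly from their standard vector-space definitions. The sketch below assumes D and Q are equal-length lists of weights; the example vectors are hypothetical.

```python
import math

def dot(a, b):
    """Dot product: sum of a_i * b_i."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def dice(a, b):
    """Twice the dot product divided by the sum of the squared lengths."""
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0

def jaccard(a, b):
    """Dot product divided by (sum of squared lengths minus dot product)."""
    denom = dot(a, a) + dot(b, b) - dot(a, b)
    return dot(a, b) / denom if denom else 0.0

d = [1.0, 1.0, 0.0]   # hypothetical document vector
q = [1.0, 0.0, 0.0]   # hypothetical query vector
print(dot(d, q))                  # 1.0
print(round(cosine(d, q), 4))     # 0.7071
print(round(dice(d, q), 4))       # 0.6667
print(jaccard(d, q))              # 0.5
```

Unlike the plain dot product, the other three measures normalise by the vector lengths, so a long document does not score higher merely because it contains more terms.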

18

(Classic) Presentation of results

- The query evaluation result is a list of documents, sorted by their similarity to the query.
- E.g.
    doc1  0.67
    doc2  0.65
    doc3  0.54
    …
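Producing such a list amounts to sorting the documents by score in descending order. A minimal sketch using the scores from the example; the function name `rank` and the optional `top_k` cut-off are assumptions.

```python
def rank(scores, top_k=None):
    """Sort (document, score) pairs from most to least similar."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k] if top_k is not None else ranked

scores = {"doc3": 0.54, "doc1": 0.67, "doc2": 0.65}
for doc_id, score in rank(scores):
    print(doc_id, score)
# doc1 0.67
# doc2 0.65
# doc3 0.54
```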

19

IR on the Web

- No stable document collection (spider, crawler)
- Duplication
- Huge number of documents
- Multimedia documents
- Multilingual problem
- …

Tips for smarter Internet searching

- Use unique, specific terms
- Use the minus operator (-) to narrow the search
    yarmouk -university
- Use quotation marks to search for the consecutive words of a phrase, such as "flower arrangement"
- Enter a short question, such as "what time is it in amman?", "3.55*4.5-11 =", "who is the king of england?", "what is the distance between the sun and earth"

20

Smarter Internet Searching

- inurl:test results
  - Only test must be found in the web address (URL)
- allinurl:test results
  - Both test AND results must be found in the web address
- define:
  - Provides definitions of the words, gathered from various online sources
  - Example: define: search engine

21

Smarter Internet Searching

- allintext:
  - Sometimes you get pages that do not contain your search term or phrase.
  - Why? Because Google also matches words in the text of links that point to a page, not only words on the page itself.
  - Use allintext: to get only those pages that contain your search terms in their text.

22

Smarter Internet Searching

- allinanchor:
  - Returns only pages whose incoming links contain your search terms in the link text, whether or not the terms appear in the pages themselves.
  - This is the opposite of allintext.
- site:
  - Limits your search to a specific web site.
  - Examples:
      students site:yu.edu.jo
      students site:yu.edu.jo filetype:pdf

23

Smarter Internet Searching

- Don't use common words and punctuation
  - Exception: keep common words and punctuation marks when searching for a specific phrase inside quotes
- Most search engines do not distinguish between uppercase and lowercase
- Make the most of AutoComplete

24

Smarter Internet Searching

- The wildcard operator (*): Google calls it the fill-in-the-blank operator.
  - For example, amusement * will return pages with amusement and any other term(s) the Google search engine deems relevant.
- Using a wildcard (*) for a character does not work in Google: cat* returns the same results as cat.

25

Smarter Internet Searching

- Related sites:
  - For example, related:www.yu.edu.jo can be used to find sites similar to the Yarmouk University site.
- Specific file type:
  - For example: Information retrieval filetype:ppt
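The operators from these slides are plain text inside the query, so they can be combined with ordinary string handling. The helper names `build_query` and `search_url` are hypothetical, and the `https://www.google.com/search?q=...` address form is an assumption about how a query is usually passed to Google in a URL.

```python
from urllib.parse import urlencode

def build_query(terms, phrase=None, exclude=(), site=None, filetype=None):
    """Combine plain terms with the quote, minus, site: and filetype: operators."""
    parts = [terms] if terms else []
    if phrase:
        parts.append(f'"{phrase}"')                    # exact phrase in quotes
    parts.extend(f"-{word}" for word in exclude)       # minus operator
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)

def search_url(query):
    """URL-encode the query (assumed Google search URL form)."""
    return "https://www.google.com/search?" + urlencode({"q": query})

print(build_query("students", site="yu.edu.jo", filetype="pdf"))
# students site:yu.edu.jo filetype:pdf
print(build_query("yarmouk", exclude=["university"]))
# yarmouk -university
print(search_url(build_query("", phrase="flower arrangement")))
```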

26

Examples

- Searching for papers
  - YU library
  - Google Scholar
- Searching for instructor resources
  - Morgan Kaufmann
  - Pearson

27

Examples

- Searching for books to buy
  - Amazon.com
  - Ebay.com
- Searching for items to buy
  - Electronics: bestbuy.com
- Searching for hotels
  - Expedia.com
  - Priceline.com
  - Booking.com

28

Examples

- Regional search
  - Google jo
- Searching for images
  - Google Images
- Searching for a job
  - Jobsinacademia.net
  - Academickeys.com

29

The End.

30