Search Engines: Introduction
Search Engine Overview
User
  - What am I looking for? - Identification of information need
  - What question do I ask? - Query formulation
Intermediary
  - What is the searcher looking for? - Discovery of user's information need
  - How should the question be posed? - Query representation
  - Where is the relevant information? - Query-document matching
Information
  - What data to collect? - Collection development
  - What information to index? - Indexing/Representation
  - How to represent it? - Data structure
Search Engines 2
[Diagram: Query, Searchable Index, Search Results]
Search
  (1) Query Indexing
  (2) Document Ranking
  (3) Result Display
Data
  1. Document Collection - e.g., spider/crawler
  2. Document Indexing
     - term indexing (tokenizing, stop & stem)
     - term weighting
Search Engine: Data
Document Collection
  - Select target data sources - e.g., domain, corpus, WWW
  - Harvest data - e.g., data entry, data import, spider/crawler
Document Indexing
  - Select indexing sources (index terms) - e.g., metadata, keywords, content
  - Extract indexing terms - e.g., tokenization, stop & stem
  - Assign term weights - e.g., tf-idf, Okapi
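
The term-weighting step can be illustrated with a minimal sketch. The slides name tf-idf but do not give a formula, so the variant used here (raw term frequency times log(N / df)) is an assumption, as are the function name tf_idf and the small pre-tokenized sample documents.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical, already-tokenized documents (illustrative only).
docs = [
    ["information", "retrieval", "seminar"],
    ["retrieval", "model", "information", "retrieval"],
    ["information", "model"],
]
weights = tf_idf(docs)
```

With this variant, a term that occurs in every document (here "information") gets weight 0, while a term found in only one document (here "seminar") gets the highest idf.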
“The frequency of word occurrence in an article furnishes a useful measurement of word significance.”
- The words that appear in a document can be used to analyze its content, and how often a word occurs serves as a measure of its importance as a subject term.
Luhn, H.P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.
Search Engine: Indexing Process
Documents (Text)
  -> Tokenization
  -> Token Selection
  -> Token Normalization
  -> Sequential Index
  -> Term Weighting
  -> Inverted Index
Documents
  D1: Information retrieval seminars
  D2: Retrieval Models and Information Retrieval
  D3: Information Model

Tokens
  D1: information, retrieval, seminar(s)
  D2: retrieval, model(s), and, information, retrieval
  D3: information, model

Sequential index
  D1: information 1, retrieval 1, seminar 1
  D2: information 1, model 1, retrieval 2
  D3: information 1, model 1

Inverted index
  Index Term         D1  D2  D3
  wd1 (information)   1   1   1
  wd2 (model)         0   1   1
  wd3 (retrieval)     1   2   0
  wd4 (seminar)       1   0   0
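
The indexing process for D1 to D3 can be sketched in a few lines. This is a minimal illustration, not the method behind the slides: the stop list, the plural-stripping rule (a crude stand-in for a real stemmer), and the names tokenize, normalize, and index_terms are all assumptions.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"and", "the", "of", "a", "an"}  # assumed, deliberately tiny stop list

def tokenize(text):
    """Tokenization: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def normalize(token):
    """Token normalization: crude plural stripping, standing in for a stemmer."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def index_terms(text):
    """Tokenize, drop stop words (token selection), then normalize."""
    return [normalize(t) for t in tokenize(text) if t not in STOP_WORDS]

docs = {
    "D1": "Information retrieval seminars",
    "D2": "Retrieval Models and Information Retrieval",
    "D3": "Information Model",
}

# Sequential index: document -> {term: frequency}
sequential = {doc_id: Counter(index_terms(text)) for doc_id, text in docs.items()}

# Inverted index: term -> {document: frequency}
inverted = defaultdict(dict)
for doc_id, counts in sequential.items():
    for term, freq in counts.items():
        inverted[term][doc_id] = freq
```

The inverted index stores only non-zero entries, so "retrieval" maps to D1 and D2 but has no entry for D3, which corresponds to the 0 cell in the term-document table.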
Search Engine: Search
Query Indexing
  - Tokenization
  - Stop & Stem
  - Term Weighting
Document Ranking
  - Query-document matching
  - Document score computation
Result Display
  - Content - e.g., title & snippets
  - Layout - e.g., grouped by category
  - Toppings - e.g., related searches
Query: What is information retrieval?
Q: information 1, retrieval 1

Index Term         D1  D2  D3
wd1 (information)   1   1   1
wd2 (model)         0   1   1
wd3 (retrieval)     1   2   0
wd4 (seminar)       1   0   0

Rank  docID  score
1     D2     3
2     D1     2
3     D3     1
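
The scores in the ranking are consistent with a simple inner product between the query term weights and each document's term frequencies. The sketch below reproduces them under that assumption; the stop list and the names index_query and rank are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"what", "is", "a", "an", "the", "and", "of"}  # assumed stop list

# Term frequencies from the index table (zero cells omitted).
index = {
    "information": {"D1": 1, "D2": 1, "D3": 1},
    "model": {"D2": 1, "D3": 1},
    "retrieval": {"D1": 1, "D2": 2},
    "seminar": {"D1": 1},
}

def index_query(text):
    """Query indexing: tokenize, drop stop words, count remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def rank(query, index):
    """Score each document as the sum of query weight * document weight."""
    scores = {}
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + q_weight * d_weight
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

query = index_query("What is information retrieval?")
ranking = rank(query, index)
# ranking == [("D2", 3), ("D1", 2), ("D3", 1)]
```

Only documents that share at least one term with the query receive a score, which is why the inverted index is traversed per query term rather than per document.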
Query: 정보검색 (Information Retrieval), 2015

Result Categories
  1. Encyclopedia
  2. Naver Books
  3. Q&A DB (지식iN)
  4. Magazine
  5. Café
  6. Blog
  7. Book
  8. Map
  9. Website
  10. Advertisement (파워링크)
  11. Image
  12. Webpage
  13. Naver News Library
  14. Video
  15. Naver AppStore
  16. Naver Scholar
  17. Naver Post
  18. Naver Shopping
  19. News
  20. Naver Dictionary

- Proprietary (Naver-specific) content
- Dynamic category order
- Toppings
  • Search by Category
  • Related Searches
  • Popular Searches (by category)
Query: 정보검색 (Information Retrieval), 2020

NAVER 2020                                NAVER 2015
1. Encyclopedia (지식백과)                2. Encyclopedia -> 1. Encyclopedia
Query: 검색엔진 (Search Engine), 2020

NAVER 2020 (검색엔진)             NAVER 2020 (정보검색)
1. Advertisement (파워링크)       1. Encyclopedia (지식백과)
2. Encyclopedia (지식백과)        2. Naver Dictionary (어학사전)
3. Naver Dictionary (어학사전)    3. Website (웹사이트)
4. Website (웹사이트)             4. Advertisement (파워링크)
5. Naver Post (포스트)            5. Naver Post (포스트)
6. Advertisement (비즈사이트)     6. Blog (블로그)
7. Blog (블로그)                  7. Video
8. Video                          8. Online Open Courses (온라인공개강좌)
9. Q&A DB (지식iN)                9. Q&A DB (지식iN)
10. Café (카페)                   10. Café (카페)
10. Naver AppStore (앱정보)       10. Naver AppStore (앱정보)
10. Naver Books (Naver 책)        10. Naver Books (Naver 책)
10. Image                         10. Image
Query: Information Retrieval, 2015

Result Categories
  1. Webpage
  2. Advertisement

- Webpage-centric content
- Dynamic category order
- Toppings
  • Search by Category
  • Related Searches
Query: Information Retrieval, 2020

Google 2020             NAVER 2020
1. Wikipedia            1. Encyclopedia (지식백과)
2. Knowledge Panel      2. Naver Dictionary (어학사전)
3. Answer Box           3. Website (웹사이트)
4. Webpage              4. Advertisement (파워링크)
-                       5. Naver Post (포스트)
-                       6. Blog (블로그)
-                       7. Video
-                       8. Online Open Courses (온라인공개강좌)
-                       9. Q&A DB (지식iN)
-                       10. Café
-                       10. AppStore, Books, Image
5. Related Searches     Related Searches (연관검색어)

Top Categories          Top Categories (subset)
Image                   Naver Dictionary (어학사전)
Video                   Image
News                    Blog
Books                   News
Maps                    Books
Shopping                Encyclopedia (지식백과)
Finance                 Website
Query: Search Engines, 2020

Google 2020 (Search Engines)    Google 2020 (Information Retrieval)
1. Wikipedia                    1. Wikipedia
2. Knowledge Panel              2. Knowledge Panel
3. Answer Box                   3. Answer Box
4. Disambiguation Box           4. Webpage
5. Webpage                      -
6. Top Stories (News)           -
7. Webpage                      -
8. Related Searches             5. Related Searches

Google SERP Features by Overthink Group
  - Knowledge Graphs
  - Knowledge Panel
  - Answer Box
  - Disambiguation Box
  - Carousels
  - Google Posts
Search Engine vs. Database vs. Directories
Feature            Search Engine               Database                    Directories
Corpus Type        General                     Specific                    General/Specific
Data Collection    Automatic - crawler/spider  Manual - data entry/import  Manual - classification
Data Quality       Not controlled              Controlled                  Controlled
Data Organization  None (bag-of-words)         Structured - Relational     Structured - Hierarchical
Query Input        Text box                    Field-specific - Boolean    Text box
Search Result      Ranked - documents          Not ranked - records        Ranked - categories
Search Index       Document text               Database Tables             Category Tree
e.g.               Google                      Library Search              curlie.org
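
The database column (field-specific Boolean query, unranked records) can be contrasted with ranked search in a short sketch. The catalog records and the function name boolean_search are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical catalog records (illustrative only).
records = [
    {"title": "Information Retrieval", "author": "Kim", "year": 2015},
    {"title": "Search Engines", "author": "Lee", "year": 2020},
    {"title": "Information Models", "author": "Kim", "year": 2020},
]

def boolean_search(records, **conditions):
    """Return records whose fields equal every given value (Boolean AND).

    Matches come back in stored order: a record either satisfies the
    query or it does not, so there is no score to rank by.
    """
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in conditions.items())
    ]

hits = boolean_search(records, author="Kim", year=2020)
```

Unlike the ranked document list a search engine returns, every hit here is an exact match and all hits are equally relevant as far as the query is concerned.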