WIRED Week 4: Syllabus Review; Readings Overview (Web IR Chapter; Brin & Page - Google; Kobayashi & Takeda – Overview); Search Engine Optimization; Assignment Overview & Scheduling; Projects and/or Papers Discussion (Idea Pitch; Group Formation)

WIRED Week 4

• Syllabus Review
• Readings Overview
  - Web IR Chapter
  - Brin & Page - Google
  - Kobayashi & Takeda – Overview
• Search Engine Optimization
• Assignment Overview & Scheduling
• Projects and/or Papers Discussion
  - Idea Pitch
  - Group Formation

Web makes IR an everyday activity

• Search Engines
• Search Interfaces
• The openness of the Web changes everything
  - Access
  - Technological progress
  - Expectation
  - Credibility
  - Networks and Networking

How Much Information Is Out There?

• UC Berkeley project
• Center for the Digital Future
• Pew Internet & American Life Project
• What kind of information is it?
• What formats?
  - Information = Web pages?
  - Now
  - Future
• Who creates it?
• Why do they publish it?
• Content and Context

Investigation of Web Documents

• Are Web documents different?
  - Structure - HTML & other markup
    • Common tags
  - Content - “information” & commerce
    • Readability
    • Usability
  - Context - “sociological insights”, & spam
    • Links
  - Interest - topics, titles, keywords, file types
  - Interface - browsers (& crawlers)
• Older study, what’s new?
  - More multimedia
  - XHTML & XML
  - AJAX, REST, SOAP, Web 2.0?

Statistical Profiles of Highly-rated Web sites

• A Quality Checker
  - Good design makes better Web pages
• Look at popular pages & see what makes them popular
• We know good pages when we see (use) them
  - Different types of Web page structures
• Elements
  - Text, links & graphics (& their formatting)
  - Accessibility, size, errors, nav links (scent)
  - Architecture of site
• What makes these pages good for searching?

Content - Organizing & Accessing

• Distributed Data(base)
• Dynamic Data
  - Mobile
  - Ephemeral
• Huge Volume
• Unstructured and Redundant
• Quality
• Heterogeneous
  - Languages
  - Code pages

Measuring the Web

• How would you measure?
  - Size (crawling)
  - Surveys
  - Hits & Metering
  - Bandwidth use
• What do numbers mean?
  - Number of Hosts?
  - Number of Sites?
  - Number of Pages?
• Accurate +/- a lot

The Web is a Bowtie?

• Structure
  - One can pass from any node of IN through SCC (the strongly connected core) to any node of OUT
  - Hanging off IN & OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passage through SCC
  - A TENDRIL hanging off IN can be hooked into a TENDRIL leading into OUT, forming a TUBE: a passage from a portion of IN to a portion of OUT without touching SCC
• Broder et al., 2000
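
The bowtie regions can be worked out from plain reachability. The sketch below is only an illustration on a hypothetical toy graph (the node names, the `bowtie` and `reachable` helpers, and the choice of `core_node` are all assumptions, not taken from Broder et al.): SCC is the set of nodes that both reach and are reached from a chosen core node, IN reaches the core but is not reached from it, OUT is the reverse, and everything else (tendrils, tubes, disconnected pieces) is lumped into OTHER.

```python
from collections import defaultdict, deque


def reachable(graph, start):
    """Return every node reachable from `start` by following edges (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bowtie(edges, core_node):
    """Split nodes into SCC, IN, OUT and OTHER relative to core_node."""
    forward = defaultdict(set)   # page -> pages it links to
    backward = defaultdict(set)  # page -> pages that link to it
    nodes = set()
    for src, dst in edges:
        forward[src].add(dst)
        backward[dst].add(src)
        nodes.update((src, dst))
    from_core = reachable(forward, core_node)   # what the core can reach
    to_core = reachable(backward, core_node)    # what can reach the core
    scc = from_core & to_core
    return {
        "SCC": scc,
        "IN": to_core - scc,
        "OUT": from_core - scc,
        "OTHER": nodes - from_core - to_core,  # tendrils, tubes, islands
    }


# Hypothetical toy graph: a, b, c form the core; t1 sits on a tube.
edges = [
    ("a", "b"), ("b", "c"), ("c", "a"),
    ("in1", "a"), ("c", "out1"),
    ("in1", "t1"), ("t1", "out1"),
]
parts = bowtie(edges, "a")
for name in ("SCC", "IN", "OUT", "OTHER"):
    print(name, sorted(parts[name]))
```

Here `t1` lies on a path from IN to OUT that never touches the core, which is the TUBE case described on the slide.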

Web Search Engines

• Independent of IR model
• Distributed index and servers
  - Crawler
  - Query server
  - Indexer
• Crawlers and Spiders
  - Centralized control, Coordinated, Refresh, Filtering
  - Not the main problem
• Queries
  - Interface, processing, results
• Indexing
  - Data normalization, load balancing, data sharing

Harvesting

• Not just Web data
  - Caching, Duplication, Normalization
• Armies of crawlers
• Filtering collected data
• Gatherers
  - Collect and extract on various schedules
  - Work with several brokers
• Brokers
  - Index and interface to queries
  - Work with other Brokers and Gatherers
• Topical Agents?

Web Crawling Issues

• Follow chains of URLs to gather more URLs
• Extract index (content) from each page
• Lather-Rinse-Repeat
• Update crawler to-do list
• Associate frequency of crawls
• Breadth or Depth first?
• Endless looping
• Duplicate pages/sites
• Changed page (or not really?)
• Dynamically generated pages
• Intranet pages
• Markup language getting in the way
• NOROBOTS
• What should a crawler get?
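
Several of these issues (the to-do list, breadth versus depth first, endless looping, NOROBOTS) show up even in a tiny simulated crawl. The sketch below makes no network requests: `toy_web` is a hypothetical in-memory link map, and `blocked` stands in for pages a robots exclusion rule would keep the crawler away from. All names are assumptions for illustration.

```python
from collections import deque


def crawl(web, seed, blocked=frozenset(), max_pages=100):
    """Breadth-first crawl over an in-memory link map; returns visit order."""
    frontier = deque([seed])  # the crawler's to-do list
    seen = {seed}             # guards against endless looping and duplicates
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()  # popleft() = breadth first; pop() = depth first
        if url in blocked:        # excluded pages are neither fetched nor followed
            continue
        order.append(url)         # a real crawler would fetch and index here
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical site: "/a" links back to "/", which would loop without `seen`.
toy_web = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],
    "/b": ["/c", "/private"],
    "/c": [],
    "/private": ["/secret"],
}
visited = crawl(toy_web, "/", blocked={"/private"})
print(visited)
```

Because `/private` is skipped, `/secret` is never discovered, which is one concrete answer to "what should a crawler get?": only what it is allowed to reach.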

Indexing the Web

• Inverted File Index
  - Sorted words with pointers to location(s) & page(s)
  - Pointers are the focus (inversion)
• What about pages and sites?
  - Massive redundancy on well-organized sites
    • Navigation
    • Topics
    • Content
• “State of the art indexing techniques” = 30% of text (not page) size (p. 383)
• How can you tune an index for massively changing documents?
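
A minimal sketch of an inverted file index, assuming whitespace tokenization and lowercasing (both simplifications; the page ids and texts are hypothetical). Each word maps to the pages it occurs in, and each page entry holds the word positions, which are the "pointers" the slide refers to.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to {page_id: [positions]} across all pages."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word][page_id].append(position)
    return index


# Hypothetical two-page collection.
pages = {
    "p1": "web search engines index the web",
    "p2": "crawlers feed the index",
}
index = build_inverted_index(pages)

# Print the index with words in sorted order, as in a sorted word list.
for word in sorted(index):
    print(word, dict(index[word]))
```

A query for a word becomes a lookup in `index` rather than a scan of the page text, which is why ranking can work from the index alone.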

Ranking

• Boolean and Vector models mostly used
  - Why?
  - Works from the index, not the text
• Which ranking methods are best?
  - Datasets
  - Syntaxes
  - Users & Testing

Ranking Methods

• TF-IDF
  - Simple, smaller data sets
• Boolean Spread
  - Degrees of match
  - Within a document
  - Set of documents
  - Links between documents (meta docs?)
• Vector Spread
  - Standard cosine between query and index (to document)
  - Links with answer or pointing to answer
• Most Cited
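
To make TF-IDF and the "standard cosine" concrete, here is a small sketch under stated assumptions: whitespace tokenization, raw term counts for TF, and an IDF of `log(N / df) + 1` (one common variant, chosen here only so that shared words keep a nonzero weight). The documents and query are hypothetical, and the link-based "spread" step from the slide is not modeled.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return ({doc: {word: weight}}, idf) using tf * (log(N/df) + 1)."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter(w for words in tokenized.values() for w in set(words))
    idf = {w: math.log(n_docs / count) + 1.0 for w, count in df.items()}
    vectors = {
        name: {w: tf * idf[w] for w, tf in Counter(words).items()}
        for name, words in tokenized.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Order document names by cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(query.lower().split())
    q = {w: tf * idf.get(w, 0.0) for w, tf in counts.items()}
    return sorted(docs, key=lambda name: cosine(q, vectors[name]), reverse=True)


# Hypothetical documents.
docs = {
    "d1": "google ranks web pages by links",
    "d2": "cooking recipes and related results",
    "d3": "web crawlers follow links",
}
ranking = rank("web links", docs)
print(ranking)
```

Both `d1` and `d3` contain the two query words, but `d3` is shorter, so the query terms make up a larger share of its vector and it scores higher under cosine.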

Is Web ranking different?

• Links are the difference that makes the difference
  - Internal links on a page
  - Internal links on a site
  - Relationships between sites
  - Link freshness
• Kleinberg’s HITS method (1998)
  - Hypertext Induced Topic Search
  - Number of pages that point to (processed) query
  - Authorities (relevant content by links)
  - Hubs (links to varied authorities)
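
The hub and authority scores reinforce each other: a page's authority is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The sketch below runs that iteration on a hypothetical link map; it assumes the query-specific subgraph has already been selected (a step HITS requires but that is not shown here), and the fixed iteration count is an arbitrary choice.

```python
def hits(links, iterations=50):
    """Iterate hub/authority updates on {page: [pages it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[dst] for dst in links.get(n, ())) for n in nodes}
        # Normalize both score sets so values stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical subgraph: two hub pages pointing at content pages.
links = {
    "hub1": ["pageA", "pageB"],
    "hub2": ["pageA", "pageB", "pageC"],
    "pageA": [],
}
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("authority scores:", {n: round(s, 3) for n, s in sorted(auth.items())})
```

`pageA` and `pageB` end up with equal authority because the same hubs point to both; `pageC` scores lower because only one hub links to it.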

Problems with Hubs & Authorities

• Are more links always better?
• What about pages without many outgoing links?
• How do you count multiple links from within one page to another?
• Do automatically generated sites/pages have an advantage?
  - CMS systems may have linking “fingerprints”
  - Metadata
• How varied are the link weights?
  - Simple counts
  - Modified by other IR measures

Anatomy of a Large-Scale Web Search Engine

• Initial Google Design
• PageRank
  - PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  - “A model of user behavior”
    • probability of a random surfer visiting a page is its PageRank
    • plus a damping factor (boredom)
  - Pages point to a page
  - Highly ranked pages point to a page
  - Anchor text is mined (the label for the link)
  - Proximity included
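
In the PageRank formula, T1 ... Tn are the pages that link to A, C(T) is the number of links going out of page T, and d is the damping factor (Brin & Page mention 0.85 as a usual setting). The sketch below applies the formula repeatedly to a hypothetical three-page link map. It is a toy illustration only: every page here has at least one outgoing link, the iteration count is arbitrary, and nothing about anchor text or proximity is modeled.

```python
def pagerank(links, d=0.85, iterations=50):
    """Apply PR(A) = (1-d) + d * sum(PR(T)/C(T)) for a fixed number of rounds."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical link map: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
for page in sorted(pr, key=pr.get, reverse=True):
    print(page, round(pr[page], 3))
```

C comes out on top because both other pages link to it, and A follows because it receives all of C's rank: being pointed to by a highly ranked page matters, not just the raw link count.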

Anatomy 2

• Repository of page content
• Document index
  - Forward (sorted)
  - Inverted (sorter)
• Lexicon of words & pointers
• Hit Lists of word occurrence(s)
• Crawlers
• Ranking
• Feedback of selection (~)

Popularity?

• Do you always want the most popular information source?
  - Talk Radio
  - New York Times Bestseller List
  - “Lincoln’s Doctor’s Dog”
  - “The C.S.I. Diet and Cookbook”
• Trend or Fad?
• Blogs, Editorials and Propaganda vs. “Facts”?
• Result Diversity
• Death of the Mid-List

Next Generation Web Search

• Search works well now (80%), but what’s next?
• We need to be user-focused, not data-focused
• How do we match search to the task?
  - Is it all about speed?
  - How could metadata support search tasks?
• Best search is browsing?
  - Faceted Search?
  - Suggesting = browsing for interfaces
    • Cooking
    • Related results
• Specialized interfaces
• Natural language queries (question answering)
• “Real world” metadata
  • Context, personalization, query specifics

Metasearch Issues

• One place for everything?
• First or Last place to look?
• Better or different interface?
• Combined, sorted results would be best
  - How to sort?
  - Sorting for different types of queries
• Syntax Errors
• State Information (monitoring)
• Copyright issues (robots)
• User, content and interface mismatches/challenges
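
One possible answer to "How to sort?" when engines expose only ranked lists and no comparable scores is to merge by rank position. The sketch below is a swapped-in illustration, not a method from the readings: each result earns 1/rank from every engine that returns it, and ties are broken alphabetically. The engine lists and URLs are hypothetical.

```python
from collections import defaultdict


def merge_rankings(result_lists):
    """Merge ranked URL lists by summing 1/rank across engines."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Highest combined score first; alphabetical order breaks ties.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical results from two engines for the same query.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "x.example", "w.example"]
merged = merge_rankings([engine_a, engine_b])
print(merged)
```

Results that several engines agree on rise to the top, while a result seen by only one engine keeps whatever credit its position there earns. Different query types might call for a different weighting, which is the open question on the slide.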

Web Searching Metaphors

• How do people visualize the Web?
• Is Browsing better?
• Do we need new metaphors for using the Web?
  - Searching
  - Browsing
  - What else?

Assignments

• Read weekly Primary Readings & Participate in class discussions 10%
  - 1 page summaries
• Re-design Search Results interface 10%
• Web (log) analytics 20%
• Future of Search (“Google 2010”) (5 page paper) 10%
• Web Information Retrieval System Evaluation & Presentation 20%
• Main Project or Paper 30%

Re-design Search Results interface

• Choose a search engine (not Google) and re-design the query AND result page interfaces
  - Snap, Live, Ask, Technorati, Clusty, & many others…
• Discuss what the search features are and their interfaces
  - Highlight the good & the bad (or hard to understand or use)
  - Use your own perspective as a novice user or habitual user of the search engine
• Sketch, Photoshop &/or re-build the HTML pages to show your improved interface designs
  - Explain why you made the interface (& feature) changes
  - Illustrate how people would use the new interface
• Compare to other search engines or search tools & interfaces to give context to your re-design

System Evaluation & Presentation

- 5 page written evaluation of a Web IR System
- technology overview (how it works)
- a brief history of the development of this type of system (why it works better)
- intended uses for the system (who, when, why)
- (your) examples or case studies of the system in use & its overall effectiveness

Future of Search paper

• How can (Web) IR be better?
  - Better IR models
  - Better User Interfaces
• More to find vs. easier to find
• Scriptable applications
• New interfaces for applications
• New datasets for applications