Search and the ‘Net in 2009

Trends, Challenges and Cutting-Edge Developments in Internet Search

Michael Hunter
Reference Librarian
Hobart and William Smith Colleges

For Rochester Regional Library Council Member Libraries’ Staff

Sponsored by the

Rochester Regional Library Council

Supported by Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library 2008

For today . . .

Landscape of Search in 2009
Web 2.0 and Social Search
Web 3.0 – What, Where and What it can do
The Year at Google
New Services
Recent Developments at Established Services
Possible Directions for Search in the Future
Linklist

http://people.hws.edu/hunter/searchnet09.htm

How large is the Web?

Internet Systems Consortium www.isc.org

Web Search in 2008

Who’s crawling the Web?

Google
Yahoo
Live Search (MSN)
Ask (owns Teoma)
Gigablast

Size Estimates 7/9/08

Google and Yahoo! text filetypes, in millions

[Bar chart: html and htm file counts for Google and Yahoo!; axis from 0 to 8000 million]

Size Estimates 7/9/08

Google and Yahoo! text filetypes, in millions

[Bar chart: pdf, doc, xls, ppt and xml/rss file counts for Google and Yahoo!; axis from 0 to 700 million]

User Satisfaction

ForeSee Results and U. Michigan 8/14/07

[Bar chart: user satisfaction scores for Google, Yahoo! and Ask in '06 and '07; axis from 66 to 82]

Web 2.0 and Social Search

Services that
Let people form online communities
Facilitate collaboration
Operate across formats and platforms

Web Search: Read-only

[Diagram labels: Consumers, Content, Search Engine; full-text keyword search; machine-based ranking]

Web 2.0 Search: Read-Write-Create-Describe-Organize

[Diagram labels: Consumers, Creators, Taggers, Online Communities]

Searching Web 2.0

Enhances results beyond just full-text search

Users’ interactions within “community-based” systems can be used to infer query context and intent
Return all postings relevant to “steep slopes”
Return only postings by community members sharing the same interests, e.g. men with heart disease interested in “steep slopes”
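The difference between the two kinds of result can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Posting` and `Member` classes, the sample members and the interest labels are all hypothetical, and real community-based systems infer shared interests from far richer signals than a declared set of labels.

```python
from dataclasses import dataclass, field


@dataclass
class Posting:
    author: str
    text: str


@dataclass
class Member:
    name: str
    interests: set = field(default_factory=set)


def search_all(postings, query):
    """Plain full-text match: every posting that contains the query phrase."""
    q = query.lower()
    return [p for p in postings if q in p.text.lower()]


def search_community(postings, members, query, searcher):
    """Keep only matches written by members who share an interest with the searcher."""
    by_name = {m.name: m for m in members}
    wanted = by_name[searcher].interests
    return [
        p for p in search_all(postings, query)
        if by_name[p.author].interests & wanted
    ]


# Hypothetical community data, for illustration only.
members = [
    Member("al", {"heart disease"}),
    Member("bo", {"skiing"}),
    Member("cy", {"heart disease", "hiking"}),
]
postings = [
    Posting("bo", "Best steep slopes for expert skiers"),
    Posting("cy", "Walking steep slopes safely after bypass surgery"),
    Posting("bo", "Wax tips for spring snow"),
]

all_hits = search_all(postings, "steep slopes")
community_hits = search_community(postings, members, "steep slopes", searcher="al")
```

Here `all_hits` holds both "steep slopes" postings, while `community_hits` keeps only the one written by a member who shares the searcher's interest.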

Web 2.0 Applications

Flickr – Share and tag photos
YouTube – Share, tag, comment and subscribe to video content
Del.icio.us – Share, tag, store and organize web content
Yahoo Answers – Ask, answer, rate, comment on questions

Social Search Engines 2004-

Incorporate human-generated metadata in retrieval and ranking of search results
Tags – from creators and readers/viewers
Comments, ratings, edits, deletions
Searcher-supplied interest profiles

Online community-based search engines
Eurekster, Mahalo, Wikia Search, Delver, me.dium.com, Stumbleupon, Thagoo (meta)

Potential Advantages

At least one human has selected and endorsed each result

Readers generate most tags, not authors

Reduces impact of link spam in ranking

User input keeps results more current and relevant

Concerns

User-generated spam
Content vandalism
Unique searches benefit less
“Long tail” effect

Wikiasari: Quick rummaging search

“User ranked results”
Open source search engine by Jimmy Wales and Amazon
Initial results ordered with algorithms a la Google
Users reorder results, which will be used in ranking of future similar searches
Edits allowed on all search results
Strength is in general search topics
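How user reordering might feed back into later rankings can be sketched as follows. This is a minimal illustration, not the service's actual method: the `FeedbackRanker` class, the additive vote weight and the example URLs are assumptions, and "similar searches" is reduced here to the same query after lowercasing.

```python
from collections import defaultdict


class FeedbackRanker:
    """Blend an algorithmic score with accumulated user promotions and demotions."""

    def __init__(self, weight=1.0):
        # votes[query][url] sums +1 for each promotion and -1 for each demotion
        self.votes = defaultdict(lambda: defaultdict(int))
        self.weight = weight  # assumed: how much one vote counts against the score

    def record(self, query, url, delta):
        """Store one user action for this query (delta is +1 or -1)."""
        self.votes[query.lower()][url] += delta

    def rank(self, query, scored_results):
        """scored_results is a list of (url, algorithmic_score) pairs."""
        votes = self.votes[query.lower()]
        return sorted(
            scored_results,
            key=lambda r: r[1] + self.weight * votes.get(r[0], 0),
            reverse=True,
        )


ranker = FeedbackRanker()
results = [("a.example", 3.0), ("b.example", 2.5), ("c.example", 1.0)]

before = [url for url, _ in ranker.rank("steep slopes", results)]

# Three users promote the third result for the same query.
for _ in range(3):
    ranker.record("Steep Slopes", "c.example", +1)

after = [url for url, _ in ranker.rank("steep slopes", results)]
```

With no feedback the algorithmic order stands; after three promotions the promoted result moves to the top for later runs of that query.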

Web 3.0 – The Brave New Web

Web-based applications or services that combine metadata and artificial intelligence to provide a more productive and intuitive user experience

First used by John Markoff of the New York Times in 2006

Key Components of Web 3.0

Expanded from Wikipedia “Web 3.0” (Nova Spivack) http://en.wikipedia.org/wiki/Web_3.0

Semantic Web – Embedded Metadata
Intelligent Applications
Human Judgment and Analysis
Expanded Network Computing
Open Technologies – Open Source
Distributed Databases

Key Components of Web 3.0

Embedded metadata (Semantic Web)
Dublin Core, hCard, Creative Commons, hCalendar, RDF, hReview, GeoRSS, hAtom

<title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
<meta name="DC.title" content="Expressing Qualified Dublin Core in HTML/XHTML meta and link elements" />
<meta name="DC.description" content="This document is most recent version of Expressing Dublin Core in HTML/XHTML meta and link elements." />
<meta name="DC.creator" content="Powell, Andy" />
<meta name="DC.publisher" content="Dublin Core Metadata Initiative" />
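A search engine that wants to use this embedded metadata first has to pull it out of the page. The sketch below shows one way that could look, using only Python's standard `html.parser` module; the `DublinCoreParser` class name and the choice to key fields by the part after the `DC.` prefix are assumptions made for the example.

```python
from html.parser import HTMLParser

SAMPLE = """
<title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
<meta name="DC.title" content="Expressing Qualified Dublin Core in HTML/XHTML meta and link elements" />
<meta name="DC.creator" content="Powell, Andy" />
<meta name="DC.publisher" content="Dublin Core Metadata Initiative" />
"""


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.*" content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.lower().startswith("dc."):
            # Key by the element name after the "DC." prefix, e.g. "creator".
            key = name[3:].lower()
            self.fields.setdefault(key, []).append(attr.get("content") or "")


parser = DublinCoreParser()
parser.feed(SAMPLE)
fields = parser.fields
```

After parsing, `fields` maps each Dublin Core element found on the page to a list of its values, which could then be indexed alongside the full text.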

Key Components of Web 3.0

Intelligent applications
Natural Language Processing (NLP)
Information extraction
Machine learning
Data mining

Natural Language Processing (NLP)

With NLP software unstructured text and data can be processed to reveal degrees of meaning by
Extracting terms identified as significant
Summarizing content
Discovering relationships among terms and groups of terms

HOW???

NLP Extraction

Take all articles from a group of pharmaceutical journals published in one year (the “corpus”)

Extraction – Run a relevant controlled vocabulary (list of all known drugs) against the corpus

NLP Extraction

Drugs found, number of occurrences and location in the corpus plus a list of possible drugs not in the controlled vocabulary

86 > penicillin (click for locations)
124 > tetracycline (click for locations)
213 > aspirin (click for locations)
Are these also drugs? XXX, XXX, XXX
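The extraction step can be sketched as running each vocabulary term against every document and recording where it occurs. The code below is a minimal illustration: the two-article corpus is invented, and the suffix test used to suggest possible drugs outside the vocabulary is an assumed stand-in for whatever heuristics real NLP software applies.

```python
import re
from collections import defaultdict


def extract_terms(corpus, vocabulary):
    """Return {term: [(doc_id, char_offset), ...]} for each vocabulary term found."""
    hits = defaultdict(list)
    for term in vocabulary:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for doc_id, text in corpus.items():
            for m in pattern.finditer(text):
                hits[term].append((doc_id, m.start()))
    return dict(hits)


def candidate_terms(corpus, vocabulary, suffixes=("cillin", "cycline", "mycin")):
    """Suggest words that look like drug names but are not in the vocabulary.

    The suffix list is a hypothetical heuristic chosen for this example.
    """
    known = {t.lower() for t in vocabulary}
    found = set()
    for text in corpus.values():
        for word in re.findall(r"[A-Za-z]+", text):
            w = word.lower()
            if w not in known and w.endswith(suffixes):
                found.add(w)
    return sorted(found)


# Invented two-article corpus, for illustration only.
corpus = {
    "art1": "Penicillin remains first-line therapy. Aspirin was given with penicillin.",
    "art2": "Tetracycline and amoxicillin were compared; aspirin was the control.",
}
vocabulary = ["penicillin", "tetracycline", "aspirin"]

hits = extract_terms(corpus, vocabulary)
counts = {term: len(locations) for term, locations in hits.items()}
maybe_drugs = candidate_terms(corpus, vocabulary)
```

`counts` gives the number of occurrences per drug, `hits` gives the locations behind each count, and `maybe_drugs` plays the role of the "Are these also drugs?" list.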

NLP Summarization

Retain phrases surrounding the extracted term(s) with links to locations in the corpus (KWIC Index)

rare uses of penicillin
Often penicillin is contraindicated when
responds well to penicillin
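A keyword-in-context (KWIC) index of this kind can be sketched in a few lines. The function name `kwic`, the sample sentence and the two-word context window are assumptions for illustration; the word position stands in for the link back to the corpus.

```python
import re


def kwic(text, term, width=2):
    """Return (position, left context, keyword, right context) for each occurrence.

    `width` is the number of words kept on each side of the keyword.
    """
    words = re.findall(r"\w+", text)
    rows = []
    for i, word in enumerate(words):
        if word.lower() == term.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((i, left, word, right))
    return rows


text = (
    "Often penicillin is contraindicated when allergy is known. "
    "The infection responds well to penicillin."
)
rows = kwic(text, "penicillin")
```

Each row keeps just enough surrounding text to show how the term is being used, which is what makes the index work as a rough summary.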

NLP Summarization

Tag all words in the corpus with their grammatical function and search for noun – verb – noun and other syntactic patterns

(drug A) treats (disease B)
(drug C) causes (disease B)
(drug D) is contraindicated in (disease B)
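The pattern search can be illustrated with a simplified stand-in. Real systems tag every word with its part of speech first; the sketch below skips tagging and instead matches known drug names, relation phrases and disease names with a regular expression. The three word lists and the sample sentences are hypothetical and are not medical statements.

```python
import re

# Hypothetical vocabularies; a real system would use part-of-speech tags
# and full controlled vocabularies instead of these short lists.
DRUGS = ["penicillin", "aspirin", "tetracycline"]
DISEASES = ["pneumonia", "ulcers", "asthma"]
RELATIONS = ["treats", "causes", "is contraindicated in"]


def alternation(words):
    """Build a regex alternation, longest phrase first so multi-word phrases win."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


PATTERN = re.compile(
    rf"\b({alternation(DRUGS)})\s+({alternation(RELATIONS)})\s+({alternation(DISEASES)})\b",
    re.IGNORECASE,
)


def find_relations(text):
    """Return (drug, relation, disease) triples found in the text."""
    return [(d.lower(), r.lower(), s.lower()) for d, r, s in PATTERN.findall(text)]


text = (
    "Penicillin treats pneumonia. Aspirin causes ulcers in some patients. "
    "Aspirin is contraindicated in asthma."
)
relations = find_relations(text)
```

Each triple in `relations` corresponds to one noun – verb – noun match, ready to be counted or stored as a statement about the corpus.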

NLP Term Relationship

Queries answered by tracking references across sentences

Can penicillin cause shock?

“Penicillin treatment is not without risks. In certain cases it can trigger anaphylactic shock.”

NLP can do even more …

Word disambiguation
bank (river), bank (finances), bank (verb)
Retrieval of alternative word forms
Retrieval of variants in capitalization and spelling
Topic detection and tracking
Following different themes in a changing RSS feed
Machine translation

Key Components of Web 3.0

Human Judgment and Analysis
Web 2.0 - Selection, tagging, rating, comment

Expanded Network Computing
Distributed Computing
Cloud Computing
Grid Computing
Interoperability

Key Components of Web 3.0

Open Technologies
Open Source
Creative Commons
Open Archives Initiative

Distributed Databases
Structured data records in reusable and searchable formats
Standardized query language – SPARQL technology
RDF structured databases
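RDF stores data as subject – predicate – object statements, and a SPARQL query is built from patterns over those statements with variables in the open positions. The sketch below imitates that idea in plain Python rather than using a real RDF library; the `match` function and the `ex:doc1` / `ex:doc2` records are hypothetical.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# The ex: records are invented for this example.
TRIPLES = [
    ("ex:doc1", "dc:creator", "Powell, Andy"),
    ("ex:doc1", "dc:publisher", "Dublin Core Metadata Initiative"),
    ("ex:doc2", "dc:creator", "Powell, Andy"),
    ("ex:doc2", "dc:title", "A second document"),
]


def match(pattern, triples):
    """Yield variable bindings for one triple pattern.

    Pattern items starting with '?' are variables; anything else must match exactly.
    """
    for triple in triples:
        binding = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A variable repeated in the pattern must bind to the same value.
                if binding.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            yield binding


# Roughly what this SPARQL query asks:
#   SELECT ?doc WHERE { ?doc dc:creator "Powell, Andy" }
docs = [b["?doc"] for b in match(("?doc", "dc:creator", "Powell, Andy"), TRIPLES)]
```

Because every record uses the same statement shape, the same query works across databases that were built independently, which is the point of a standardized query language.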

Semantic Search Systems

Understand the user’s query
Understand Web text
Bring these together for query results that are contextually relevant
Algorithms that match the meanings and not just the words
Natural Language Processing
Concept Mapping

[Diagram labels]
Expanding processing capabilities: XML, NLP, Machine learning, Data mining, Open source
Web 2.0: Tags, Comments, Ratings, Online Communities, Dublin Core, RDF
Human judgment
Web 3.0: Semantic Search – Headup, Twine, Mahalo, the new Yahoo and more ...

New Services

Kosmix www.kosmix.com

Google interface
Offers overview of results by document type
Basic Facts, Reviews & Opinions, Media, People & Community, Shopping, News
Extensive clustering by subject
Blended results with thumbnails of images, video and audio clips, presentations and reports

Human-created “topic pages” for subjects of current interest

VideoCrawler

Video meta engine by AT&T
Searches over 1,600 online video sources by
Media type
Rating
Date of creation
Popularity

ChaCha www.chacha.com

Free mobile search service
Requires a (free) account
Text your questions and a human “guide” sends back an answer, limited to 160 characters

Supported by 98% of mobile providers

Highlights of this year at Google

Universal Search

Results from G.’s verticals blended into Web results and ranked together (Google as Portal?)

Launched 2007
Web Search, News, Book Search, Video, Images, Blog Search, Local/Maps, Product Search

Under the Hood

G “knows” 1 trillion web-based items (8/13/08)
Not indexed:
data records
calendar pages
other auto-generated content
duplicates

Supplemental Index incorporated into main index (1/3/08)

Results Ranking

Beyond Word Frequency, Links & PageRank
Algorithms incorporating

Language – interpreting phrases, synonyms, diacritics, spelling mistakes, etc.

Query – language used to express the search

Time – returning pages with freshness appropriate to the query

Personalization – Not all people want the same set of results

Personalization for each searcher

New ranking process based on
Wording of the query
Text relationships of the pages retrieved
Location
Recent searches executed at that computer
Functions with or without a Google account
“Focusing on the user’s intent and location”
On-screen message: “Customized based on recent search activity”
Launched mid-November ‘08

Evaluating Search

Automated evaluations every minute
Monitoring users’ click behavior (anonymous or personalized via Google account)
Google quality raters – hired to execute and evaluate specific queries
Vital, Useful, Relevant, Not Relevant, Off-topic

Google Quality Raters Guidelines 2007
http://www.mauriziopetrone.com/blog/wp-content/uploads/quality-rater-guidelines-2007.pdf

Over 450 improvements launched in 2007

New features and collections

SearchWiki - customize search by re-ranking, deleting, adding, and commenting on search results – requires free account
“Voice Search” - mobile app for iPhone allows queries to be spoken then run in Google
Life Photo Archive hosted in Image Search
Images from 1790 to today, most never before published
Searchable by category and decade
http://images.google.com/hosted/life or
Add source:life to any term(s) in Image Search

Google Books

2 divisions: Partner (Publisher) Program and Library Project

Partner Program
Publishers authorize G. to scan and make searchable the full text of their books
Users see only the page containing their search terms
Link to purchase copy
No download possible

Google Books

2 divisions: Partner Program and Library Project

Library Project (2004)
Scan and make searchable millions of books
For works in copyright, users see only a few sentences around search terms
Users may browse full text of public domain works
NOTE: Not possible to print ANY material from either Google Print project

Current Member Libraries

U. of California
Harvard
U. of Michigan
NYPL
Oxford
Stanford
U. of Virginia
U. of Wisconsin
U. of Texas (Austin)
U. Complutense (Madrid)
Princeton
Bayerische Staatsbibliothek

Copyright and the World of Books

The set of ALL books that are in the public domain

The set of ALL books that are copyrighted and in print (mostly in the Partner Program)

ALL other books are still in copyright, but out of print

Google Books in the Future

G. will sell electronic access to in-copyright, out-of-print works, with permission of publisher and author

37% of revenue to Google
63% of revenue to publisher and author

In-print, in-copyright works scanned through the Library project will be full-text searchable. No portion of these works will be shown unless they are also part of the Partner Program

Google Books in the Future

Public domain works will continue to be fully available as before

Individuals and institutions may purchase full online access to all Google Book titles

“Public and university libraries” will be able to offer free access to out-of-print, in-copyright titles through “designated terminals”.

http://books.google.com/agreement

Google’s Custom Search www.google.com/coop/cse/

A tool from the Google Coop initiative
Keywords chosen determine content and weighting of results (limit of 100 characters)
Search
Entire web
Your selected sites only
Entire web with selected sites emphasized

Within Coop, a CSE can be created and maintained collaboratively

Stored or Linked versions available

The Latest at Established Services

Why do I need more than Google?

The Google effect --
the single most powerful force in today’s Internet
a private profit-driven company
owns more information on individuals’ search behavior, companies and organizations than any other entity

Why do I need more than Google?

Great potential for misuse/abuse of this information for financial gain

Societies seldom leave basic services (utilities, medical and traffic regulation) totally to the “free market”

Is web search now a “basic service” ???

Search dominance ---

Potential skewing for commercial, political or social purposes
Database composition
Ranking
Privacy

No single search engine can crawl the whole web

Limits search features, results display, consumer and shopping information

http://google-watch.org

Yahoo Open Strategy

Y!OS – major internal and external redesign to unify all Yahoo’s services

Owns Flickr, del.icio.us, Upcoming
“We are building social into everything we do”
Offers more control over what is shared
Easier to set up small social networks
Will open some search technology to developers and users (http://developer.yahoo.com/search/boss/)

Yahoo and the Semantic Web

Will begin to include certain metadata embedded in web pages as search and ranking elements
Dublin Core, hCard, Creative Commons, hCalendar, RDF, hReview, GeoRSS, hAtom

Will support Open Search specification allowing crawler access to deep web resources (!!!)

Yahoo Pipes - pipes.yahoo.com

Users can combine, filter and display any RSS content

Finished “pipes” can be shared and embedded in other web pages
e.g. a pipe for RSS feeds from educational blogs filtering for technology, physics or any other keywords

Version available for the iPhone: iphone.pipes.yahoo.com
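The combine-and-filter idea behind a pipe can be sketched with Python's standard XML parser. This does not use the Pipes service itself: the two inline feeds, their titles and URLs are invented, and the `pipe` function is a hypothetical name for the example.

```python
import xml.etree.ElementTree as ET

# Two invented RSS 2.0 feeds standing in for educational blogs.
FEED_A = """<rss version="2.0"><channel><title>Blog A</title>
<item><title>Physics lab goes online</title><link>http://a.example/1</link></item>
<item><title>Cafeteria menu</title><link>http://a.example/2</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Blog B</title>
<item><title>Classroom technology grants</title><link>http://b.example/1</link></item>
</channel></rss>"""


def items(feed_xml):
    """Yield (title, link) for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        yield item.findtext("title", default=""), item.findtext("link", default="")


def pipe(feeds, keywords):
    """Combine several feeds and keep items whose title mentions any keyword."""
    wanted = [k.lower() for k in keywords]
    return [
        (title, link)
        for feed in feeds
        for title, link in items(feed)
        if any(k in title.lower() for k in wanted)
    ]


filtered = pipe([FEED_A, FEED_B], ["technology", "physics"])
```

The result merges both feeds into one list and drops the item that mentions neither keyword.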

MSN’s Live.com

Database increasing
Simpler Interface (4/08)
“Rich Answers” blended results
Image search enhancements
filter:face filter:portrait filter:bw
NLP question processing improved
Live Search Books and Search Academic ended 5/08

New & Notable at Ask

The Butler is gone!
Teoma is in his place!
Smart Search
Web Answers
Zoom
Superior Mapping Tools

Gigablast

Maintains unique database
Offers advanced search features
“Freshness dating limit” estimates the date that a particular page was first published or most recently edited or modified

Custom Topic Search of Gigablast – up to 500 domains (www.gigablast.com/cts.html)

Future directions in search

Social Search will continue to grow and adopt spam-prevention measures

Personalization will increase as searchers trade privacy for enhanced, customized results and alerting services

Yahoo’s new social-based service will combine Web 2.0 capabilities with a robust search engine

Open Source development of NLP, data and text mining software will continue to incorporate these capabilities into free services

Thank You!

Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
Geneva, NY 14456

(315) 781-3552 [email protected]