Search and the ‘Net in 2009
Trends, Challenges and Cutting-Edge Developments in Internet Search
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
For Rochester Regional Library Council Member Libraries’ Staff
Sponsored by the
Rochester Regional Library Council
Supported by Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library 2008
For today . . .
Landscape of Search in 2009
Web 2.0 and Social Search
Web 3.0 – What, Where and What it can do
The Year at Google
New Services
Recent Developments at Established Services
Possible Directions for Search in the Future
Linklist
http://people.hws.edu/hunter/searchnet09.htm
How large is the Web?
Internet Systems Consortium www.isc.org
Web Search in 2008
Who’s crawling the Web?
Google
Yahoo
Live Search (MSN)
Ask (owns Teoma)
Gigablast
Size Estimates 7/9/08
Google and Yahoo! text filetypes, in millions
[Bar chart: html and htm files; axis from 0 to 8,000 million]
Size Estimates 7/9/08
Google and Yahoo! text filetypes, in millions
[Bar chart: pdf, doc, xls, ppt and xml/rss files; axis from 0 to 700 million]
User Satisfaction
ForeSee Results and U. Michigan 8/14/07
[Bar chart: satisfaction scores for Google, Yahoo! and Ask, ’06 and ’07; axis from 66 to 82]
Web 2.0 and Social Search
Services that
  Let people form online communities
  Facilitate collaboration
  Operate across formats and platforms
Web Search Read-only
CONSUMERS CONTENT
Full-text keyword search
Machine-based ranking
Search Engine
Web 2.0 Search
Read-Write-Create-Describe-Organize
[Diagram: Consumers, Creators and Taggers grouped into Online Communities]
Searching Web 2.0
Enhances results beyond just full-text search
Users’ interactions within “community-based” systems can be used to infer query context and intent
  Return all postings relevant to “steep slopes”
  Return only postings by community members sharing the same interests, e.g. men with heart disease interested in “steep slopes”
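The “steep slopes” contrast above can be sketched in a few lines of Python. This is a minimal illustration, not any service's actual method: the Posting class, the interest labels and the sample postings are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Posting:
    text: str
    # Interests of the posting's author (hypothetical community profile data).
    author_interests: set = field(default_factory=set)

def full_text_search(postings, phrase):
    """Return every posting whose text contains the phrase."""
    phrase = phrase.lower()
    return [p for p in postings if phrase in p.text.lower()]

def community_search(postings, phrase, searcher_interests):
    """Keep only full-text matches whose author shares an interest with the searcher."""
    return [p for p in full_text_search(postings, phrase)
            if p.author_interests & searcher_interests]

postings = [
    Posting("Skiing steep slopes in Vermont", {"skiing"}),
    Posting("Walking steep slopes after bypass surgery", {"heart disease"}),
    Posting("Roof repair on shallow pitches", {"home repair"}),
]

all_hits = full_text_search(postings, "steep slopes")
community_hits = community_search(postings, "steep slopes", {"heart disease"})
```

Here all_hits holds both “steep slopes” postings, while community_hits keeps only the one written by a member who shares the searcher's interest.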
Web 2.0 Applications
Flickr – Share and tag photos
YouTube – Share, tag, comment and subscribe to video content
Del.icio.us – Share, tag, store and organize web content
Yahoo Answers – Ask, answer, rate, comment on questions
Social Search Engines 2004-
Incorporate human-generated metadata in retrieval and ranking of search results
  Tags – from creators and readers/viewers
  Comments, ratings, edits, deletions
  Searcher-supplied interest profiles
Online community-based search engines
  Eurekster, Mahalo, Wikia Search, Delver, me.dium.com, Stumbleupon, Thagoo (meta)
Potential Advantages
At least one human has selected and endorsed each result
Readers generate most tags, not authors
Reduces impact of link spam in ranking
User input keeps results more current and relevant
Concerns
User-generated spam
Content vandalism
Unique searches benefit less
  “Long tail” effect
Wikiasari: Quick rummaging search
“User ranked results”
Open source SE by Jimmy Wales and Amazon
Initial results ordered with algorithms a la Google
Users reorder results, which will be used in ranking of future similar searches
Edits allowed on all search results
Strength is in general search topics
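One way user reordering could feed back into later rankings is sketched below. This is an assumption-laden illustration, not the service's actual algorithm: the FeedbackRanker class, its vote arithmetic and the example URLs are hypothetical, and “similar searches” is simplified to the same query after lowercasing.

```python
from collections import defaultdict

class FeedbackRanker:
    """Blend an algorithmic score with votes gathered from user reordering."""

    def __init__(self):
        # query -> url -> accumulated boost from users (hypothetical storage)
        self.votes = defaultdict(lambda: defaultdict(int))

    def record_reorder(self, query, url, positions_moved_up):
        """Remember that a user moved this result up (or down, if negative)."""
        self.votes[query.lower()][url] += positions_moved_up

    def rank(self, query, algorithmic_results):
        """algorithmic_results is a list of (url, score); higher scores rank first."""
        boosts = self.votes[query.lower()]
        ordered = sorted(algorithmic_results,
                         key=lambda r: r[1] + boosts[r[0]],
                         reverse=True)
        return [url for url, _ in ordered]

ranker = FeedbackRanker()
results = [("a.example", 3.0), ("b.example", 2.0), ("c.example", 1.0)]

before = ranker.rank("hiking", results)
ranker.record_reorder("Hiking", "c.example", 3)   # a user drags the last result to the top
after = ranker.rank("hiking", results)
```

The first ranking follows the algorithmic scores alone; after one user's reordering, the promoted result leads for the next searcher who runs the same query.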
Web 3.0 – The Brave New Web
Web-based applications or services that combine metadata and artificial intelligence to provide a more productive and intuitive user experience
First used by John Markoff of the New York Times in 2006
Key Components of Web 3.0
Expanded from Wikipedia “Web 3.0” (Nova Spivack) http://en.wikipedia.org/wiki/Web_3.0
Semantic Web – Embedded Metadata
Intelligent Applications
Human Judgment and Analysis
Expanded Network Computing
Open Technologies – Open Source
Distributed Databases
Key Components of Web 3.0
Embedded metadata (Semantic Web)
  Dublin Core
  Creative Commons
  RDF
  GeoRSS
  hCard
  hCalendar
  hReview
  hAtom
<title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
<meta name="DC.title" content="Expressing Qualified Dublin Core in HTML/XHTML meta and link elements" />
<meta name="DC.description" content="This document is most recent version of Expressing Dublin Core in HTML/XHTML meta and link elements." />
<meta name="DC.creator" content="Powell, Andy" />
<meta name="DC.publisher" content="Dublin Core Metadata Initiative" />
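A search service that wants to use embedded metadata like the Dublin Core elements above first has to read it out of the page. The sketch below shows one possible way using only Python's standard html.parser module; the DublinCoreParser class is a hypothetical helper, and the sample page is a shortened copy of the markup above.

```python
from html.parser import HTMLParser

class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> elements into a dictionary."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.startswith("DC."):
            self.fields[name[3:]] = attr.get("content") or ""

page = (
    '<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />'
    '<meta name="DC.creator" content="Powell, Andy" />'
    '<meta name="DC.publisher" content="Dublin Core Metadata Initiative" />'
)

parser = DublinCoreParser()
parser.feed(page)
dc = parser.fields
```

After parsing, dc maps each element name (creator, publisher) to its content, ready to be used as a search or ranking field.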
Key Components of Web 3.0
Intelligent applications
  Natural Language Processing (NLP)
  Information extraction
  Machine learning
  Data mining
Natural Language Processing (NLP)
With NLP software, unstructured text and data can be processed to reveal degrees of meaning by
  Extracting terms identified as significant
  Summarizing content
  Discovering relationships among terms and groups of terms
HOW???
NLP Extraction
Take all articles from a group of pharmaceutical journals published in one year (the “corpus”)
Extraction – Run a relevant controlled vocabulary (list of all known drugs) against the corpus
NLP Extraction
Drugs found, number of occurrences and location in the corpus plus a list of possible drugs not in the controlled vocabulary
86>penicillin click for locations
124>tetracycline click for locations
213>aspirin click for locations
Are these also drugs? XXX, XXX, XXX
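The extraction step can be sketched as follows. This is a toy illustration under stated assumptions: the two-article corpus and three-drug vocabulary are invented, and the suffix list used to flag possible new drugs is a hypothetical heuristic, not a description of how any real NLP package finds candidates.

```python
import re
from collections import defaultdict

def extract_terms(corpus, vocabulary):
    """Map each vocabulary term found to its (article id, word offset) locations."""
    vocab = {term.lower() for term in vocabulary}
    locations = defaultdict(list)
    for doc_id, text in corpus.items():
        for offset, word in enumerate(re.findall(r"[a-z]+", text.lower())):
            if word in vocab:
                locations[word].append((doc_id, offset))
    return dict(locations)

def candidate_terms(corpus, vocabulary, suffixes=("cillin", "cycline", "mycin")):
    """Flag words that look like drug names but are missing from the vocabulary.
    The suffix list is an assumed heuristic for illustration only."""
    vocab = {term.lower() for term in vocabulary}
    words = {w for text in corpus.values()
             for w in re.findall(r"[a-z]+", text.lower())}
    return sorted(w for w in words if w.endswith(suffixes) and w not in vocab)

corpus = {
    "art1": "Penicillin and aspirin were compared. Aspirin was cheaper.",
    "art2": "Tetracycline or amoxicillin may replace penicillin.",
}
vocabulary = ["penicillin", "tetracycline", "aspirin"]

found = extract_terms(corpus, vocabulary)
counts = {term: len(places) for term, places in found.items()}
maybe_drugs = candidate_terms(corpus, vocabulary)
```

counts gives the number of occurrences per drug, found gives the locations behind each “click for locations” link, and maybe_drugs plays the role of the “Are these also drugs?” list.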
NLP Summarization
Retain phrases surrounding the extracted term(s) with links to locations in the corpus (KWIC Index)
rare uses of penicillin
Often penicillin is contraindicated when
responds well to penicillin
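A keyword-in-context (KWIC) index of this kind can be built with a short function. The sketch below is a minimal version; the kwic function name, the window size and the sample sentence are assumptions made for illustration.

```python
import re

def kwic(corpus, term, window=3):
    """Return (article id, word offset, left context, keyword, right context) rows."""
    term = term.lower()
    rows = []
    for doc_id, text in corpus.items():
        words = re.findall(r"\w+", text)
        for i, word in enumerate(words):
            if word.lower() == term:
                left = " ".join(words[max(0, i - window):i])
                right = " ".join(words[i + 1:i + 1 + window])
                rows.append((doc_id, i, left, word, right))
    return rows

corpus = {"art1": "Often penicillin is contraindicated when allergy is present."}
rows = kwic(corpus, "penicillin", window=2)
```

Each row keeps the phrase around the extracted term together with its location, so a display can link the snippet back to the article.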
NLP Summarization
Tag all words in the corpus with their grammatical function and search for noun – verb – noun and other syntactic patterns
(drug A) treats (disease B)
(drug C) causes (disease B)
(drug D) is contraindicated in (disease B)
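The tag-then-match idea can be shown with a toy example. Real systems use trained part-of-speech taggers; here a small hand-made lexicon stands in for one, and the LEXICON table, function names and sample sentence are all hypothetical.

```python
import re

# Hypothetical hand-made lexicon standing in for a real part-of-speech tagger.
LEXICON = {
    "penicillin": "NOUN", "aspirin": "NOUN",
    "pneumonia": "NOUN", "ulcers": "NOUN",
    "treats": "VERB", "causes": "VERB",
}

def tag(sentence):
    """Label every word with its grammatical function (OTHER if unknown)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [(word, LEXICON.get(word, "OTHER")) for word in words]

def noun_verb_noun(sentence):
    """Find runs of three consecutive words tagged NOUN, VERB, NOUN."""
    tagged = tag(sentence)
    triples = []
    for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        if (t1, t2, t3) == ("NOUN", "VERB", "NOUN"):
            triples.append((w1, w2, w3))
    return triples

relations = noun_verb_noun("Penicillin treats pneumonia, but aspirin causes ulcers.")
```

relations then holds one (drug, verb, disease) triple per match, which is the raw material for statements such as “(drug A) treats (disease B)”.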
NLP Term Relationship
Queries answered by tracking references across sentences
Can penicillin cause shock?
“Penicillin treatment is not without risks. In certain cases it can trigger anaphylactic shock.”
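The penicillin example depends on recognizing that “it” in the second sentence refers to penicillin in the first. A deliberately naive sketch of that step follows: it replaces “it” with the most recently mentioned known term and then checks whether drug and effect appear in the same sentence. Both function names are hypothetical, and co-occurrence is only a rough stand-in for the relationship analysis a real NLP system would perform.

```python
import re

def resolve_pronouns(text, entities):
    """Replace 'it' with the most recently mentioned entity (very naive)."""
    resolved, last = [], None
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        out = []
        for word in sentence.split():
            bare = word.strip(".,;:!?").lower()
            if bare in entities:
                last = bare
            elif bare == "it" and last:
                word = re.sub(r"(?i)\bit\b", last, word)
            out.append(word)
        resolved.append(" ".join(out))
    return resolved

def mentioned_together(text, drug, effect):
    """True if drug and effect share a sentence once pronouns are resolved."""
    return any(drug in s.lower() and effect in s.lower()
               for s in resolve_pronouns(text, {drug}))

text = ("Penicillin treatment is not without risks. "
        "In certain cases it can trigger anaphylactic shock.")

sentences = resolve_pronouns(text, {"penicillin"})
answer = mentioned_together(text, "penicillin", "shock")
```

Without the pronoun step, no single sentence mentions both penicillin and shock, so a sentence-by-sentence keyword match would miss the connection.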
NLP can do even more …
Word disambiguation
  bank (river)
  bank (finances)
  bank (verb)
Retrieval of alternative word forms
Retrieval of variants in capitalization and spelling
Topic detection and tracking
Following different themes in a changing RSS feed
Machine translation
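Word disambiguation can be illustrated with the “bank” example: choose the sense whose typical neighboring words overlap most with the sentence. This is a simplified sketch in the spirit of dictionary-overlap methods; the SENSES table of cue words is invented for the example and far smaller than anything a real system would use.

```python
import re

# Hypothetical cue words for each sense of "bank".
SENSES = {
    "bank (river)": {"river", "water", "shore", "fishing"},
    "bank (finances)": {"money", "loan", "account", "deposit"},
    "bank (verb)": {"plane", "pilot", "turn", "left"},
}

def disambiguate(sentence, senses=SENSES):
    """Pick the sense sharing the most cue words with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return max(senses, key=lambda sense: len(words & senses[sense]))

money_sense = disambiguate("She opened an account at the bank to deposit money")
river_sense = disambiguate("We went fishing from the river bank")
```

When no cue word matches, this sketch simply falls back to the first sense listed, which a production system would handle more carefully.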
Key Components of Web 3.0
Human Judgment and Analysis
  Web 2.0 - Selection, tagging, rating, comment
Expanded Network Computing
  Distributed Computing
  Cloud Computing
  Grid Computing
  Interoperability
Key Components of Web 3.0
Open Technologies
  Open Source
  Creative Commons
  Open Archives Initiative
Distributed Databases
  Structured data records in reusable and searchable formats
  Standardized query language – SPARQL technology
  RDF structured databases
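RDF stores data as subject-predicate-object triples, and a SPARQL query is essentially a pattern with blanks to fill in. The sketch below imitates that idea in plain Python rather than using a real RDF library; the TRIPLES data, the match function and the ":treats" property shown in the comment are all hypothetical.

```python
# Each record is a (subject, predicate, object) triple, as in RDF.
TRIPLES = [
    ("penicillin", "treats", "pneumonia"),
    ("penicillin", "type", "antibiotic"),
    ("aspirin", "treats", "headache"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples fitting the pattern; None behaves like a query variable."""
    return [
        (s, p, o) for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# Roughly what this SPARQL query asks for (illustrative only):
#   SELECT ?drug ?disease WHERE { ?drug :treats ?disease }
treatments = [(s, o) for s, _, o in match(TRIPLES, predicate="treats")]

# A pattern with two fixed parts: what does penicillin treat?
penicillin_treats = [o for _, _, o in match(TRIPLES, subject="penicillin", predicate="treats")]
```

Because every source uses the same triple shape, records from different databases can be merged into one list and queried with the same patterns.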
Semantic Search Systems
Understand the user’s query
Understand Web text
Bring these together for query results that are contextually relevant
Algorithms that match the meanings and not just the words
  Natural Language Processing
  Concept Mapping
[Diagram: what feeds into Web 3.0]
Expanding processing capabilities: XML, NLP, Machine learning, Data mining, Open source
Web 2.0 (Human judgment): Tags, Comments, Ratings, Online Communities
Dublin Core, RDF
Web 3.0
Semantic Search: Headup, Twine, Mahalo, The new Yahoo and more ...
New Services
Kosmix www.kosmix.com
Google interface
Offers overview of results by document type
  Basic Facts
  Reviews & Opinions
  Media
  People & Community
  Shopping
  News
Extensive clustering by subject
Blended results with thumbnails of images, video and audio clips, presentations and reports
Human-created “topic pages” for subjects of current interest
VideoCrawler
Video meta engine by AT&T
Searches over 1,600 online video sources by
  Media type
  Rating
  Date of creation
  Popularity
ChaCha www.chacha.com
Free mobile search service
Requires a (free) account
Text your questions and a human “guide” sends back an answer, limited to 160 characters
Supported by 98% of mobile providers
Highlights of this year at Google
Universal Search
Results from G.’s verticals blended into Web results and ranked together (Google as Portal?)
Launched 2007
  Web Search
  News
  Book Search
  Video
  Images
  Blog Search
  Local/Maps
  Product Search
Under the Hood
G “knows” 1 trillion web-based items (8/13/08)
Not indexed:
  data records
  calendar pages
  other auto-generated content
  duplicates
Supplemental Index incorporated into main index (1/3/08)
Results Ranking
Beyond Word Frequency, Links & PageRank
Algorithms incorporating
Language – interpreting phrases, synonyms, diacritics, spelling mistakes, etc.
Query – language used to express the search
Time – returning pages with freshness appropriate to the query
Personalization – Not all people want the same set of results
Personalization for each searcher
New ranking process based on
  Wording of the query
  Text relationships of the pages retrieved
  Location
  Recent searches executed at that computer
Functions with or without a Google account
“Focusing on the user’s intent and location”
On-screen message: “Customized based on recent search activity”
Launched mid-November ‘08
Evaluating Search
Automated evaluations every minute
Monitoring users’ click behavior (anonymous or personalized via Google account)
Google quality raters – hired to execute and evaluate specific queries
  Vital
  Useful
  Relevant
  Not Relevant
  Off-topic
Google Quality Raters Guidelines 2007
http://www.mauriziopetrone.com/blog/wp-content/uploads/quality-rater-guidelines-2007.pdf
Over 450 improvements launched in 2007
New features and collections
SearchWiki - customize search by re-ranking, deleting, adding, and commenting on search results – requires free account
“Voice Search” - mobile app for iPhone allows queries to be spoken then run in Google
Life Photo Archive hosted in Image Search
  Images from 1790 to today, most never before published
  Searchable by category and decade
  http://images.google.com/hosted/life or
  Add source:life to any term(s) in Image Search
Google Books
2 divisions: Partner (Publisher) Program and Library Project
Partner Program
  Publishers authorize G. to scan and make searchable the full text of their books
  Users see only the page containing their search terms
  Link to purchase copy
  No download possible
Google Books
2 divisions: Partner Program and Library Project
Library Project (2004)
  Scan and make searchable millions of books
  For works in copyright, users see only a few sentences around search terms
  Users may browse full text of public domain works
NOTE: Not possible to print ANY material from either Google Print project
Current Member Libraries
U. of California
Harvard
U. of Michigan
NYPL
Oxford
Stanford
U. of Virginia
U. of Wisconsin
U. of Texas (Austin)
U. Complutense (Madrid)
Princeton
Bayerische Staatsbibliothek
Copyright and the World of Books
The set of ALL books that are in the public domain
The set of ALL books that are copyrighted and in print (mostly in the Partner Program)
ALL other books are still in copyright, but out of print
Google Books in the Future
G. will sell electronic access to in-copyright, out-of-print works, with permission of publisher and author
  37% of revenue to Google
  63% of revenue to publisher and author
In-print, in-copyright works scanned through the Library project will be full-text searchable. No portion of these works will be shown unless they are also part of the Partner Program
Google Books in the Future
Public domain works will continue to be fully available as before
Individuals and institutions may purchase full online access to all Google Book titles
“Public and university libraries” will be able to offer free access to out-of-print, in-copyright titles through “designated terminals”.
http://books.google.com/agreement
Google’s Custom Search www.google.com/coop/cse/
A tool from the Google Coop initiative
Keywords chosen determine content and weighting of results (limit of 100 characters)
Search
  Entire web
  Your selected sites only
  Entire web with selected sites emphasized
Within Coop, a CSE can be created and maintained collaboratively
Stored or Linked versions available
The Latest at Established Services
Why do I need more than Google?
The Google effect --
  the single most powerful force in today’s Internet
  a private profit-driven company owns more information on individuals’ search behavior, companies and organizations than any other entity
Why do I need more than Google?
Great potential for misuse/abuse of this information for financial gain
Societies seldom leave basic services (utilities, medical and traffic regulation) totally to the “free market”
Is web search now a “basic service” ???
Search dominance ---
Potential skewing for commercial, political or social purposes
  Database composition
  Ranking
  Privacy
No single search engine can crawl the whole web
Limits search features, results display, consumer and shopping information
http://google-watch.org
Yahoo Open Strategy
Y!OS – major internal and external redesign to unify all Yahoo’s services
Owns Flickr, del.icio.us, Upcoming
“We are building social into everything we do”
Offers more control over what is shared
Easier to set up small social networks
Will open some search technology to developers and users (http://developer.yahoo.com/search/boss/)
Yahoo and the Semantic Web
Will begin to include certain metadata embedded in web pages as search and ranking elements
  Dublin Core
  Creative Commons
  RDF
  GeoRSS
  hCard
  hCalendar
  hReview
  hAtom
Will support Open Search specification allowing crawler access to deep web resources (!!!)
Yahoo Pipes - pipes.yahoo.com
Users can combine, filter and display any RSS content
Finished “pipes” can be shared and embedded in other web pages
  e.g. a pipe for RSS feeds from educational blogs, filtering for technology, physics or any other keywords
Version available for the iPhone: iphone.pipes.yahoo.com
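The combine-and-filter idea behind a pipe can be sketched with Python's standard XML parser. This does not use Yahoo Pipes itself; the two sample feeds, the example.org links and the function names are invented for illustration, and a real version would fetch the feeds over the network.

```python
import xml.etree.ElementTree as ET

def feed_items(rss_xml):
    """Return (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("title", ""), item.findtext("link", ""))
            for item in root.iter("item")]

def pipe(feeds, keywords):
    """Combine several feeds, keeping items whose title mentions any keyword."""
    keywords = [k.lower() for k in keywords]
    combined = [item for rss in feeds for item in feed_items(rss)]
    return [(title, link) for title, link in combined
            if any(k in title.lower() for k in keywords)]

# Hypothetical sample feeds standing in for two educational blogs.
FEED_A = """<rss version="2.0"><channel><title>Blog A</title>
<item><title>Physics labs online</title><link>http://example.org/a1</link></item>
<item><title>Campus news</title><link>http://example.org/a2</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Blog B</title>
<item><title>Classroom technology tips</title><link>http://example.org/b1</link></item>
</channel></rss>"""

filtered = pipe([FEED_A, FEED_B], ["technology", "physics"])
```

filtered keeps the physics and technology items from both blogs and drops the unrelated campus news item.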
MSN’s Live.com
Database increasing
Simpler Interface (4/08)
“Rich Answers” blended results
Image search enhancements
  filter:face
  filter:portrait
  filter:bw
NLP question processing improved
Live Search Books and Live Search Academic ended 5/08
New & Notable at Ask
The Butler is gone! Teoma is in his place!
Smart Search
Web Answers
Zoom
Superior Mapping Tools
Gigablast
Maintains unique database
Offers advanced search features
“Freshness dating limit” estimates the date that a particular page was first published or most recently edited or modified
Custom Topic Search of Gigablast – up to 500 domains (www.gigablast.com/cts.html)
Future directions in search
Social Search will continue to grow and adopt spam-prevention measures
Personalization will increase as searchers trade privacy for enhanced, customized results and alerting services
Yahoo’s new social-based service will combine Web 2.0 capabilities with a robust search engine
Open Source development of NLP, data and text mining software will continue to incorporate these capabilities into free services
Thank You!
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3552 [email protected]