Site-wide Search
Upgrade and new features
Jon Warbrick
University of Cambridge Computing Service
Site-wide search
● web-search.cam.ac.uk
● Ultraseek, from Infoseek -> Inktomi -> Verity -> Autonomy
● Currently indexing
– ~600 servers
– ~1.2 million documents
– ~2.5 million URLs
Site-wide search
● Indexes 'more-or-less official' servers
● Maintains two indexes
– 'internal' and 'external'
– automatically routes queries
● Services for University Webmasters
– Add/delete/re-index
– Packaged searches
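The slide does not say how queries are routed between the 'internal' and 'external' indexes. One plausible approach, assumed here purely for illustration, is to choose the index from the client's IP address. The function name `choose_index` and the network in `INTERNAL_NETWORKS` are hypothetical placeholders, not details of the real service.

```python
import ipaddress

# Hypothetical placeholder: a documentation-only range standing in for
# whatever networks the real service treats as "inside the University".
INTERNAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def choose_index(client_addr: str) -> str:
    """Return 'internal' for clients on an internal network, else 'external'."""
    try:
        addr = ipaddress.ip_address(client_addr)
    except ValueError:
        # An unparseable address gets the more restrictive external index.
        return "external"
    if any(addr in net for net in INTERNAL_NETWORKS):
        return "internal"
    return "external"
```

Defaulting to 'external' whenever the address cannot be classified keeps internal-only results from leaking to unknown clients.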
2006 Upgrade
● Improved resilience
● Case-inSenSITIVE matching
● Quick Links
● Passage-based summaries
● Grouping by location
● [ All terms matching ]
2006 Upgrade
● More indexing (dynamic pages + https + JavaScript)
● Sources of indexing requests
– s1.web-search.cam.ac.uk to s6.web-search.cam.ac.uk
– an address in the range 192.153.213.0 to 192.153.213.255
● Backup search engines
– Add URL, Revisit Site, etc.
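A webmaster who wants to recognise indexing requests in server logs or access rules could test the client address against the range given above. This is a minimal sketch: the range 192.153.213.0 to 192.153.213.255 comes from the slide, while the function name `is_search_crawler` is made up for illustration.

```python
import ipaddress

# The range 192.153.213.0-255 from the slide, written as a /24 network.
CRAWLER_NETWORK = ipaddress.ip_network("192.153.213.0/24")


def is_search_crawler(remote_addr: str) -> bool:
    """Return True if remote_addr lies in the indexing range on the slide."""
    try:
        return ipaddress.ip_address(remote_addr) in CRAWLER_NETWORK
    except ValueError:
        # Not a parseable IP address, so it cannot be in the range.
        return False
```

For example, `is_search_crawler("192.153.213.17")` is true and `is_search_crawler("192.153.214.1")` is false.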
Problems with dynamic content
● Randomly permuted query arguments
● Gratuitously-varying detail
● Variant pages
● Calendars linking to other pages
● Cache-busting headers
● Frames hiding real URL
● Junk path info
● 'Success' error pages
● Lack of Last Modification time stamp
● Inconsistent URLs
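Randomly permuted query arguments make one page look like many different URLs to an indexer. A site can avoid this by always emitting links with the arguments in a fixed order. The sketch below shows one way to do that; the function name `canonical_url` and the example.org URLs are hypothetical, and it addresses only the argument-ordering problem, not the other items in the list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def canonical_url(url: str) -> str:
    """Return url with its query arguments sorted into a stable order."""
    parts = urlsplit(url)
    # keep_blank_values so that an argument such as "a=" is not dropped.
    args = parse_qsl(parts.query, keep_blank_values=True)
    # Sorting the (name, value) pairs gives every permutation the same form.
    query = urlencode(sorted(args))
    return parts._replace(query=query).geturl()
```

With this, `http://www.example.org/list?b=2&a=1` and `http://www.example.org/list?a=1&b=2` both become the second form, so the indexer sees a single URL.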
Further information
● Notes for webmasters:
http://www.cam.ac.uk/cs/web-search/
● Details of recent changes:
http://www.cam.ac.uk/cs/web-search/changes-200608.html
● Help and advice:
If you have been, thanks for listening
I wonder if anyone will ask...
“Why don't you use Google?”