1 Searching JACo PDF files on the web Pascal Le Roux JACo Team Meeting Thoiry, France, 18-19 February 2002


1

Searching JACoW PDF files on the web

Pascal Le Roux
JACoW Team Meeting

Thoiry, France, 18-19 February 2002

2

Status of the CERN JACoW Site

• The CERN Joint Accelerator Conference Web site is hosted on the CERN central web servers (a pool of ≈ 10 machines running Windows 2000 Server + SP2 + (x thousand) patches, with the Microsoft Internet Information Services 5.0 web server).

• 10 conferences are published on this site:
– 4 PAC (1995, 1997, 1999, 2001)
– 3 EPAC (1996, 1997, 2000)
– 1 APAC (1998)
– 1 ICALEPCS (1999)
– 1 LINAC (1998)
⇒ About 8000 PDF files.

• We recently received the CDs from Cyclotrons 2001 and Linac’96 but the PDF files are not yet “JACoW compliant” (files not cropped, no keywords…)

3

A tool is required to search papers!

• The CERN JACoW web site provides a search form which serves as a custom interface to the search engine: http://accelconf.web.cern.ch/AccelConf/top-page.html

4

• Once you click on the Go! button, the form is sent to an ASP script that parses the fields, formats the query string, and redirects it to the CERN global search engine. The query looks like this:

http://search.cern.ch/query.html?col=cern&qp=&qt=%2Burl%3Aaccelconf+-url%3Aabstract+-site%3Aaps.anl.gov+%2Bdoctype%3Apdf+%2Btitle%3Amagnet&qs=&qc=cern&pw=600&ws=0&qm=0&st=1&nh=10&lk=1&rf=0&rq=0

• This customized query string restricts the search to PDF files published on the JACoW site and specifies where (in which hidden field) to search for the words entered by the user.
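
As a rough illustration of what the script does, here is a minimal Python sketch (the real script is ASP, and its code is not shown in this talk). It rebuilds the query string shown above for a search on the word "magnet" in the title field. The function name build_query is hypothetical. Reading "+" as "required" and "-" as "excluded" is an assumption about the engine's query syntax, and the other parameters (pw, ws, qm, ...) are copied verbatim without interpretation.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

SEARCH_URL = "http://search.cern.ch/query.html"

# Fixed restrictions copied from the query shown above. Assumed meaning:
# "+" marks a required term, "-" an excluded one.
BASE_TERMS = ["+url:accelconf", "-url:abstract", "-site:aps.anl.gov", "+doctype:pdf"]


def build_query(field: str, word: str) -> str:
    """Return a search URL that requires `word` in the given PDF field."""
    terms = BASE_TERMS + [f"+{field}:{word}"]
    # Parameter names, values and order are taken verbatim from the example URL.
    params = [
        ("col", "cern"), ("qp", ""), ("qt", " ".join(terms)), ("qs", ""),
        ("qc", "cern"), ("pw", "600"), ("ws", "0"), ("qm", "0"), ("st", "1"),
        ("nh", "10"), ("lk", "1"), ("rf", "0"), ("rq", "0"),
    ]
    return SEARCH_URL + "?" + urlencode(params)


url = build_query("title", "magnet")
print(url)

# Decoding the qt parameter gives back the readable query terms.
print(parse_qs(urlsplit(url).query)["qt"][0])
```

Searching another hidden field would only change the last term, for example build_query("author", "smith").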

5

• Once the engine gets a bunch of matches, it sorts them according to a relevance ranking or by date before sending back the customized result page.
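
The two sort orders can be pictured with a small Python sketch. This is only an illustration, not the engine's actual code: the Hit class, the score scale and the three sample hits are invented placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Hit:
    title: str
    score: float     # relevance score from the engine (hypothetical 0..1 scale)
    modified: date   # date of the PDF file


def sort_hits(hits, by="relevance"):
    """Return hits in display order: best score first, or newest first."""
    if by == "relevance":
        return sorted(hits, key=lambda h: h.score, reverse=True)
    if by == "date":
        return sorted(hits, key=lambda h: h.modified, reverse=True)
    raise ValueError(f"unknown sort order: {by}")


# Placeholder hits, invented for the example.
hits = [
    Hit("Paper A", 0.62, date(2001, 6, 18)),
    Hit("Paper B", 0.91, date(1999, 3, 29)),
    Hit("Paper C", 0.40, date(2000, 6, 26)),
]
print([h.title for h in sort_hits(hits, "relevance")])
print([h.title for h in sort_hits(hits, "date")])
```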

6

CERN Global Search Engine

• Since 1997, CERN has used the Infoseek Ultraseek search engine, running on a Sun Ultra 1 with SunOS 5.6.

• In 2000, Inktomi acquired Infoseek Corporation. Inktomi is a leader in the web-wide search market, providing results for major sites such as:
– MSN Search, Yahoo, Oracle, IBM…
– and… Fermi National Accelerator Laboratory

• In November 2001, CERN upgraded its search engine from Ultraseek 4.08 to Inktomi Enterprise Search 4.2.

7

• Product changes
The main product changes are bug and security fixes, cosmetic changes for users, support for direct indexing of Oracle and other ODBC-compliant databases, indexing of NTFS file sources, and improvements in international support.

• Platform and performance
The search engine now runs on a PC with dual 500 MHz CPUs, 1 GB of RAM and a 70 GB SCSI drive, under Windows 2000 Server + SP2 (it is also available for Sun and Linux). This platform:
– indexes the CERN Intranet (approximately 1 million documents)
– re-indexes every 3 to 7 days
– answers about 1000 queries per day
– handles peaks of up to 200 queries / hour

8

• Specifications
– Inktomi Enterprise Search supports:
  • HTML, XML, text, RTF, MS Office, PDF (search in hidden fields and full-text search), PostScript, FrameMaker, Lotus, WordPerfect
  • English, French, German, Spanish, Portuguese, Italian, Dutch, Swedish, Norwegian, Danish, Finnish, Chinese and Japanese
– In addition to full-text indexing of PDF files, the engine can also index PDF metadata (our hidden fields: title, subject, author, keywords). Search results are therefore more accurate than with a simple full-text search.
– The search result page provides:
  • Result titles linked to the PDF documents
  • Smart summaries
  • Path and size of the PDF file
– The results can be sorted by date or by relevance ranking.

• Comments from the staff who installed the search engine
“CERN has not done any evaluation since 1997, except for Microsoft SharePoint (2001), which was not suited to CERN’s needs, but we can recommend Inktomi as it requires little work and gives reasonable results.”
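
To see why searching the hidden fields listed in the specifications tends to be more accurate than full text, here is a toy Python sketch. It is not how Inktomi works internally. The three documents, their metadata and the search function are invented for illustration.

```python
# Toy documents: titles, keywords and body text are invented for the example.
DOCS = [
    {"title": "Design of a superconducting magnet", "keywords": "magnet, quench",
     "text": "We describe the magnet design and first quench tests."},
    {"title": "Beam diagnostics overview", "keywords": "diagnostics",
     "text": "The monitor sits downstream of the last magnet in the line."},
    {"title": "RF cavity tuning", "keywords": "cavity, RF",
     "text": "Tuning procedure for the cavities."},
]


def search(docs, word, field="text"):
    """Return titles of documents whose `field` contains `word` (case-insensitive)."""
    word = word.lower()
    return [d["title"] for d in docs if word in d[field].lower()]


print(search(DOCS, "magnet", field="text"))   # full-text search: two hits
print(search(DOCS, "magnet", field="title"))  # title field only: one hit
```

The full-text search also returns the diagnostics paper, which only mentions a magnet in passing. The title search keeps only the paper that is actually about magnets.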

9

• Price
$2,995 for 1-3,000 pages, $7,495 for up to 10,000 pages.
But CERN IT people told me: “We had a nice price from Inktomi. I cannot tell you how much… This was our main reason to purchase this product as the IT budget is small…”

10

Is there a local alternative to Inktomi Enterprise Search?

• Hundreds of other search services/products are available on the market.

• But they do not always suit PDF searches. Some tools cannot index the text contained in the PDF hidden fields.

11

Local search tool or remote search service?

• Local search tool
This is the solution described previously. You need:
– The search engine software.
– A powerful machine dedicated to this indexing and search service.
– An administrator who takes care of the system 24 hours a day.

CERN selected Inktomi mainly because it was offered a very attractive price for such a product.

But of course, many other products are available on the market.

Since I didn’t evaluate any products, I can’t rate them without serious testing. I can only give you a list of leading products according to articles found on the web…

12

| Product | Price | Platform supported | Specifications |
| --- | --- | --- | --- |
| AltaVista Enterprise Search | $15,000 for smaller companies to millions for large corporations!! | Windows NT, Windows 2000, Tru64 UNIX, HP/UX, Solaris, Linux | Handles over 200 file formats, including XML, PDF, PostScript, MS Office; supports about 30 languages; can index ≈ 10000 files / hour |
| Google Search Appliance | $20,000 for 1x rack mountable box (150,000 documents) | Google-specific Linux on supplied hardware | |
| Inktomi Enterprise Search | $2,995 for 1-3,000 pages; $7,495 to 10,000 pages | Windows NT/2000; Unix: Solaris 2.5 and above, Linux, HP-UX 11.0 | |
| Microsoft Index Server + Adobe PDF IFilter 5.0 | ≈ Free: integrated with Microsoft Internet Information Server and the Windows NT® Server 4.0; free Adobe PDF IFilter 5.0 | Windows NT (Server only, not Workstation), Windows 2000 | Adobe PDF IFilter 5.0 extends the search capabilities of MIS by indexing all the hidden fields |
| PDF WebSearch (based on dtSearch) | $7,500 | Windows 95/98/NT4/2000 | A search engine specially designed for PDF |
| Elan Web Search | ? | Windows NT 4 / 2000 + Microsoft Internet Information Server | Optimized to support the searching of PDF hidden fields + 16 more custom fields |

More exhaustive list at: http://www.searchtools.com/info/pdf.html

13

• Remote search services
In this case, you just have to sign up for one of the various search services available online. Some of them are free, completely supported by advertising.

Advantages
– You don’t have to worry about the work involved in setting up a search engine.
– No expensive software to buy.
– No machine to maintain.
– No technician to pay to look after the service.
– Remote search engines work just as well as local ones.

Drawbacks
You don’t have as much control:
– On the indexing process. You do not know how often your site is indexed. (It can sometimes take many weeks with free services…)
– On the search engine accessibility and response time.
– On the design of the search result pages (advertising…).
If you pay for the service and have a lot of pages to index, a local search solution can be considerably cheaper.

14

| Product | Price | Indexing frequency | Comments |
| --- | --- | --- | --- |
| Atomz Enterprise | $10,000 per year and up, depending on the number of domains and pages | Weekly and on demand | No advertising, just an Atomz logo; 15 languages supported; indexes and searches hidden fields in PDF |
| Google | Free with Google logo and limited customization; paid version offers many more options... | Google controls scheduling (≈ 1 month for free version) | |
| FreeFind Enterprise | Free with advertising; $79 per month for 5,000 pages | Daily for paid version | PDF indexing available only for paid version |

For a more exhaustive list, have a look at: http://www.searchtools.com/info/pdf.html

15

Example of remote search service using the Google web-wide search engine

• Since our CERN web servers are indexed by the Google web-wide search engine, I’ve duplicated the JACoW search form to test Google.

• In the free version of Google, you can’t create precise queries using the title and keywords fields. You can only perform full-text searches or author field searches.

• But you can restrict the search to a given domain (http://accelconf.web.cern.ch/AccelConf/) and a given file type (PDF), to search only the PDF files located on our JACoW site.
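
One way to express this restriction is with Google's `site:` and `filetype:` query operators, as in the Python sketch below. This is an assumption: the talk does not say how the duplicated form passes the restriction to Google. The function name google_pdf_query and the search URL shape are illustrative only.

```python
from urllib.parse import urlencode


def google_pdf_query(words: str, site: str = "accelconf.web.cern.ch") -> str:
    """Build a Google search URL limited to PDF files on the given site.

    Assumes the public `site:` and `filetype:` operators; the actual test
    form may have used other parameters.
    """
    q = f"{words} site:{site} filetype:pdf"
    return "http://www.google.com/search?" + urlencode({"q": q})


print(google_pdf_query("magnet"))
```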

16

• The result page is quite similar to the Inktomi one, with an interesting feature: the possibility of getting an HTML version of the PDF.

• The PAC 2001 papers, which were added to the site in mid-January, are not yet indexed by Google! (Like a few EPAC 2000 papers…) (It took Inktomi 3 days to index them.)

17

My Conclusions

• We (the JACoW team at CERN) don’t have to worry about the search engine tool. An administrator has installed and upgraded the system for us, and keeps the machine and the software up 24 hours a day…

• Indexing is done quite often (every 7 days at most).

• The only things to do were to create the HTML form and the ASP script and, of course, upload all the files to a web server.

• Since November 2001 (when the search engine was upgraded), we have received about 3600 hits on the JACoW search form.

• We have never received any complaints from the users of the CERN instance. (Yes, this doesn’t mean that the service is fine…)

• I don’t think that the CERN JACoW site needs another search engine. This service is sufficient.

• It could be used at FNAL since they already have the same search engine… ;-)