Trier, March 12, 2001
http://www.zib.de/[email protected]
Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)
Martin Grötschel
On the Road to Scientific Information Portals: Cooperative Digital Libraries
Remarks, Visions, Proposals
IuK 2001, Universität Trier
Konrad-Zuse-Zentrum für Informationstechnik Berlin Martin Grötschel
Contents
Introduction
I. All Information is Part of the Web
   Can we make this true?
II. The Visible Web and the Deep Web
III. There could be an Interconnected Network of Science
IV. Integrating All Types of Resources
V. We should Organize the Cyber Space
VI. To the Benefit of our Society
Personal Motivation
• I have broad interests.
• I (have to) search a lot.
• I do find things I look for.
• However, this process costs too much time and money.
• The "scientific information system" could be much better.
• It seems that some scientists have to get involved.
• The situation is similar with respect to communication.
Acting Forces
• Science drives Technology
• Technology drives Change
• Change induces Pressure
Some Consequences:
• Higher Speed and Efficiency
• Lower Costs
• Universal Connectivity
• More and Global Competition
What does this imply for Science?
The World of Information
• Tons of Printed Material
Zillions
• of Scientific Web Sites
• of E-Journals, E-Prints
• of Databases and CD-ROMs
• of Multimedia Documents
• of E-Mail
• of Digital Photos and Videos
• etc.
The Players
• The Author
• The Publisher
• The Librarian
• The Software Developer
• The Service Provider
• The Scientific Information Center
• The Scientific Society
• etc.
the user
Some Unsolved Issues
• Accessibility
• Searchability
• Stability
• Compatibility
• Pricing
• Heterogeneity
• Diversity and Complexity of Structures
• Quality
• Authenticity
• etc.
Solution
• Scientists have to get involved
• Solution must be user driven
• Cooperation of players
• Consensus about structures
Some Suggestions in this Talk
Contents
I. All Information is Part of the Web
   Can we make this true?
Current Mathematical Resources
• Papers and Preprints
• Journals and Books
• Reviews and Abstracts
• Software and Data Collections
• Projects and Persons
• Voice, Images, and Video Information
• Links, Mail, and Virtual Libraries
Math Papers and Preprints
• Preprints of the Math-Net
• MPRESS (including ArXiv math, ...)
• EULER
• Digital Library @ ACM
Math Journals and Books
• SUB Göttingen ("Sondersammelgebiet", i.e. special subject collection)
• TIB Hannover (Tech Information Library)
• ELib @ Uni Osnabrück
• EMIS
• Springer LINK
• DOCUMENTA MATHEMATICA
• Lehmanns.de
Math Reviews and Abstracts
• MATH @ Zentralblatt
• MathSci @ AMS
• MATHDI @ FIZ-Karlsruhe
• Jahrbuch der Mathematik
Math Software and Data Collections
• Netlib @ ANL
• eLib @ ZIB
• MuPad @ Uni Paderborn
• Algebraic Groups
• Cinderella
• OpenMath
Projects and Persons
• Web Sites of Math Research Institutes
• Web Sites of Math Departments
• BerNAM
• Directory of Mathematicians @ ACM
• Combined Membership List of AMS, SIAM, MAA
• PERSONA MATHEMATICA @ math-net.de
• SIGMA @ math-net.de
Voice, Images, and Video
• Computer Museum
• MSRI Video Server
• Electronic Geometric Models
Application Servers and Software
• MATHEMATICA
• Cinderella
• Inverse Calculator
Links, Mail, and Virtual Libraries
• mathematik.de
• Math-Net.de
• Mathematical Archives
• Opt-Net @ ZIB
• MathML
There are zillions of Math Resources in the Net.
The Situation is Similar in all other Sciences
How do you know that all this material exists and where it is?
Old Approach: Link Lists = WWW Virtual Libraries
But much more has come up in recent years!
Is Everything in the Web?
• Printed Books
• Printed Journals
• CD-ROMs
• Some Data Bases
• Historic Archives
• Catalog Cards
• ...
are not electronically available
Is Everything from the Web in the Web?
Contents
I. All Information is Part of the Web
   Can we make this true?
II. The Visible Web and the Deep Web
The Invisible / Deep Web
A fundamental Problem with Search Engines: A Vast Amount of Information is Invisible
• Surface Web / Web Robots
  - Start at some "Hubs"
  - Interlinked Web Pages
• Deep Web
  - Isolated Web Sites
  - There are huge Isolated Islands in the Web
  - Information within Databases, behind CGI Interfaces
  - Information without Links (e.g. within OPACs of Libraries)
  - Protected Material, Excluded Explicitly
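The distinction between the surface web and the deep web can be illustrated with a toy crawler. The following is only a sketch under stated assumptions: the page names and the link graph are hypothetical, and a real web robot is far more involved. It shows that a robot starting at a hub and following links never reaches pages that nothing links to.

```python
from collections import deque

# Hypothetical toy web: page -> pages it links to (all names are made up).
LINKS = {
    "hub": ["dept", "journal"],
    "dept": ["preprints"],
    "journal": [],
    "preprints": [],
    "opac-record": [],                    # only reachable through a query form
    "isolated-site": ["isolated-page"],   # an island no hub links to
    "isolated-page": [],
}


def crawl(links, hubs):
    """Return all pages reachable from the hubs by following links (breadth-first)."""
    seen = set(hubs)
    queue = deque(hubs)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


visible = crawl(LINKS, ["hub"])   # what the robot can index
deep = set(LINKS) - visible       # what stays invisible to it
print(sorted(visible))
print(sorted(deep))
```

In this toy graph the OPAC record and the isolated site end up in `deep`, which mirrors the two cases named on the slide: information without links and isolated islands.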
A Web Search Engine Collecting Visible Information
From "The Deep Web: Surfacing Hidden Value", BrightPlanet.com, Jan. 2000
A Direct Meta Search Engine Fishing for Invisible Information
From "The Deep Web: Surfacing Hidden Value", BrightPlanet.com, Jan. 2000
Characteristics of the Deep Web
- in Comparison to the Visible Web -
• Public information on the Deep Web is currently 400 to 500 times larger than the commonly defined World Wide Web
• The Deep Web holds 7,500 terabytes of information (550 billion individual documents), compared to 19 terabytes (1 billion documents) on the surface Web
From "The Deep Web: Surfacing Hidden Value", BrightPlanet.com, Jan. 2000
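As a quick plausibility check, the size factor can be recomputed from the terabyte and document figures quoted on this slide. The snippet is only a sketch; the variable names are illustrative, and the numbers are those cited from the BrightPlanet report.

```python
# Figures as quoted on the slide (from the BrightPlanet report).
deep_tb, surface_tb = 7_500, 19          # terabytes
deep_docs, surface_docs = 550e9, 1e9     # individual documents

size_ratio = deep_tb / surface_tb        # ratio by data volume
doc_ratio = deep_docs / surface_docs     # ratio by document count

print(f"{size_ratio:.0f}x by volume, {doc_ratio:.0f}x by document count")
# prints "395x by volume, 550x by document count"
```

Both ratios are of the same order of magnitude as the quoted factor; the volume ratio sits just below its lower end and the document ratio slightly above its upper end.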
Characteristics of the Deep Web
- in Comparison to the Visible Web -
• More than 100,000 Deep Web sites currently exist
• 60 of the largest Deep Web sites collectively contain about 750 terabytes of information (... narrower, with deeper content)
• More than half of the Deep Web content resides in topic-specific databases (BrightPlanet concentrates on about 20,000 sites)
• A full 95% of the Deep Web is publicly accessible information – not subject to fees or subscriptions
• The Deep Web is the largest growing category of new information on the Internet, but it is still largely unknown.
From "The Deep Web: Surfacing Hidden Value", BrightPlanet.com, Jan. 2000
Making the Deep Web Visible
Technology:
• Meta Search Engines
• Bibliographic Meta Search Engines
• Virtual Catalogs and Link Lists
Organisational Issues:
• Building Networks of Digital Libraries
• Forming Library and other Cooperatives
• Working on Standards and Formats (Common, Open, Metadata, ...)
Categories of Information Systems
• Web Sites – Collection, Query Interface
• Publications – E-Journals, Preprints, ...
• Regional/Nat. Collections – Harvesting Systems
• Topical Databases – Subject Specific Aggregation
• OPACs – Library Holdings
• Journal Archives – Archive of Publishers
• Software/Data Collection – Commercial / Public Archive
• Compute Servers – Math. Calculations / Demos
• Mailing Lists/Archive – Topical Communication Forum
• Topical Portals – Wide Spectrum Information System
Problems: Wide Variety of Servers
Problems with Search Engines (Web Robots)
• Impose High Load on Servers and Networks
• Misuse of Metadata
• Robots can't see behind CGI Interfaces
• Access Rights, Range of Licenses
Problems with Cascading Search Engines
• Diversity of data formats (MAB, MARC Formats, DC, ...)
• Multitude of protocols (Z39.50, HTTP, proprietary)
Specialized Repositories and Archives
• Scientific Journals provided by Commercial Publishers
• Document Delivery Systems and Specialized Historic Archives
• Maps, Music, Photos, Videos, Multimedia
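The diversity of data formats is one reason why cascading search is hard: results from different targets have to be mapped onto a common schema before they can be merged. The following sketch assumes heavily simplified records (plain dictionaries, not real MARC or Dublin Core serializations); the function names and the sample data are hypothetical. It uses MARC tags 245 (title) and 100 (main author) and the Dublin Core elements `title` and `creator`.

```python
def from_marc(record):
    """Map a simplified MARC-like record (tag -> value) to the common schema."""
    return {"title": record.get("245", ""), "creator": record.get("100", "")}


def from_dublin_core(record):
    """Map a simplified Dublin Core record (element -> value) to the common schema."""
    return {"title": record.get("title", ""), "creator": record.get("creator", "")}


# One converter per supported input format (illustrative, not exhaustive).
CONVERTERS = {"marc": from_marc, "dc": from_dublin_core}


def normalize(fmt, record):
    """Convert a record in the given format; reject formats we cannot handle."""
    try:
        convert = CONVERTERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return convert(record)


# Hypothetical sample records describing the same item in two formats.
marc_hit = normalize("marc", {"245": "Example Title", "100": "Doe, Jane"})
dc_hit = normalize("dc", {"title": "Example Title", "creator": "Doe, Jane"})
print(marc_hit == dc_hit)
```

Once both hits share one schema, duplicates from different catalogs can be detected and merged; every additional format (MAB, UNIMARC, ...) would need its own converter.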
Contents
I. All Information is Part of the Web
   Can we make this true?
II. The Visible Web and the Deep Web
III. There could be an Interconnected Network of Science
Virtual/Digital Library
• Virtual
  - Search index
  - Links
  - Metadata
  - OPAC catalog entries
• Digital
  - Structured digital contents
  - Full texts
  - Data bases
Towards a Scientific Portal to Interconnect the Digital World
[Diagram labels: Virtual Library; Digital Library; Information Portal: Cooperative Virtual Digital Scientific Library]
The Scientific Portal (Information Portal for the Sciences) is an Entry Point to all Types of Information Products from the Sciences.
Behind the Scientific Portal is a Structured Network to be coordinated and organized by the Sciences in a cooperative way.
A Task for the IuK Initiative?
Lots of Examples already exist
An Example in the Making
Virtuelle Fachbibliothek Technik (Virtual Subject Library for Engineering) of TIB Hannover
Example: The DOE Information Bridge
• Started in 1997 with 60,000 searchable full text reports online @ DOE Office of Scientific and Technical Information (OSTI)
• Direct Search based on the Distributed Explorer developed by a small Internet Company: Innovative Web Application Ltd. (IWA)
• A public version in partnership with the Government Printing Office (GPO) of the USA
• Many other Federal Deep Web collections added to the DOE Virtual Library
  - PubScience
  - PubMed
  - NTIS Electronic Catalog (450,000 Titles)
  - NASA Technical Report Server
• Energy Portal Search
• Digitization efforts for Gray Literature (@ OSTI)
OSTI Virtual Library
PubScience
The GrayLit Information Network
Graphic from "Searching The Deep Web", W.L. Warnick et al., D-Lib Magazine, Vol. 7, No. 1, January 2001; www.dlib.org
Preprint Network
DOE OSTI
Energy Portal Search
PubMed
NASA Image Exchange
Federal R & D Architecture
Graphic from "Searching The Deep Web", W.L. Warnick et al., D-Lib Magazine, Vol. 7, No. 1, January 2001; www.dlib.org
An Observation
The Voluntary Work contributed so far was and will stay important.
There will, however, be no satisfactory solution without substantial investment in personnel and funding.
We need to become more professional; compare, e.g., Google with Math-Net.
Contents
I. All Information is Part of the Web
   Can we make this true?
II. The Visible Web and the Deep Web
III. There could be an Interconnected Network of Science
IV. Integrating All Types of Resources
Distributed Meta Search Engines Exist
What they do:
• Query Search Engines, OPACs, Databases
• Perform Distributed Searches in Parallel
• Cascade Search to reach Large/Vast Amounts of Targets
• Deliver Links, Metadata, and/or Full Texts
• Handle a Diversity of Data Structures
• Use a Multitude of Internet/Web Protocols
• Structure Heterogeneous/Large Result Sets
They Rely on a Series of Small Configuration Files
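How a meta search engine can fan a query out to several targets in parallel, driven by small per-target configuration entries, can be sketched as follows. This is a minimal illustration under stated assumptions, not the design of any of the systems named in this talk: the target names and title lists are hypothetical, and the "remote" sources are in-memory stand-ins so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for remote catalogs and databases.
CATALOG_A = ["Graph Theory", "Linear Optimization"]
PREPRINTS_B = ["Optimization on Graphs", "Number Theory Notes"]


def make_searcher(titles):
    """Build a search function doing case-insensitive substring matching."""
    def search(query):
        q = query.lower()
        return [t for t in titles if q in t.lower()]
    return search


# One small configuration entry per target. The "protocol" field is only a
# label here; a real engine would pick a protocol client based on it.
TARGETS = [
    {"name": "catalog-a", "protocol": "Z39.50", "search": make_searcher(CATALOG_A)},
    {"name": "preprints-b", "protocol": "HTTP", "search": make_searcher(PREPRINTS_B)},
]


def meta_search(query, targets):
    """Send the query to all targets in parallel and collect hits per target."""
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = {t["name"]: pool.submit(t["search"], query) for t in targets}
        return {name: future.result() for name, future in futures.items()}


results = meta_search("optimization", TARGETS)
print(results)
```

Adding a new target then means adding one configuration entry rather than changing the search logic, which is the appeal of the configuration-file approach mentioned above.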
Combination of Search Engines: Integration of Information Offers
[Diagram: combinations of shared index (SI) and distributed search (DS) components, accessed from browsers, Windows GUIs, and Z-Clients via HTTP and Z39.50. Labeled systems and formats: Aleph (MAB, USMARC); DigiBib ("Dublin Core"); DigiBib+WebPack, Euler, {Aleph} (DC, MAB, USMARC); AltaVista (HTML); Math-Net (Harvest+DC); "Web Ferret" (HTML?) over AltaVista, HotBot, InfoSeek; Scout (UNIMARC, MAB2) over KOBV, GBV, DDB]
Systems discussed:
• Math-Net: Harvest+DC
• KOBV Search Engine
• EULER and Dublin Core
• DigiBib NRW
Approaches:
• Shared Index
• Distributed Search
As studied by J. Lügger in "Über Suchmaschinen, Verbünde und die Integration von Informationsangeboten", ABI-Technik, June 2000
A Potential Math Information Portal
[Diagram: a browser reaches shared index (SI) and distributed search (DS) components via HTTP; these connect DigiBib with KOBV, DigiBib with WebPack, and DigiBib with Math-Net, using Z39.50 with UNIMARC and Z39.50 with MAB2/USMARC]
Sources shown in the diagram:
• Math-Net @ ZIB and @ Uni Köln: Sigma, NetLib Software, Persona Mathematica
• EMS @ Zentralblatt für Mathematik: MATH, MATHDI, Jahrbuch für Mathematik
• Universität Osnabrück: ELib, MPRESS
• Special Interest Groups of DMV: OPT-NET, IM-Net, IuK, ...
• Publishers and Software Houses: E-Journals, Software
• SUB Göttingen: OPAC SSG Mathematik
• TIB Hannover: TIB CAT
• CWI Amsterdam: OPAC Mathematics
• Mathematics Departments & Institutes: Specialized OPACs
• Library Cooperatives: BVB, GBV, HBZ, KOBV, ...
• Die Deutsche Bibliothek: Authority Data
• Publishers and Math Societies: Math-Journals and -Documents
Properties:
• Open
• Distributed
• Efficient
• Scalable
• Stable
Contents
I. All Information is Part of the Web
   Can we make this true?
II. The Visible Web and the Deep Web
III. There could be an Interconnected Network of Science
IV. Integrating All Types of Resources
V. We should Organize the Cyber Space
   Scientists should Organize the Scientific Cyberspace Cooperatively (Summary and Proposals)
Organizing the Cyberspace: Suggestions
• Partners for the information portal?
• Who should form the information portals?
• Organizational framework?
Cooperative Digital Libraries
Main Issues: Sustainability and Finance
Partners of the Information Portal
• Scientific Libraries, Scientific Archives
• Scientific Departments, Research Institutes
• Database / Content Providers
• Document Delivery Services
• Digitization Centers
• Scientific Societies
• Publishers
• Software Houses
• Data (Collecting) Centers
Suggestions for an Information Portal
• Open Digital Archives of Specialized Collections
• Scientific Suppliers Obtain Free Access
• High Quality Information and Services
• Robust/Commercial Software/Database
• Distributed/Heterogeneous Architecture
• Some Centralization is Necessary Too
• Emphasis on Reliable/Long Term Availability
• Activities in Long Term Archival
• Supported by a Specialized Information Center/Library
• Cooperation with Scientific Societies
Not-for-Profit and For-Profit do not exclude each other.
Suggestions for an Organizational Framework
• University Level (local)
  - Cooperation of University Library, University Computing Center, and University Media Center
• Scientific Level (topical/national)
  - Specialized Library / Information Center
  - Consulted by a Scientific Society
  - Editorial Topical Competence Center
• National Level
  - National Competence Center for New Technologies
  - Research and Development for Production
  - Consultation
  - Standardization / Coordination Activities
A Topical Competence Center may be hosted at a Research Institute.
Key Problems
• No progress without substantial investment
• Long term sustainability
• No progress without further research and development
• Institutionalization (The IuK-Initiative can literally initiate, but can't run the show)
But the show must go on!
Contents
I. All Information is Part of the Web
   Can we make this true?
II. The Visible Web and the Deep Web
III. There could be an Interconnected Network of Science
IV. Integrating All Types of Resources
V. We should Organize the Cyber Space
VI. To the Benefit of our Society
Who Will Benefit
• Student: Access to Vast Amount of Materials
• Employee: Further Training, Lifelong Learning
• Teacher: Reuse of High Quality Materials
• Author: Publishing Cheap, Fast, and Widely
• Publisher: Open Sources Generate New Chances
• Business: More Profit from Applying Science
• Citizen: Contacting Research More Directly
• Science: Communicating with the Public
• Society: Free Flow of Information
The End