Exploring the Academic Invisible Web (Das wissenschaftliche Invisible Web erkunden). Dr. Dirk Lewandowski, Heinrich-Heine-Universität Düsseldorf, Information Science. Research done in collaboration with Philipp Mayr, Bonn.


Page 1: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Exploring the Academic Invisible Web
Das wissenschaftliche Invisible Web erkunden

Dr. Dirk Lewandowski
Heinrich-Heine-Universität Düsseldorf, Information Science

Research done in collaboration with Philipp Mayr, Bonn

Page 2: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Agenda

1. Introduction
2. The (Academic) Invisible Web defined
3. The size of the (Academic) Invisible Web
4. AIW relevant to...
5. Opening the AIW – different models

Page 3: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

1 Introduction

• Users expect their search services to be comprehensive and integrated.

• Currency and completeness are important factors in research.

Page 4: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

2 The Invisible Web defined

Definitions for Invisible/Deep Web

• “Text pages, files, or other often high-quality authoritative information available via the World Wide Web that general-purpose search engines cannot, due to technical limitations, or will not, due to deliberate choice, add to their indices of Web pages” (Sherman and Price 2001).

• “The deep Web - those pages do not exist until they are created dynamically as the result of a specific search” (Bergman 2001).

Page 5: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Type of Invisible Web content, and why it is invisible

• Disconnected page: No links for crawlers to find the page

• Pages consisting primarily of images, audio, or video: Insufficient text for the search engine to "understand" what the page is about

• Pages consisting primarily of PDF or Postscript, Flash, Shockwave, executables (programs) or compressed files (.zip, .tar, etc.): Technically indexable, but usually ignored, primarily for business or policy reasons

• Content in relational databases: Crawlers can't fill out required fields in interactive forms

• Real-time content: Ephemeral data; huge quantities; rapidly changing information

• Dynamically generated content: Customized content is irrelevant for most searchers; fear of "spider traps"

Source: Sherman and Price 2001

Page 6: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

From the Invisible Web to the Academic Invisible Web

• Nowadays, the Invisible Web (IW) problem is mainly a problem of database content.

• For the academic sector, sources from the surface Web are relevant as well as sources from the Invisible Web.

• The Academic Invisible Web (AIW) consists of the databases relevant to academia.

• Or narrower: The AIW consists of the databases that libraries should index (using search engine technology).

Page 7: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

3 The size of the Invisible Web

Page 8: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Bergman's calculation

• Average size of IW databases:
  – 5.43 million documents (mean)
  – 4,950 documents (median)

• Total size: 100,000 databases × 5.43 million documents = a total of 543 billion documents.

• Size of the surface Web: 1 billion documents (2001).

• The Invisible/Deep Web is 550 times larger than the surface Web.
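The arithmetic on this slide can be retraced in a few lines. The following is a minimal sketch using only the figures quoted above; the variable and function names are illustrative and not taken from Bergman's paper. Multiplying the database count by the median is not a valid estimator of the total either; it is included only to show how sensitive the result is to the choice of "average".

```python
# Figures as quoted on the slide (Bergman 2001).
DATABASE_COUNT = 100_000
MEAN_DOCS_PER_DB = 5_430_000
MEDIAN_DOCS_PER_DB = 4_950
SURFACE_WEB_DOCS = 1_000_000_000  # surface Web size given for 2001


def estimate_total(database_count, docs_per_database):
    """Naive extrapolation: number of databases times a 'typical' size."""
    return database_count * docs_per_database


# Bergman's approach: extrapolate from the mean.
total_from_mean = estimate_total(DATABASE_COUNT, MEAN_DOCS_PER_DB)

# Illustration only: the same extrapolation using the median.
total_from_median = estimate_total(DATABASE_COUNT, MEDIAN_DOCS_PER_DB)

# How many times larger than the surface Web the mean-based total is.
ratio_to_surface = total_from_mean / SURFACE_WEB_DOCS

print(f"Total from mean:   {total_from_mean:,} documents")
print(f"Total from median: {total_from_median:,} documents")
print(f"Mean-based total / surface Web: {ratio_to_surface:.0f}x")
```

With the slide's figures the mean-based total is 543 billion documents, a ratio of 543 to the surface Web, which is close to the quoted factor of 550. The median-based product is roughly a thousand times smaller.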

Page 9: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Bergman's calculation

But:

• Use of the mean, although the distribution of sizes is highly skewed.
  – 5.43 million documents (mean)
  – 4,950 documents (median)

• The top 60 databases contain 85 billion documents (748,504 GB).

• The top 2 databases contain 585,400 GB (>75% of the top 60).
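As a rough check on the skew described above, the two ratios can be computed directly from the slide's figures. This is a sketch for illustration; the constant names are assumptions.

```python
# Figures as quoted on the slide (Bergman's top 60 databases).
TOP60_SIZE_GB = 748_504
TOP2_SIZE_GB = 585_400
MEAN_DOCS_PER_DB = 5_430_000
MEDIAN_DOCS_PER_DB = 4_950

# Share of the top-60 volume held by the two largest databases.
top2_share = TOP2_SIZE_GB / TOP60_SIZE_GB

# In a symmetric distribution mean and median are close; a large
# mean-to-median ratio indicates a few very large outliers.
mean_to_median = MEAN_DOCS_PER_DB / MEDIAN_DOCS_PER_DB

print(f"Top 2 share of top 60: {top2_share:.1%}")
print(f"Mean / median: {mean_to_median:.0f}")
```

The two largest databases account for about 78% of the top-60 volume, and the mean is roughly 1,100 times the median, which is why a mean-based extrapolation is dominated by a handful of very large databases.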

[Figure: Bergman top 60 file sizes. x-axis: the 60 databases; y-axis: size in GB (0 to 400,000).]

Page 10: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Contents of Bergman’s Top 60

Basis: database sizes in GB

• Chart 1: Scientific 90%, other 10%

• Chart 2: Raw data 86%, scientific without raw data 4%, other 10%

Page 11: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Summary Bergman criticism

• Database selection
  – Database types
  – Database content

• Calculation

Page 12: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Size comparison: Gale Directory of Databases

• Contains approx. 16,000 databases (2003); covers all major academic databases.

• Total size estimate for all databases: 18.55 billion documents (includes CD-ROM databases).

• Estimate is based on less than 10 percent of all databases.

• 5 percent of all databases contain >1 million documents, some more than 100 million.

• Some of the databases included in Bergman’s top 60 are missing in Gale.
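To put the Gale figures next to Bergman's, the following sketch derives the average database size implied by the Gale totals and compares the two overall estimates. It uses only numbers quoted on the slides; the database count is approximate and the Gale total includes CD-ROM databases, so the results are indicative at best. The names are illustrative.

```python
# Figures as quoted on the slides.
GALE_DATABASE_COUNT = 16_000          # approximate, 2003
GALE_TOTAL_DOCS = 18_550_000_000      # includes CD-ROM databases
BERGMAN_TOTAL_DOCS = 543_000_000_000  # 100,000 databases x 5.43 million
BERGMAN_MEAN_DOCS = 5_430_000

# Average size per database implied by the Gale totals.
gale_implied_mean = GALE_TOTAL_DOCS / GALE_DATABASE_COUNT

# How much larger Bergman's total is than the Gale-based total.
bergman_to_gale = BERGMAN_TOTAL_DOCS / GALE_TOTAL_DOCS

print(f"Gale implied mean: {gale_implied_mean:,.0f} documents per database")
print(f"Bergman mean:      {BERGMAN_MEAN_DOCS:,} documents per database")
print(f"Bergman total / Gale total: {bergman_to_gale:.1f}x")
```

On these figures the Gale data imply about 1.16 million documents per database, and Bergman's total is roughly 29 times the Gale-based estimate.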

Page 13: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Will the AIW also show an exponential distribution?

[Figure: Dialog file sizes. x-axis: files (1 to 337); y-axis: size in records, linear scale (0 to 200,000,000).]

Page 14: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Will the AIW also show an exponential distribution?

[Figure: Dialog file sizes. x-axis: files (1 to 337); y-axis: size in records, log scale (1 to 1,000,000,000).]
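One way to approach the question on these two slides: if sizes fall off exponentially with rank, the logarithm of the size is a straight line in the rank, so a linear fit on the log scale should explain almost all of the variance. The sketch below assumes file sizes sorted from largest to smallest. It runs on synthetic data generated for illustration, not on the Dialog figures, and the function name and parameters are hypothetical.

```python
import math


def fit_log_linear(sizes):
    """Least-squares fit of ln(size) = intercept + slope * rank.

    `sizes` must be positive numbers ordered by rank (rank 1 first).
    Returns (slope, intercept, r_squared).
    """
    ranks = list(range(1, len(sizes) + 1))
    logs = [math.log(s) for s in sizes]
    n = len(sizes)
    mean_x = sum(ranks) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ranks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ranks, logs))
    syy = sum((y - mean_y) ** 2 for y in logs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic, perfectly exponential rank-size data (NOT the Dialog data):
# 337 files, largest at 1e8 records scaled by a decay rate of 0.05 per rank.
synthetic_sizes = [1e8 * math.exp(-0.05 * rank) for rank in range(1, 338)]

slope, intercept, r_squared = fit_log_linear(synthetic_sizes)
print(f"slope={slope:.4f}, r_squared={r_squared:.4f}")
```

For real file sizes, an r_squared well below 1 or a visibly curved log-scale plot would suggest a different distribution, for example a power law, which is linear only when both axes are logarithmic.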

Page 15: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Conclusion: Size of the Invisible Web

• Bergman's estimate of 550 billion documents is far too high.

• An exact calculation from the distribution of Bergman’s top 60 is not possible.

• The size estimate from the Gale Directory includes databases beyond the Web, but does not include all Web databases.

• The estimate from Gale is probably too low.

Page 16: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

4 AIW relevant for scholars, searchers, librarians, information professionals

Page 17: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

4 AIW relevant for scholars, searchers, librarians, information professionals

• Everything relevant for the scientific process
  – Literature (articles, dissertations, reports, books, …)
  – Data
  – Pure online content (e.g. OA)

• Providers of AIW content
  – Database vendors (metadata) + human indexing
  – Library content (OPACs, collections) + human indexing
  – Publishers' content (full text) + mixed indexing
  – Other repositories

• Much of this material is not necessarily part of the AIW, yet is in fact not covered by the main search engines and tools.

Page 18: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

5 Opening the AIW – different models

• Commercial search engines
  – Google Scholar
  – Scirus

• Libraries & database vendors
  – BASE (Bielefeld Academic Search Engine)
  – Vascoda (integration of library and database collections)

• Open Access repositories
  – Citebase
  – OpenROAR

Page 19: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Conclusion

Page 20: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Summary

• Existing search tools and approaches show potential to make the AIW visible.

• All protagonists should work together:
  – Commercial search engine providers with their machine and financing power
  – Librarians with their experience in collection building and subject access (e.g. thesauri, classification, taxonomies)
  – Publishers and database vendors by opening their collections

Page 21: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Future research

• Building an AIW sample for further tests.

• Better size estimates from this sample.

• Classification of AIW content.

• Distinction between Academic Surface Web and AIW.

Page 22: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

Thank you.

[email protected]/lewandowski

Page 23: Exploring the Academic Invisible Web Das wissenschaftliche Invisible Web erkunden

References

• Bergman, M.K. (2001). The Deep Web: Surfacing Hidden Value. Journal of Electronic Publishing, 7(1).

• Sherman, C., & Price, G. (2001). The Invisible Web: Uncovering Information Sources Search Engines Can't See. Medford, NJ: Information Today.