Bux, Mühl: Accessing the Deep Web: A Survey. Ulf Leser: Text Analytics, lab course (Praktikum), summer semester 2008
Lecture: Text Analytics
„Accessing the Deep Web: A Survey“
Marc Bux, Tobias Mühl
„Accessing the Deep Web: A Survey“, 2007
by Bin He, Mitesh Patel, Zhen Zhang, Kevin Chen-Chuan Chang
Computer Science Department, University of Illinois at Urbana-Champaign
The „Deep Web“
Web content that is not indexed by search engines.
„While the surface Web has linked billions of static HTML pages, it is believed that a far more significant amount of information is 'hidden' in the deep Web, behind the query forms of searchable databases [...]. Such information may not be accessible through static URL links.“
„Accessing the Deep Web“, He, Patel, Zhang, Chang
The „Deep Web“
Dynamically generated pages (forms, user input)
Login-protected pages
Context-dependent pages
Multimedia pages (e.g. Flash)
The 2000 study
How large is the „Deep Web“?
approx. 43,000 to 96,000 Web sites
approx. 7,500 TB of data
approx. 500 times larger than the „Surface Web“
The 2000 study
Problems:
Limited to extrapolations of the size of the „Deep Web“
Uses „overlap analysis“
The 2007 study
IP sampling method
2,230,124,544 possible IP addresses
Take 1,000,000 of them at random as a representative sample
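A minimal Python sketch of drawing such a sample. The slides do not say how the 2,230,124,544 valid addresses were enumerated, so the `is_global` filter from the standard `ipaddress` module is only an assumption standing in for that step, and `sample_ips` is a hypothetical helper name.

```python
import ipaddress
import random


def sample_ips(n, seed=None):
    """Draw n distinct IPv4 addresses uniformly from the accepted ranges."""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n:
        # Pick a random 32-bit integer and interpret it as an IPv4 address.
        candidate = ipaddress.IPv4Address(rng.getrandbits(32))
        # Assumption: is_global approximates "valid" by rejecting private,
        # loopback and other reserved ranges; multicast is excluded explicitly.
        if candidate.is_global and not candidate.is_multicast:
            chosen.add(candidate)
    return sorted(chosen)


sample = sample_ips(5, seed=42)
```

Rejection sampling keeps the draw uniform over whatever set of addresses the filter accepts, so swapping in a different definition of "valid" only changes the `if` condition.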
IP sampling method
Technique:
Send HTTP requests to 1,000,000 IPs (GNU tool: wget)
Download and analyze the Web pages
Identify „Deep Web sites“
IP sampling method
Identify „Deep Web sites“
„Web server that provides information maintained in one or more
backend Web databases“
The databases are accessed through a form
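The slides do not describe how forms were recognized in the downloaded pages. As a rough illustration only, the sketch below flags a page when some form contains a text-like control and no password field. This heuristic, and the names `QueryFormFinder` and `has_query_form`, are assumptions and not the survey's procedure.

```python
from html.parser import HTMLParser


class QueryFormFinder(HTMLParser):
    """Collects, for each <form>, the types of the input controls inside it."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of control types per form
        self._current = None   # controls of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
        elif self._current is not None:
            if tag == "input":
                # Inputs without a type attribute default to "text".
                self._current.append((attrs.get("type") or "text").lower())
            elif tag in ("select", "textarea"):
                self._current.append(tag)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def has_query_form(html):
    """Heuristic: some form has a text-like control and no password field."""
    finder = QueryFormFinder()
    finder.feed(html)
    finder.close()
    text_like = {"text", "search", "select", "textarea"}
    return any(
        bool(text_like.intersection(controls)) and "password" not in controls
        for controls in finder.forms
    )
```

A heuristic this crude cannot tell a database query form from a site search box or a registration form, so its hits would still need manual inspection.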
IP sampling method
Problems:
„Virtual hosting“
Not all kinds of „Deep Web sites“ are covered
Entrance to the Deep Web
● Entrance is a query interface, as opposed to forms for login, polling,
registration, message posting and site search
● Depth is the number of operations to get from the root
page to the query interface
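Depth in this sense is a shortest-path length, which a breadth-first search computes. The Python sketch below works on a hypothetical in-memory link graph; `links`, the page names and `interface_depth` are made up for illustration, and counting each followed link as one operation is an assumption.

```python
from collections import deque


def interface_depth(links, root, has_interface, max_depth=10):
    """Return the smallest number of link hops from root to a page with a
    query interface, or None if none is reachable within max_depth."""
    queue = deque([(root, 0)])
    seen = {root}
    while queue:
        page, depth = queue.popleft()
        if page in has_interface:
            return depth
        if depth == max_depth:
            continue  # do not expand pages at the crawl limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return None


# Hypothetical site: the search form sits two hops below the root page.
links = {
    "/": ["/about", "/catalog"],
    "/catalog": ["/catalog/search"],
}
print(interface_depth(links, "/", {"/catalog/search"}))  # prints 2
```

Because breadth-first search visits pages in order of increasing distance, the first page with an interface that it reaches gives the minimum depth.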
Entrance to the Deep Web
● Methods:
− 100,000 of the 1,000,000 sampled IPs
crawled to depth 10
● Findings:
− 94% of the Web databases
appeared within depth 3
− Query interfaces are located at shallow depths
Scale of the Deep Web
● Methods:
− All 1,000,000 sampled IPs crawled to depth 3
− Depth 3 is sufficient since Deep Web content is located at shallow depths
● Findings:
− 2256 Web Servers found in total
− 126 Deep Web sites with 190 Web databases and 406 query
interfaces found
Scale of the Deep Web
● Extrapolation:
− 190 * (2,230,124,544 / 1,000,000) / 0.94 ≈
450,000 databases (the division by 0.94 corrects for the 6% of databases lying deeper than depth 3)
− In a similar way, 307,000 Deep Web sites and
1,258,000 query interfaces have been estimated
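The database estimate can be reproduced with a few lines of Python. The constants are taken from the slides; the function name `extrapolate` is a hypothetical choice.

```python
VALID_IPS = 2_230_124_544   # size of the sampled address space (from the slides)
SAMPLED_IPS = 1_000_000     # number of IPs actually probed
FOUND_DATABASES = 190       # Web databases found in the sample
DEPTH3_COVERAGE = 0.94      # share of databases found within depth 3


def extrapolate(found, coverage):
    """Scale a sample count to the full address space and correct for
    items that a depth-3 crawl misses."""
    return found * (VALID_IPS / SAMPLED_IPS) / coverage


estimate = extrapolate(FOUND_DATABASES, DEPTH3_COVERAGE)
print(f"{estimate:,.0f}")  # close to the 450,000 given on the slide
```

Plugging the 126 sites or 406 query interfaces into the same formula with 0.94 gives roughly 299,000 and 963,000, not the quoted 307,000 and 1,258,000. Those estimates presumably use their own depth-3 coverage fractions, which the slides do not state.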
Structure of the Deep Web
● Structured Data – relationally represented in the form
of attribute-value pairs (e.g. books on Amazon.com)
● Unstructured Data – no specific order
(e.g. CNN's recent news)
● Surface Web is mostly unstructured (HTML text)
Structure of the Deep Web
● Methods:
− Manual querying and inspection of the 190 found
databases
● Findings:
− 43 unstructured and 147 structured databases
● Extrapolation:
− Data in the deep Web is mostly structured (3.4:1 ratio)
Subject Diversity of the Deep Web
● Surface Web consists of >80% commerce sites
● Methods:
− Manual categorization of the 190 found databases
− Taxonomy: the 14 top-level categories of Yahoo.com
● Findings:
− Large diversity of subjects
− Even distribution between commercial and
noncommercial Web databases
[Bar chart: Distribution of databases over subject category. Categories: Business & Economy, Computers & Internet, News & Media, Entertainment, Recreation & Sports, Health, Government, Regional, Society & Culture, Education, Arts & Humanities, Science, Reference, Others. Axis: share of databases, 0% to 30%.]
Search engines
How well do Google and others index the Deep Web?
20 „Deep Web sites“
Search with Google, Yahoo and MSN
Search engines
Searching the Deep Web: Deep Web directories
● Online portal services supporting Deep Web database access
− Sort Web databases into different categories
− Enable online search in their categorized databases
Searching the Deep Web: Deep Web directories
● Examples and their number of categorized databases:
− www.completeplanet.com (70,000+)
− www.lii.org (14,000)
− www.turbo10.com (2,300)
− www.invisibleweb.net (1,000)
Searching the Deep Web: Deep Web directories
● Overall coverage is poor (<20%), considering that there are an estimated
450,000 Web databases
● Deep Web grows too fast to allow manual categorization
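The "<20%" figure can be checked against the directory sizes listed for the four example portals. A short Python sketch, with "70.000+" taken as 70,000 (so the first value is a lower bound):

```python
ESTIMATED_DATABASES = 450_000  # extrapolated total from the survey

# Database counts as listed on the slides.
directories = {
    "completeplanet.com": 70_000,
    "lii.org": 14_000,
    "turbo10.com": 2_300,
    "invisibleweb.net": 1_000,
}

coverage = {
    name: count / ESTIMATED_DATABASES for name, count in directories.items()
}
for name, share in coverage.items():
    print(f"{name:20s} {share:6.1%}")
```

Even the largest directory stays below one sixth of the estimated total, and the others cover only a few percent or less.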
Searching the Deep Web: Future Search Engines
● Traditional Search Engines fail in the Deep Web
− Limitation of crawling (automated search and extraction)
− Databases updated too frequently to be indexed properly
− Search Engines can't exploit the Databases' structure
Searching the Deep Web: Future Search Engines
● Better idea: two-tiered Search Engine
● Discovery: automated search for Web databases suiting the query
− Realized by crawling and indexing the databases' query
interfaces
− No information about the databases' internal data is used
Searching the Deep Web: Future Search Engines
● Forwarding: database-specific search in the discovered databases
− Using the databases' query interfaces and internal structure
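The two tiers can be sketched as two small Python functions. Everything here is hypothetical: the keyword-overlap matching in `discover`, the toy `interface_index`, and the lambda functions standing in for real form submissions are assumptions chosen for illustration, not a design from the survey.

```python
def discover(query, interface_index):
    """Tier 1: pick databases whose query-interface terms overlap the query."""
    words = set(query.lower().split())
    return [name for name, terms in interface_index.items() if words & terms]


def forward(query, names, databases):
    """Tier 2: run the query against each discovered database's own search."""
    return {name: databases[name](query) for name in names}


# Hypothetical index built only from crawled query interfaces (form labels).
interface_index = {
    "bookstore": {"title", "author", "isbn", "book"},
    "flights": {"departure", "arrival", "flight", "airline"},
}

# Hypothetical stand-ins for submitting a query through each site's form.
databases = {
    "bookstore": lambda q: [f"book result for '{q}'"],
    "flights": lambda q: [f"flight result for '{q}'"],
}

query = "book by author Knuth"
hits = forward(query, discover(query, interface_index), databases)
print(hits)
```

The discovery tier needs only the interface descriptions, so it can be indexed once, while the forwarding tier always queries the live databases and therefore sees their current content.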