
Crawling the Web for Structured Documents



Page 1: Crawling the Web for Structured Documents

Crawling the Web for Structured Documents

Julián Urbano, Juan Lloréns, Yorgos Andreadakis and Mónica Marrero

University Carlos III of Madrid · Department of Computer Science

Motivation

Structured Information Retrieval has been gaining a lot of interest recently.

Almost all research is focused just on XML documents, with initiatives like INEX.

But what about other types of documents, like SQL, DTD, Java source code, RDF or UML?

How can we easily gather real-world structured documents off the Web?

And can we use them to develop collections and search engines for specific structured information?


Ask General Purpose Web Search Engines

Follow Link Patterns in Web Repositories

Type   Google (P@20)   Yahoo (P@20)
XML    25M (0.85)      238K (0.8)
DTD    48K (0.95)      48K (1)
XSD    134K (1)        181K (1)
SQL    104K (1)        152K (0.95)
JAVA   3M (1)          1.6M (1)

Are there really that many documents?

Not everything retrieved is actually relevant (e.g. not really SQL)

So we have to develop filters (a sketch follows below), because:
• Query terms may appear only in irrelevant places (e.g. comments)
• There are many problems with MIME types
• File types are hierarchical (an XSD document is also XML)

And the engines return only about the first 1,000 results…
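As an illustration of such a filter, here is a minimal sketch in Java; it is purely hypothetical (the poster's tool is .NET-based and its filters are not shown) and simply rejects a candidate SQL file that defines no table once comments are stripped, so that query terms appearing only in comments do not count:

import java.util.regex.Pattern;

/** Minimal, hypothetical relevance filter for candidate SQL files
 *  (not the poster's actual implementation). */
public class SqlFileFilter {

    // A file only counts as a "real" SQL script if it defines at least one table.
    private static final Pattern CREATE_TABLE =
            Pattern.compile("CREATE\\s+TABLE", Pattern.CASE_INSENSITIVE);

    public static boolean isRelevant(String content) {
        // Strip "--" line comments and "/* ... */" block comments, so that query
        // terms appearing only inside comments do not make the file relevant.
        String withoutComments = content
                .replaceAll("--[^\\n]*", "")
                .replaceAll("(?s)/\\*.*?\\*/", "");
        return CREATE_TABLE.matcher(withoutComments).find();
    }

    public static void main(String[] args) {
        System.out.println(isRelevant("-- create table mentioned only in a comment"));  // false
        System.out.println(isRelevant("CREATE TABLE account (id INT PRIMARY KEY);"));   // true
    }
}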

[Architecture diagram: example keywords (account, bank, deposit) and URLs + additional info flow between the components; each Crawler (+ cfg.) comprises a Scheduler, an HTML Processor (+ cfg.) and file-type Processors (+ cfg.), e.g. SQL and Java in one instance or XML in another, and outputs the downloaded Files.]

How It Works

Built for the Microsoft .NET Framework and the free SQL Server Express

Collaborative, multi-computer and multi-threaded, with hot plug-in

Core detached from the GUI, so it can be used programmatically

New file types and meta-data can be added on the fly with no effort

What do Processors do?

One processor per file type (a hypothetical interface is sketched below):

1. Decide what additional info we want for these files
   (e.g. number of FK definitions, DBMS)

2. Filter files
   (e.g. discard SQL scripts without table definitions)

3. Process files
   (e.g. parse the SQL script and index the table names, fields and relationships)
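The processor API itself is not shown on the poster; the following is a hypothetical sketch, written in Java (the actual tool is .NET-based), of an interface mirroring the three responsibilities above. All names are assumptions:

import java.util.Map;

/** Hypothetical per-file-type processor; interface and method names are assumptions. */
public interface FileProcessor {

    /** 1. Which additional meta-data fields this processor extracts
     *     (e.g. "fkCount" and "dbms" for SQL scripts). */
    String[] metadataFields();

    /** 2. Decide whether a downloaded file is worth keeping
     *     (e.g. discard SQL scripts without table definitions). */
    boolean accept(byte[] content);

    /** 3. Parse the file and return the extracted information, ready to be indexed
     *     (e.g. table names, fields and relationships). */
    Map<String, Object> process(byte[] content);
}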

Intelligent HTML Processor

Configured per domain (a sketch of the distinction follows below):
• Discover URLs, to collect in the DB and download
• Follow URLs, just to navigate through (no need to download everything)

URL patterns are defined in terms of:
• The actual links in webpages
• The HTML structure of webpages

Highly customizable (see back page)
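As a rough illustration of the follow/discover distinction, here is a hypothetical Java sketch; the class, method and collaborator names are assumptions, and the two patterns are borrowed from the SourceForge examples on the back page:

import java.util.regex.Pattern;

/** Hypothetical link classifier: a URL matching a "discover" pattern is stored and
 *  downloaded, one matching only a "follow" pattern is merely navigated through. */
public class LinkClassifier {

    private final Pattern follow   = Pattern.compile("http://sourceforge\\.net/projects/[^/]+/");
    private final Pattern discover = Pattern.compile("http://sourceforge\\.net/projects/[^\"]+/download");

    public void handle(String url, Frontier frontier, Database db) {
        if (discover.matcher(url).find()) {
            db.storeUrl(url);               // collect in the DB ...
            frontier.scheduleDownload(url); // ... and download the file
        } else if (follow.matcher(url).find()) {
            frontier.enqueue(url);          // navigate through it, but do not download
        }                                   // any other link is ignored
    }

    // Hypothetical collaborators, defined only to make the sketch self-contained.
    interface Frontier { void enqueue(String url); void scheduleDownload(String url); }
    interface Database { void storeUrl(String url); }
}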

Page 2: Crawling the Web for Structured Documents

<CrawlerSettings>
  <CrawlerId>
  <Threads>
    <Thread>
      <Priority> <UriType>
      <Target> <Uri>...
      <Avoid> <Uri>...
      <TryAnyUriTypeOnEmpty> <TryAnyUriOnEmpty>
    ...
  <DatabaseHost> <DatabaseName> <BatchSize> <WaitTimeForUris>
  <DownloadDirectory> <DownloadDirectoryDepth> <DownloadDirectoryWidth>
  <DownloadDirectoryPerUriType> <DownloadDirectoryFullPath>
  <UriTypes>
    <Type>
      <Name> <CanBeProcessed>
      <ProcessorAssembly> <ProcessorFullname> <ProcessorConfig>
    ...
  <Keywords>
    <Uri>...
  <Notification>
    <Server> <From>
    <To> <Address>...

<HTMLSettings>
  <SpamWords> <Word>...
  <SpamUris> <Uri>...
  <UserAgents> <String>...
  <MaxInMemoryFileSize> <DownloadBufferLength>
  <NormalizeUris> <UnescapeUris> <RemoveAnchors>
  <Domains>
    <Domain>
      <Uri> <MaxLevels> <CheckNoscript> <MaxQueueSize> <MaxTimeoutWait>
      <MaxDownloadAttempts> <MinTimeBetweenRequests> <MaxTimeBetweenRequests>
      <MaxRedirections> <UseSessions> <KeepAlive> <IgnoreCertificate>
      <AllowDeflate> <AllowGZIP>
      <InLinkFollow> <Uri>...
      <InPageFollow> <Uri>...
      <InLinkDiscover> <Uri>...
      <InPageDiscover> <Uri>...
    ...
  <FileTypes>
    <Type>
      <UriTypeName> <MinLength> <MaxLength>
      <Extensions> <Extension>...
      <MIMETypes> <Type>...
    ...

Create target URLs with patterns and keywords

<Keywords>
  <Uri><![CDATA[http://www.google.com/search?q=(?<key>+)+%2B"create+table"+filetype:sql&filter=0]]></Uri>
  <Uri><![CDATA[http://sourceforge.net/search/?type_of_search=soft&words=(?<key>+)]]></Uri>
</Keywords>
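The (?<key>+) token in the templates above appears to be a placeholder that the crawler fills in with the configured keywords (e.g. account, bank, deposit); the exact substitution rule is not described on the poster. A minimal Java sketch of such an expansion, assuming plain string replacement:

/** Minimal, hypothetical expansion of the (?<key>+) placeholder with crawl keywords. */
public class KeywordExpander {

    public static String expand(String template, String keyword) {
        // Replace the placeholder with the keyword (real code would also URL-encode it).
        return template.replace("(?<key>+)", keyword);
    }

    public static void main(String[] args) {
        String template =
            "http://www.google.com/search?q=(?<key>+)+%2B\"create+table\"+filetype:sql&filter=0";
        for (String keyword : new String[] {"account", "bank", "deposit"}) {
            System.out.println(expand(template, keyword));
        }
        // e.g. http://www.google.com/search?q=account+%2B"create+table"+filetype:sql&filter=0
    }
}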

Get results from Google Search

<InPageFollow>
  <Uri><![CDATA[<a href="(?<uri>[^"]+)"[^>]+id=pnnext]]></Uri>
</InPageFollow>
<InPageDiscover>
  <Uri><![CDATA[<h3.+?<a href="(?<uri>[^"]+)".+?</a></h3>]]></Uri>
</InPageDiscover>

Navigate through SourceForge's projects and get project files

<InLinkFollow>
  <Uri><![CDATA[(?<uri>http://sourceforge.net/projects/[^"]+/download)]]></Uri>
</InLinkFollow>
<InPageFollow>
  <Uri><![CDATA[<a href="(?<uri>[^"]+)">Next &rarr;</a>]]></Uri>
</InPageFollow>
<InLinkDiscover>
  <Uri><![CDATA[(?<uri>http://sourceforge.net/projects/[^/]+/)]]></Uri>
</InLinkDiscover>
<InPageDiscover>
  <Uri><![CDATA[Please use this <a href="(?<uri>[^"]+)"]]></Uri>
</InPageDiscover>