

JOURNAL OF TELECOMMUNICATIONS, VOLUME 18, ISSUE 1, JANUARY 2013

© 2012 JOT www.journaloftelecommunications.co.uk

Comparison of existing open-source tools for Web crawling and indexing of free Music

André Ricardo and Carlos Serrão

Abstract— This paper presents a portrait of existing open-source web crawler tools that also have an indexing component. The goal is to understand which tool is best suited to crawl and index a large collection of MP3 music files freely available on the Internet. In this study each piece of software is briefly described, with an overview, an identification of some of its users, and its main advantages and disadvantages. To better understand the most significant differences between the tools, a summary of features is presented: the programming language in which they are written, the platform used for deployment, the type of index used, database integration, front-end capabilities, the existence of a plugin system, and MP3 and Adobe Flash (SWF file) parsing support. Finally, the tools were classified according to the prospected collection size, being divided into tools for mirroring small collections, tools for medium collections, and software capable of handling large amounts of data. In conclusion, an assessment is made of which tools are best suited to handle large collections in a distributed way.

Index Terms— Content Analysis and Indexing, Information Storage and Retrieval, Information Filtering, Retrieval Process, Selection Process, Open Source, Creative Commons, Music, MP3.

—————————— ◆ ——————————

1 INTRODUCTION

The objective of this paper is to identify and study the tools that can be used to create an index similar to the ones used by existing commercial music recommendation systems, but with the purpose of indexing all freely available music on the Internet. The paper is primarily focused on the discovery and indexing of free music over the Internet as a way to create a large distributed database capable of offering meta-information and recommendation services.

In the first section of this paper, an overview is provided of the existing open-source tools that can be used for crawling and indexing content on the Internet (Table 1). This section also presents data on the most important characteristics of such tools, such as the programming language in which they were developed, the type of index created, database integration, front-end, plugin structure, and MP3 and Flash parsing support.

Concluding the analysis, the most relevant advantages and drawbacks of each tool are stated, followed by an assessment of how adequate the tool is to solve the problem addressed by this work.

Finally, some conclusions and future work are presented, taking into account the major objective of this work: the ability to develop an open and free music recommendation system.

2 TOOLS OVERVIEW

This section presents a summary of the different characteristics of each of the tools that were considered, condensed into a set of tables to facilitate the comparison process. Each tool is first introduced with a short description, stating the most notable users operating each piece of software, followed by an overview of its advantages and drawbacks. Considering all the software tools in analysis, Table 1 states the programming language used for their development (language), the platforms on which they run (platform), whether some type of index is produced by the web crawling tool (index) and, finally, possible connections to databases (database).

2.1 ASPSeek
ASPseek (http://www.aspseek.org/) consists of an indexing robot, a search daemon, and a CGI search front-end. The tool is outdated, which makes it an unreliable option in this scenario. Its major advantage is that it supports external parsers. However, as referred to before, the tool is outdated and cannot scale to global web crawling, since it is based on a relational database.

————————————————
• André Ricardo is with ISCTE Instituto Universitário de Lisboa (ISCTE-IUL), Av. das Forças Armadas, 1649-026 Lisboa, Portugal.
• Carlos Serrão is with ISCTE Instituto Universitário de Lisboa (ISCTE-IUL), IUL School of Technology and Architecture, Department of Information Science and Technology (ISTA/DCTI), Av. das Forças Armadas, 1649-026 Lisboa, Portugal.

2.2 Bixo
Bixo (http://openbixo.org) is a web-mining toolkit that runs as a series of Cascading pipes on top of Hadoop (it is used by companies/services such as Bebo, EMI Music, ShareThis and Bixo Labs). Bixo might be very interesting for projects looking for a web-mining framework that can be integrated with existing information systems - for example, to inject data into a data-warehouse system. Based on the Cascading API and running on a Hadoop cluster, Bixo is suitable for crawling large collections. In a project that needs to handle large collections and to feed data into existing systems, Bixo is a tool to look at closely.

Bixo's major advantages are its orientation towards data mining and its capability to support large sets of data, as demonstrated with the Public Terabyte Dataset [19][18]. Its major drawback is its limited built-in support for creating an index.
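To make the "series of Cascading pipes" concrete, the sketch below assembles a minimal Cascading 1.x flow of the kind Bixo builds on: it reads a list of URLs from Hadoop storage and keeps only the MP3 links. It deliberately uses plain Cascading operations rather than Bixo's own fetch pipes, and the paths and field names are illustrative assumptions.

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    public class Mp3UrlFlow {
        public static void main(String[] args) {
            // Source and sink are Hadoop taps; the paths are hypothetical.
            Tap source = new Hfs(new TextLine(new Fields("line")), "crawl/urls");
            Tap sink = new Hfs(new TextLine(), "crawl/mp3-urls");

            // A pipe assembly: each input line passes through a regex filter
            // that keeps only URLs ending in ".mp3".
            Pipe pipe = new Pipe("mp3-filter");
            pipe = new Each(pipe, new Fields("line"), new RegexFilter(".*\\.mp3$"));

            // Connect the pipe to its taps and run it as a Hadoop job.
            Flow flow = new FlowConnector(new Properties()).connect(source, sink, pipe);
            flow.complete();
        }
    }

In a real Bixo deployment the filter step would be preceded by Bixo's fetch pipes, and the sink would typically feed an existing data-warehouse workflow rather than a text file.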

2.3 Crawler4J
Crawler4j (http://code.google.com/p/crawler4j/) is a tool that provides a programmable interface for crawling. It is a piece of source code to incorporate into a project, but there are more suitable tools to index content. Its main advantage is that it is easy to integrate into Java projects that need a crawling component. On the other hand, it does not offer support for "robots.txt" nor for pages without UTF-8 encoding, and it is necessary to create the entire complementary indexing framework.
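As a sketch of that programmable interface, the fragment below subclasses crawler4j's WebCrawler to report MP3 links. It follows the project's published examples, whose exact signatures vary between crawler4j versions, and the URL filtering rules are illustrative assumptions.

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class Mp3Crawler extends WebCrawler {

        // Decide which links the crawler should follow: skip static assets,
        // but always accept MP3 files.
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            String href = url.getURL().toLowerCase();
            return href.endsWith(".mp3")
                || !href.matches(".*\\.(css|js|gif|jpg|png|zip)$");
        }

        // Called for every fetched page. Indexing is left entirely to the
        // caller, since crawler4j itself provides no indexing component.
        @Override
        public void visit(Page page) {
            String url = page.getWebURL().getURL();
            if (url.toLowerCase().endsWith(".mp3")) {
                System.out.println("Found MP3: " + url);
            }
        }
    }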

2.4 DataparkSearch
DataparkSearch (http://www.dataparksearch.org/) is a web crawler and search engine (used, for instance, by News Lookup). DataparkSearch benefits from MP3 and Flash parsing but, due to lack of development, it still uses outdated technology such as CGI and does not have a modular architecture, making it difficult to extend. The index is not in a format that can be used by other frameworks. The major advantage of this tool is its support for MP3 and Flash parsing. On the other hand, it still uses outdated technology and its development seems to have stopped.

2.5 Ebot
Ebot (http://www.redaelli.org/matteo-blog/projects/ebot/) is a web crawler written in Erlang. There is no proof of concept that Ebot would scale well enough to index the desired collection. Because Erlang and CouchDB were used to solve the crawl and search problem, people keen on these technologies might find this tool attractive. Ebot is distributed and scalable [8]; however, there is only one developer active in the project and there is no proven working system deployed.

2.6 GNU Wget
GNU Wget (http://www.gnu.org/software/wget/) is a non-interactive command-line tool to retrieve files over the most widely used Internet protocols. Wget is a really useful command-line tool to download a simple HTML website, but it does not offer indexing support; it is limited to the mirroring and downloading process. Its main advantage is that, with simple commands, it is easy to mirror an entire website or to explore the whole site structure. However, the entire indexing infrastructure still has to be created, and it is primarily built for pages working with HTML, with no Flash or Ajax support.
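As an illustration, a single command of the following form mirrors a site while keeping only the MP3 files it finds; the flags are standard Wget options and the URL is a placeholder.

    wget --mirror --no-parent --accept mp3 --wait=1 http://example.com/music/

Note that in recursive mode Wget still has to fetch the HTML pages it needs for link extraction; files that do not match the accept list are deleted after being downloaded.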

2.7 GRUB
GRUB (http://grub.org/) is a web crawler with distributed crawling. GRUB's distributed solution still requires a proof of concept that it is suitable for a large-scale index; it also requires proof that distributed crawling is a better solution than centralized crawling. GRUB tries a new approach to searching by distributing the crawling process. However, the documentation is incomplete and it was banned from Wikipedia for bad crawling behavior. According to the Nutch FAQ, distributed crawling may not be a good deal: while it saves bandwidth, in the long run this saving is not significant, because more bandwidth is required to upload query result pages, "making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a large search engine is not crawling, but searching". Project development seems to have halted, since there has been no news since 2009.

2.8 Heritrix
Heritrix (http://crawler.archive.org/) is an extensible, web-scale, archival-quality web crawler project (it is used by the Internet Archive and by "Arquivo da Web Portuguesa"). Heritrix is the piece of software written and used by the Internet Archive to make copies of the Internet. The disadvantage of Heritrix is the lack of indexing capabilities; the content is stored in ARC files [2]. It is a really good solution for archiving websites and making copies for future reference. Heritrix is use-case proven by the Internet Archive and is well adjusted to making copies of websites. However, the ARC files need to be processed afterwards, and the architecture is more monolithic, not designed for adding parsers and extensibility.

2.9 ht://Dig
ht://Dig (http://www.htdig.org/) is a search engine and web crawler. ht://Dig is a search system oriented towards generating search for a single website, for example a website already built in HTML that wants to add search functionality. Until 2004, the date of its last release, it was one of the most popular web crawler and search engine combinations, enjoying a large user base with notable sites such as the GNU Project and the Mozilla Foundation; with no updates over time, it slowly lost most of that user base to newer solutions.

2.10 HTTrack
HTTrack (http://www.httrack.com/) is a website mirroring tool. HTTrack is designed to create mirrors of existing sites, not for indexing. It is a good tool for users who are unfamiliar with web crawling and enjoy a good GUI. HTTrack can follow links that are generated with basic JavaScript and inside Applets or Flash [11]. However, HTTrack does not integrate with indexing systems.

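Although the text highlights its GUI, HTTrack exposes the same mirroring through the command line; in this hypothetical invocation the trailing pattern is an HTTrack filter that additionally accepts MP3 files.

    httrack "http://example.com/music/" -O ./mirror "+*.mp3"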

2.11 Hyper Estraier
Hyper Estraier (http://fallabs.com/hyperestraier/index.html) is a full-text search engine system (used by the GNU Project). Hyper Estraier has characteristics such as high-performance search and P2P support, making it an interesting solution for adding search to an existing website. The GNU Project uses Hyper Estraier to search its large number of documents, making it a good solution for collections of approximately 8 thousand documents in size. The tool is useful for adding search functionality to a site and it offers P2P support. However, it has only one core developer.

2.12 mnoGoSearch
mnoGoSearch (http://www.mnogosearch.org/) is a web search engine (one of the users of this tool is MySQL). mnoGoSearch is a solution for a small enterprise appliance, adding search capability to an existing site or intranet. The project is somewhat outdated and, due to the dependency on a specific vendor, other solutions should be considered. One of its major advantages is that MySQL uses it. On the other hand, there is little information about scalability and extensibility, and it is extremely dependent on the vendor Lavtech for future development.

2.13 Nutch
Nutch (http://nutch.apache.org/) is a web search engine with a crawler, link-graph database, parsers and a plugin system (it is used on sites such as Creative Commons and Wikia Search). Nutch is one of the most developed and active projects in the web crawling field. The need to scale and distribute Nutch led Doug Cutting, the project creator, to start developing Hadoop - a framework for reliable, scalable and distributed computing. This means that not only is the project developing itself, but it also works with Hadoop, Lucene, Tika and Solr; the project is seeking to integrate other pieces of software, such as HBase, too [5]. Another strong point for Nutch is the existence of deployed systems with published case studies [14][16]. The biggest drawback of Nutch is the configuration and tuning process, combined with the need to understand how the crawler works in order to get the desired results. For large-scale web crawling, Nutch is a stable and complete framework. The major advantages of Nutch can be summarized as follows:

• Nutch has a highly modular architecture, allowing developers to create plugins for the following activities: media-type parsing, data retrieval, querying and clustering [12].

• Nutch works on top of the Hadoop framework, so it features cluster capabilities, distributed computation (using MapReduce) and a distributed filesystem (HDFS) if needed.

• It was built with scalability and cost effectiveness in mind [6].

• It supports parsing and indexing a diverse range of documents using Tika, a toolkit to detect and extract metadata.

• It has an integrated Creative Commons plugin.

• Other languages, such as Python, can be used to script Nutch.

• There is an adaptation of Nutch called NutchWAX (Nutch Web Archive eXtensions) allowing Nutch to open the ARC files used by Heritrix.

• It is a top-level Apache project, with a high level of expertise and visibility around it.

However, Nutch has some complexity, and the integrated MP3 parser, based on the "Java ID3 Tag Library", is deprecated and did not work when tested.
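To illustrate the kind of configuration effort referred to above, a small crawl with the Nutch 1.x releases contemporary with this paper is typically launched as follows, after seeding a URL list; the directory names follow the conventions of the Nutch tutorial of the period.

    mkdir urls
    echo "http://example.com/" > urls/seed.txt
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

Most of the tuning effort then goes into conf/nutch-site.xml, in particular the plugin.includes property, which selects the parsing and indexing plugins that run during the crawl.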

2.14 Open Search Server
Open Search Server (http://www.open-search-server.com/) is a search engine with support for business clients. Open Search Server is a good solution for small appliances; unfortunately, it is not well documented in terms of how extensible it is. The tool is quite easy to deploy and set running. However, it is dependent on the commercial component for development, has a small community and scarce documentation, has some problems handling special characters, and there is little information on extending the software.

2.15 OpenWebSpider
OpenWebSpider (http://www.openwebspider.org/) is a web spider for the .NET platform. This is an interesting project, based on the .NET framework and C# programming, for those intending to build a small to medium-sized data collection. It supports MP3 indexing and offers crawling and database integration. However, it has only one developer; the source is disclosed but, since no one else is working on the project and there is no source code repository, it does not behave as a real open-source project. The Mono framework might constitute a problem for those concerned with patent issues, there is no proof of concept, and the use of a relational database might not scale well.

2.16 Pavuk
Pavuk (http://pavuk.sourceforge.net/) is a web crawler. Pavuk is a complement to tools like Wget, but it does not offer indexing functionality. Its main advantage is that it complements solutions like Wget and HTTrack with filters based on regular expressions and similar functions. However, development has been stopped since 2007 and it has no indexing features.

2.17 Sphider
Sphider (http://www.sphider.eu/) is a PHP search engine. Sphider is a complete solution, with crawler and web search, that can run on a server with just PHP and MySQL. It might be a good solution, with few requirements, for adding integrated search functionality to existing web appliances, and it is easy to set up and integrate into an existing solution. However, the index is a relational database and might not scale well to millions of documents.


2.18 Xapian
Xapian (http://xapian.org/) is a search engine that relies on ht://Dig for crawling. If a project has no problem in using CGI and relying on an outdated crawler, and instead puts the effort into having Linux distribution packages, then this software can be an option. Xapian's major advantages include:

• Xapian "currently indexes over 50 million mail messages" in "gmane" lists, proving that it can handle a collection at least that size;
• Scaling to large document collections;
• Still in active development;
• Packages for some Linux distributions.

The major disadvantages of Xapian are that the index can only be used by Xapian, that it uses CGI, and that it is totally dependent on ht://Dig for crawling.

2.19 YaCy
YaCy (http://yacy.net/) is a free distributed search engine built on the principles of peer-to-peer (P2P) networks (one of its notable users is ScienceNet). For scientific projects like ScienceNet, which have several machines across the world with different architectures, it can be considered a good solution. YaCy is a distributed search engine following the P2P model: it is decentralized, and even if one node goes down the search engine continues to work. It is easy to get YaCy working and it is quick to set up a P2P search network. Nevertheless, it is hard to understand how customizable YaCy is outside the existing parameters, and P2P search can be slow according to the Nutch FAQ [9].

3 FEATURES COMPARISON

Considering all the different software tools in analysis, Table 1 presents the programming language used for their development and the different platforms on which they run. It also considers whether some type of index is produced by the web crawling tools (index) and, finally, whether connections to databases are possible. The following table (Table 2) summarizes, for each tool:

• the tool's front-end capabilities;
• the tool's support for plugins; and
• MP3 or Adobe Flash parsing support.

Flash support is an important issue to address. Due to the architecture of sites which use this technology exclusively, without any HTML link structure to navigate, or with links to content directly inside Flash files (SWF files), it is necessary to be able to parse this content. The goal is to understand the extensibility, flexibility and maintainability of each of the different solutions considered.
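Because MP3 parsing recurs throughout the comparison, the sketch below shows what metadata extraction looks like with Apache Tika, the parser toolkit used by Nutch; the metadata keys follow Tika's XMPDM naming for audio and the file name is a placeholder.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class Mp3MetadataExample {
        public static void main(String[] args) throws Exception {
            // Tika detects the content type (here audio/mpeg) and
            // dispatches to the matching parser automatically.
            AutoDetectParser parser = new AutoDetectParser();
            Metadata metadata = new Metadata();

            try (InputStream stream = new FileInputStream("song.mp3")) {
                parser.parse(stream, new BodyContentHandler(), metadata,
                             new ParseContext());
            }

            // ID3 tags are exposed under Tika's XMPDM metadata keys.
            System.out.println("Title:  " + metadata.get("title"));
            System.out.println("Artist: " + metadata.get("xmpDM:artist"));
            System.out.println("Album:  " + metadata.get("xmpDM:album"));
        }
    }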

Table 1. Open source tools for web crawling and indexing: programming language, index type and database

Name                 Language           Platform         Index           Database
ASPseek              C++                Linux            Relational DB   SQL, binary
Bixo                 Java               Cross-platform   N/A             Possible integration
crawler4j            Java               Cross-platform   N/A             -
DataparkSearch       C                  Cross-platform   SQL             MySQL, PostgreSQL
Ebot                 Erlang             Linux            NoSQL           CouchDB
GNU Wget             C                  Linux            File mirror     -
GRUB                 C#                 Cross-platform   Relational DB   MySQL
Heritrix             Java               Unix             ARC files       -
Hounder              Java               Cross-platform   Lucene          -
ht://Dig             C++                Unix             Disk files      -
HTTrack              C/C++              Cross-platform   Mirror files    -
Hyper Estraier       C/C++              Cross-platform   QDBM            -
mnoGoSearch          C                  Windows          Relational DB   MySQL, PostgreSQL, SQLite
Nutch                Java               Cross-platform   Lucene          -
Open Search Server   C/C++, Java, PHP   Cross-platform   Lucene          -
OpenWebSpider        C#, PHP            Cross-platform   Relational DB   MySQL
Pavuk                C                  Unix             Mirror files    -
Sphider              PHP                Cross-platform   Relational DB   MySQL
Xapian               C++                Cross-platform   Omega           -
YaCy                 Java               Cross-platform   NoSQL           -


4 OVERVIEW

Of all the open-source tools considered and analyzed, the ones with recent development also reveal a trend towards making scalability a core issue. Tools like Bixo, Heritrix, Nutch and YaCy are designed to handle large data collections, as the Web grows bigger. According to their functionalities and capabilities, the web crawler tools can be grouped into three different categories:

• Mirroring a collection, with tools that do not do indexing but only produce integral copies of websites;
• Medium collection crawling;
• Large collection crawling and indexing.

Table 2. Open source tools for web crawling and indexing: front-end, plugin support, and MP3/Flash parsing

Name                 Front-end         Plugins                       MP3                    Flash
ASPseek              CGI               External converter programs   N/A                    N/A
Bixo                 -                 Cascading pipes               -                      -
crawler4j            API               -                             -                      -
DataparkSearch       N/A               External parsers              Built-in               Via external parsers
Ebot                 Web services      Extensible                    N/A                    N/A
GNU Wget             CLI               -                             -                      -
GRUB                 PHP               -                             -                      -
Heritrix             CLI, JSP          -                             -                      -
Hounder              JSP               Uses Nutch plugins            -                      -
ht://Dig             CGI               External parsers              Via external parsers   -
HTTrack              GUI               -                             -                      Follow links
Hyper Estraier       CGI               API                           -                      -
mnoGoSearch          CGI, PHP, Perl    -                             Built-in               -
Nutch                CLI, JSP          Plugins system                Deprecated             -
Open Search Server   Web based         -                             -                      -
OpenWebSpider        CLI, Web based    -                             UltraID3Lib            -
Pavuk                CLI               -                             -                      -
Sphider              PHP               External parsers              -                      -
Xapian               CGI, XML          Uses Omega                    N/A                    N/A
YaCy                 Web based         -                             -                      -

It is important to note that the distinction between a medium and a large collection is hard to make. In this study it was considered that a large collection means near whole-web crawling (more than 200 million documents), while medium means a subset of the Web (50 to 200 million documents). The differentiation between medium and large was also made taking into account the largest known deployed system for each tool, because some tools claimed the ability to perform large web crawls but lacked a proof of concept. Therefore, the following classification was made.

Mirroring a collection

• GNU Wget, Heritrix, HTTrack, Pavuk

Medium collection

• ASPseek, crawler4j, DataparkSearch, Ebot, GRUB, Hounder, ht://Dig, Hyper Estraier, mnoGoSearch, Open Search Server, OpenWebSpider, Sphider, Xapian

Large collection

• Bixo, Nutch and YaCy

5 CONCLUSION

For most situations where only an enterprise intranet or a small, specific subset of the Web needs to be processed (referred to in this paper as a medium collection), lighter tools with faster and easier configuration can be sufficient. In this case the range of open-source tools available to choose from is broad and there is no clear software that is more suitable than the others. Programming language and indexing system tend to be the two key factors in choosing the right software for the task, hence the comparison in Table 1.

When looking for solutions for large collections, YaCy, with its P2P framework, is an option. This software is interesting when speed is not crucial and the focus goes into a distributed architecture and an easy setup. To provide reliable, fast and scalable computing, Bixo and


Nutch are the best answer. This is supported in part because both rely on Hadoop, an industry-wide adopted and proven framework, with several success cases such as the Yahoo clusters [22][20][21], Facebook [10], Last.fm [7] and Spotify [4][13]. These are just a few examples of organizations using Hadoop; the list is much more comprehensive [17]. The main difference between them is that Bixo relies on Cascading to complete the workflow and does not do indexing, while Nutch indexes using Lucene. In general, solutions using the Lucene index tend to have fast retrieval times and to require little disk space (good characteristics for a search engine) in comparison with other solutions [15]. If the choice has to be made between Bixo and Nutch, it depends on the goal: to integrate with an existing system and workflow in order to do data mining or related jobs, choose Bixo; to build a system with a search engine to handle a massive document collection, Nutch is the tool of choice [1][3].
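For reference, the indexing step that Nutch performs internally reduces, in plain Lucene, to adding field-structured documents to an IndexWriter. The sketch below uses the Lucene 3.x API contemporary with this paper; the field names and values are illustrative assumptions, not Nutch's actual schema.

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class Mp3IndexExample {
        public static void main(String[] args) throws Exception {
            // Open an on-disk index and configure a standard analyzer.
            Directory dir = FSDirectory.open(new File("index"));
            IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
            IndexWriter writer = new IndexWriter(dir, config);

            // One document per crawled MP3: the URL is stored verbatim,
            // while artist and title are analyzed for full-text search.
            Document doc = new Document();
            doc.add(new Field("url", "http://example.com/song.mp3",
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("artist", "Example Artist",
                              Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("title", "Example Song",
                              Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();
        }
    }

A search component would then open the same directory with an IndexSearcher, which is essentially what Nutch's web front-end does on top of its own crawl data.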

REFERENCES
[1] Tutorial - T6: IR Prototypes and Web Search Hacks with Open Source Tools | SIGIR'09.
[2] Internet Archive ARC files. http://crawler.archive.org/articles/developer_manual/arcs.html. Accessed: 09-29-2010.
[3] A Comparison of Open Source Search Engines « Vik Singh. http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/. Accessed: 09-26-2010.
[4] Bernhardsson, E. 2009. Implementing a Scalable Music Recommender System.
[5] Bialecki, A. 2009. Nutch, web-scale search engine toolkit. ApacheCon 2009, Oakland.
[6] Cafarella, M. and Cutting, D. 2004. Building Nutch: Open source search. Queue. 2, 2 (2004), 61.
[7] Dittus, M. 2008. Hadoop at Last.fm.
[8] Ebot | Matteo Redaelli. http://www.redaelli.org/matteo-blog/projects/ebot/. Accessed: 09-21-2010.
[9] FAQ - Nutch Wiki. http://wiki.apache.org/nutch/FAQ#Will_Nutch_be_a_distributed.2C_P2P-based_search_engine.3F. Accessed: 09-17-2010.
[10] HDFS: Facebook has the world's largest Hadoop cluster! http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html. Accessed: 09-25-2010.
[11] HTTrack Website Copier - Offline Browser. http://www.httrack.com/html/faq.html. Accessed: 01-09-2011.
[12] Khare, R., Cutting, D. et al. 2004. Nutch: A flexible and scalable open-source web search engine. Oregon State University. (2004).
[13] Kreitz, G. and Niemela, F. 2010. Spotify – Large Scale, Low Latency, P2P Music-on-Demand Streaming. Peer-to-Peer Computing (P2P), 2010 IEEE Tenth International Conference on (2010), 1–10.
[14] Michael, M., Moreira, J.E. et al. 2007. Scale-up x scale-out: A case study using Nutch/Lucene. IEEE International Parallel and Distributed Processing Symposium, 2007. IPDPS 2007 (2007), 1–8.
[15] Middleton, C. and Baeza-Yates, R. A Comparison of Open Source Search Engines.
[16] Moreira, J.E., Michael, M.M. et al. 2007. Scalability of the Nutch search engine. Proceedings of the 21st Annual International Conference on Supercomputing (2007), 12.
[17] PoweredBy - Hadoop Wiki. http://wiki.apache.org/hadoop/PoweredBy. Accessed: 09-27-2010.
[18] Public Terabyte Dataset Project « Elastic Web Mining | Bixo Labs. http://bixolabs.com/datasets/public-terabyte-dataset-project/. Accessed: 09-25-2010.
[19] San Francisco Bay Area ACM, Archive » DM SIG – ACM Silicon Valley Data Mining Camp on November 1, 2009. http://www.sfbayacm.org/?p=894. Accessed: 09-25-2010.
[20] Scalability of the Hadoop Distributed File System (Yahoo! Hadoop Blog). http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html. Accessed: 09-25-2010.
[21] Shvachko, K., Kuang, H. et al. 2010. The Hadoop Distributed File System. 26th IEEE Symposium on Massive Storage Systems and Technologies (MSST2010). (2010).
[22] Yahoo! Launches World's Largest Hadoop Production Application (Yahoo! Hadoop Blog). http://developer.yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html. Accessed: 09-25-2010.

André Ricardo holds an MSc in Management and Computer Science (2010) from ISCTE-IUL and a BSc in Management and Computer Science (2008), also from ISCTE-IUL.

Carlos Serrão holds a PhD in Distributed Systems and Computer Architecture (2008) from UPC, an MSc in Information Systems Management (2004) from ISCTE-IUL and a BSc in Management and Computer Science (1997) from ISCTE-IUL. He is currently an Assistant Professor at ISCTE-IUL.