A discussion on “Distributed Indexing of Web Scale Datasets for the Cloud”[1]

Speakers: Vasileios Komianos, Georgios Tsoumanis, Eleni Moustaka

Supervisor: Spyridon Sioutas

Ionian University, Dept. of Informatics, Postgraduate

For the course: Advanced Topics in Database Systems

The focus of this presentation is a distributed architecture for indexing large datasets, from now on called the System. Hadoop, MapReduce, HBase and NoSQL databases are terms used often in this presentation, as they are the keystone technologies enabling such tasks.

Why Cloud?

• Cost

• Device and Location Independence

• Virtualization

• Performance

• Scalability

• Infrastructure as a Service

• Platform as a Service

• Software as a Service

Why Web Scale?

• Google

• Facebook

• Wikipedia

• Amazon

• Internet Archive

Why Distributed?

• Huge volumes of data

• Computational problems

• Failure tolerance

• Scalability

What Hadoop[2] is

Hadoop is an open-source Java framework for distributed processing of large data sets, built on a distributed file system called HDFS[3] and the MapReduce[4] programming model.

Hadoop Architecture

Hadoop: HDFS (NameNode, DataNodes) and MapReduce (JobTracker, TaskTrackers)

Usually the NameNode is at the same time the JobTracker, and the DataNodes are also TaskTrackers.
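The MapReduce model mentioned above can be illustrated with a minimal single-process sketch. This is not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` and the sample records are assumptions made for illustration, and the word count task is only the customary toy example of the model.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit a (word, 1) pair for every word of one input record."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence inside one process."""
    intermediate = []
    for doc_id, text in records.items():
        intermediate.extend(map_phase(doc_id, text))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


# Hypothetical input records, keyed by document id.
records = {"doc1": "cloud index cloud", "doc2": "index data"}
counts = run_job(records)
print(counts)
```

In a real cluster the mappers and reducers would run as tasks on the TaskTrackers, with the input split across DataNodes; here everything runs in one loop only to show the data flow.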

What HBase[5] is

HBase is an open-source distributed data store belonging to the category of NoSQL databases. It can store large data sets that are structured, semi-structured or unstructured, and it also offers rapid query execution.

HBase Architecture

HBase: HMaster and Region Servers

HBase runs on top of Hadoop and is modelled after Google’s Bigtable[6]. ACIDity* is sacrificed to improve performance and scalability.

*ACID: Atomicity, Consistency, Isolation and Durability

HBase characteristics

• NoSQL

• Schema free

• Very large tables

• Scalable

• Sharding

• JSON enabled

NoSQL Paradigm: MongoDB[7]

> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({Presentation: "NoSQL databases"})
> db.test.find()
{ "_id" : ObjectId("4fbac827f119ef630e74638d"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac835f119ef630e74638e"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac85df119ef630e74638f"), "Presentation" : "NoSQL databases" }
>

MongoDB is an easy-to-use NoSQL database. It is free and it is supported by a large community, which makes it suitable when there is no previous NoSQL experience.

Properties illustrated by the session above: NoSQL, JSON, schema free.

System Architecture

Datasets → Uploader (MapReduce task) → Content table → Indexer (MapReduce task) → Index table

Client API: Search, Get

The cluster consists of 1 master and 11 worker nodes, having 66 Mappers and 22 Reducers.

The dataset is composed of 23GB of structured data, 300GB of semi-structured data and 20GB of unstructured data.
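The flow between the architecture components (upload into a content table, indexing into an index table, then Search and Get through the client API) can be sketched in a few lines. This is a toy in-memory model, not the System's implementation: the row key format, the `(attribute, term)` index key and all function names are assumptions made for illustration, and plain dictionaries stand in for the HBase tables.

```python
from collections import defaultdict


def upload(raw_records):
    """Uploader step: store each raw record in the content table under a row key."""
    return {f"row{i}": record for i, record in enumerate(raw_records)}


def index_map(row_key, record):
    """Indexer mapper: emit ((attribute, term), row_key) for every term of a record."""
    for attribute, value in record.items():
        for term in str(value).lower().split():
            yield (attribute, term), row_key


def build_index(content_table):
    """Indexer reducer: gather the row keys of each (attribute, term) pair."""
    index_table = defaultdict(set)
    for row_key, record in content_table.items():
        for key, value in index_map(row_key, record):
            index_table[key].add(value)
    return {key: sorted(rows) for key, rows in index_table.items()}


def search(index_table, attribute, term):
    """Client API, Search: look up matching row keys in the index table."""
    return index_table.get((attribute, term.lower()), [])


def get(content_table, row_key):
    """Client API, Get: fetch the full record from the content table."""
    return content_table[row_key]


# Hypothetical records standing in for an uploaded dataset.
raw = [
    {"title": "Cloud indexing", "body": "HBase stores the index"},
    {"title": "Hadoop basics", "body": "MapReduce and HDFS"},
]
content = upload(raw)
index = build_index(content)
hits = search(index, "title", "cloud")
print(hits, get(content, hits[0]))
```

Keeping the index in a table separate from the content means a search only touches the small index entries, and the full record is read only for the row keys that matched.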

The experiment

The purpose was to test the System’s performance in various conditions such as:

• several dataset sizes,

• different dataset types,

• varying number of nodes,

• different index rules.

Index creation time

The TXT dataset is the most demanding to process when indexed.

[Figure: 5GB HTML dataset index creation time for different index rules. x-axis: iteration no. (1: 7 indexed tags, 2: 14, 3: 19, 4: 27); y-axis: time (min), 0 to 12.]

[Figure: 5GB HTML index size for different index rules. x-axis: iteration no. (1: 7 tags indexed (table, li, p, b, i, u, title), 2: 14 tags, 3: 19, 4: 27); y-axis: index size (GB), 0 to 1.4.]
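An index rule, as used in the figures above, lists the HTML tags whose text gets indexed. The following sketch shows one possible way to apply such a rule with Python's standard `html.parser` module. The class name `TagTextExtractor`, the helper `apply_rule`, the sample page and the two rules are assumptions made for illustration, not the System's actual rule format.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside the tags listed in an index rule."""

    def __init__(self, indexed_tags):
        super().__init__()
        self.indexed_tags = set(indexed_tags)
        self.open_tags = []   # stack of currently open indexed tags
        self.extracted = []   # (tag, text) pairs that would be indexed

    def handle_starttag(self, tag, attrs):
        if tag in self.indexed_tags:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            # Attribute the text to the innermost open indexed tag.
            self.extracted.append((self.open_tags[-1], text))


def apply_rule(html, indexed_tags):
    """Return the (tag, text) pairs selected by one index rule."""
    parser = TagTextExtractor(indexed_tags)
    parser.feed(html)
    parser.close()
    return parser.extracted


# Hypothetical page and rules; the second rule indexes more tags than the first.
page = ("<html><head><title>Cloud</title></head>"
        "<body><h1>Intro</h1><p>Big <b>data</b></p></body></html>")
small_rule = ["title", "p"]
large_rule = ["title", "p", "b", "h1"]

small_entries = apply_rule(page, small_rule)
large_entries = apply_rule(page, large_rule)
print(len(small_entries), len(large_entries))
```

On this toy page the rule with more tags selects more entries, because text under `h1` is skipped by the smaller rule. The sketch says nothing about the measured creation times or index sizes.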

System performance under query load

• Client instances were run concurrently on 14 machines sending queries to the system.

• Types of queries: exact on a specific attribute, exact on any attribute, range on any attribute.

• Range query loads above 140 queries/sec failed.

• Tests were run with a load of 14 queries/sec.

Response time per request:

• Exact specific queries: 20 ms.

• Exact any queries: 150 ms.

• Range any queries: 27 s.
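The three query types can be made concrete with a toy in-memory index. The layout below (one sorted list of `(value, row_key)` pairs per attribute) and all function names are assumptions made for illustration. The sketch does not model HBase scans or the measured response times.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index: attribute -> sorted list of (value, row_key) pairs.
index = {
    "title": sorted([("cloud", "row0"), ("hadoop", "row1")]),
    "body": sorted([("cloud", "row1"), ("hbase", "row0"), ("index", "row0")]),
}


def range_in_attribute(index, attribute, low, high):
    """Row keys whose value in one attribute lies in the closed range [low, high]."""
    entries = index.get(attribute, [])
    values = [value for value, _ in entries]
    start = bisect_left(values, low)
    end = bisect_right(values, high)
    return [row for _, row in entries[start:end]]


def exact_specific(index, attribute, value):
    """Exact query on a specific attribute: a range query with low == high."""
    return range_in_attribute(index, attribute, value, value)


def exact_any(index, value):
    """Exact query on any attribute: repeat the lookup for every attribute."""
    rows = set()
    for attribute in index:
        rows.update(exact_specific(index, attribute, value))
    return sorted(rows)


def range_any(index, low, high):
    """Range query on any attribute: scan a value range in every attribute."""
    rows = set()
    for attribute in index:
        rows.update(range_in_attribute(index, attribute, low, high))
    return sorted(rows)


print(exact_specific(index, "title", "cloud"))
print(exact_any(index, "cloud"))
print(range_any(index, "h", "i"))
```

In this sketch a query on any attribute repeats the work once per attribute, and a range query returns every entry between the two bounds rather than a single key.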

References

[1] Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos and Nectarios Koziris: Distributed Indexing of Web Scale Datasets for the Cloud. In MDAC ’10, April 26, 2010 Raleigh, NC, USA.

[2] http://hadoop.apache.org/

[3] KV Shvachko: HDFS Scalability: The limits to growth. The USENIX Magazine, v35 i2, 2010, usenix.org

[4] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.

[5] Ankur Khetrapal, Vinay Ganesh: HBase and Hypertable for large scale distributed storage systems, Dept. of Computer Science, Purdue University

[6] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2008. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages.

[7] http://www.mongodb.org
