A discussion on “Distributed Indexing of Web Scale Datasets for the Cloud” [1]. Speakers: Vasileios Komianos, Georgios Tsoumanis, Eleni Moustaka. Supervisor: Spyridon Sioutas. Ionian University, Dept. of Informatics, Postgraduate. For the course: Advanced Topics in Database Systems.

Page 1

A discussion on “Distributed Indexing of Web Scale Datasets for the Cloud” [1]

Speakers: Vasileios Komianos, Georgios Tsoumanis, Eleni Moustaka

Supervisor: Spyridon Sioutas

Ionian University, Dept. of Informatics, Postgraduate

For the course: Advanced Topics in Database Systems

Page 2

The focus of this presentation is a distributed architecture, from now on called the System, for indexing large datasets. Hadoop, MapReduce, HBase and NoSQL databases are terms used often in this presentation, as they are the key technologies enabling such tasks.

Page 3

Why Cloud?

• Cost

• Device and Location Independence

• Virtualization

• Performance

• Scalability

• Infrastructure as a Service

• Platform as a Service

• Software as a Service

Page 4

Why Web Scale?

• Google

• Facebook

• Wikipedia

• Amazon

• Internet Archive

Page 5

Why Distributed?

• Huge volumes of data

• Computational problems

• Failure tolerance

• Scalability

Page 6

What Hadoop [2] is

It is an open-source Java framework capable of distributed processing of large data sets, using a distributed file system called HDFS [3] and the MapReduce [4] model.
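
As a rough illustration of the MapReduce model mentioned above, the following single-process Python sketch runs a word count in three phases: map, shuffle and reduce. It does not use Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for this example only.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, 1) pair per word in the document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine the values for one key into a single result."""
    return word, sum(counts)


def word_count(documents):
    """Run the three phases in sequence over a dict of {doc_id: text}."""
    emitted = []
    for doc_id, text in documents.items():
        emitted.extend(map_phase(doc_id, text))
    return dict(reduce_phase(w, c) for w, c in shuffle(emitted).items())


if __name__ == "__main__":
    docs = {"d1": "Hadoop stores data", "d2": "Hadoop processes data"}
    print(word_count(docs))
```

In a real cluster the map and reduce calls would run in parallel on different nodes, and the shuffle would move data between them over the network; here everything runs in one process for readability.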

Page 7

Hadoop Architecture

Hadoop: HDFS and MapReduce

HDFS: NameNode, DataNodes

MapReduce: JobTracker, TaskTrackers

Usually the NameNode also acts as the JobTracker, and the DataNodes also act as TaskTrackers.

Page 8

What HBase [5] is

An open-source distributed data store belonging to the category of NoSQL databases. HBase can store large data sets that are structured, semi-structured or unstructured, and it also offers rapid query execution.

Page 9

HBase Architecture

HBase: HMaster, Region Servers

HBase runs on top of Hadoop and is modelled after Google’s Bigtable [6]. ACID* guarantees are sacrificed to improve performance and scalability.

*ACID: Atomicity, Consistency, Isolation and Durability

Page 10

HBase characteristics

• NoSQL

• Schema free

• Very large tables

• Scalable

• Sharding

• JSON enabled
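
The sharding bullet can be pictured as splitting a table's sorted row-key space into contiguous ranges, each served by one region server. The sketch below is a plain-Python illustration under that assumption; it is not the HBase client API, and the split keys and server names are hypothetical.

```python
import bisect

# Hypothetical split points: keys < "g" fall in region 0, "g" <= key < "n"
# in region 1, "n" <= key < "t" in region 2, and all other keys in region 3.
SPLIT_KEYS = ["g", "n", "t"]
REGION_SERVERS = ["rs-0", "rs-1", "rs-2", "rs-3"]  # hypothetical names


def locate_region(row_key):
    """Return the index of the key range (region) that contains row_key."""
    return bisect.bisect_right(SPLIT_KEYS, row_key)


def locate_server(row_key):
    """Return the (hypothetical) server responsible for row_key."""
    return REGION_SERVERS[locate_region(row_key)]


if __name__ == "__main__":
    for key in ["apple", "hadoop", "zebra"]:
        print(key, "->", locate_server(key))
```

Because the ranges are contiguous and sorted, rows with neighbouring keys land on the same server, which is what makes scans over a key range practical.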

Page 11

NoSQL Paradigm: MongoDB [7]

> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({Presentation: "NoSQL databases"})
> db.test.find()
{ "_id" : ObjectId("4fbac827f119ef630e74638d"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac835f119ef630e74638e"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac85df119ef630e74638f"), "Presentation" : "NoSQL databases" }
>

MongoDB is an easy-to-use NoSQL database; it is free and supported by a large community. It is suitable for those with no previous NoSQL experience.

NoSQL, JSON, Schema free
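
The shell session above shows two schema-free behaviours: documents with different fields share one collection, and two documents with the same user-supplied `id` coexist because each receives its own generated `_id`. The following Python sketch imitates this with an in-memory list. The `Collection` class is a hypothetical stand-in, not the MongoDB driver, and it uses a simple counter where MongoDB generates `ObjectId` values.

```python
import itertools


class Collection:
    """Hypothetical in-memory stand-in for a schema-free collection."""

    def __init__(self):
        self._docs = []
        self._next_id = itertools.count(1)

    def insert(self, doc):
        """Store a copy of doc with a generated "_id" and return that id."""
        stored = {"_id": next(self._next_id), **doc}
        self._docs.append(stored)
        return stored["_id"]

    def find(self, query=None):
        """Return documents whose fields match every key/value in query."""
        query = query or {}
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in query.items())]


if __name__ == "__main__":
    test = Collection()
    test.insert({"id": "123", "name": "Vasileios"})
    test.insert({"id": "123", "name": "Vasileios"})
    test.insert({"Presentation": "NoSQL databases"})
    for doc in test.find():
        print(doc)
```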

Page 12

System Architecture

Components, as listed in the diagram: Datasets, Uploader (MapReduce task), Content table, Indexer (MapReduce task), Index table, Client API (Search, Get).

The cluster consists of 1 master and 11 worker nodes, with 66 mappers and 22 reducers.

The dataset is composed of 23 GB of structured data, 300 GB of semi-structured data and 20 GB of unstructured data.
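
The step from content table to index table can be sketched as an inverted-index build. The Python below is a single-process approximation in which dictionaries stand in for the two tables. The names `build_index`, `search` and `get`, the whitespace tokenisation, the sample rows, and the assumption that search reads the index table while get reads the content table are all illustrative and not taken from the System's actual implementation.

```python
from collections import defaultdict


def build_index(content_table):
    """Indexer step: emit (term, row_key) pairs per row, then group by term.

    content_table is a dict {row_key: text}. The result is a dict
    {term: sorted list of row_keys}, playing the role of the index table.
    """
    postings = defaultdict(set)
    for row_key, text in content_table.items():   # "map" over content rows
        for term in text.lower().split():
            postings[term].add(row_key)
    # "reduce": one index row per term, listing the rows that contain it
    return {term: sorted(keys) for term, keys in postings.items()}


def search(index_table, term):
    """Client search: look up matching row keys in the index table."""
    return index_table.get(term.lower(), [])


def get(content_table, row_key):
    """Client get: fetch the stored content for one row key."""
    return content_table[row_key]


if __name__ == "__main__":
    content = {"row1": "cloud indexing", "row2": "web scale indexing"}
    index = build_index(content)
    for key in search(index, "indexing"):
        print(key, "->", get(content, key))
```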

Page 13

The experiment

The purpose was to test the System’s performance under various conditions, such as:

• several dataset sizes,

• different dataset types,

• varying number of nodes,

• different index rules.

Page 14

Page 15

Page 16

Page 17

Index creation time

The TXT dataset demands the most processing when indexed.

Page 18

Page 19

5GB HTML dataset index creation time for different index rules

[Figure: time in minutes (y-axis, 0 to 12) per iteration number (x-axis, 1 to 4). Iterations: 1) 7 indexed tags, 2) 14, 3) 19, 4) 27.]

Page 20

5GB HTML index size for different index rules

[Figure: index size in GB (y-axis, 0 to 1.4) per iteration number (x-axis, 1 to 4). Iterations: 1) 7 tags indexed (table, li, p, b, i, u, title), 2) 14 tags, 3) 19, 4) 27.]
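
An index rule here selects which HTML tags have their text indexed. The following is a minimal sketch of that idea, assuming a rule is simply a set of tag names and using Python's standard `html.parser` module. The `TagTextExtractor` class, the `extract` helper and the sample markup are hypothetical; the System's real rule format is not shown in the slides.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect (tag, text) pairs for text found inside the indexed tags."""

    def __init__(self, indexed_tags):
        super().__init__()
        self.indexed_tags = set(indexed_tags)
        self.open_tags = []    # stack of currently open indexed tags
        self.extracted = []    # (tag, text) pairs selected for indexing

    def handle_starttag(self, tag, attrs):
        if tag in self.indexed_tags:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            # Attribute the text to the innermost open indexed tag.
            self.extracted.append((self.open_tags[-1], text))


def extract(html, indexed_tags):
    """Return the (tag, text) pairs that the given rule would index."""
    parser = TagTextExtractor(indexed_tags)
    parser.feed(html)
    parser.close()
    return parser.extracted


if __name__ == "__main__":
    page = "<html><head><title>Cloud</title></head><body><p>Hello</p></body></html>"
    print(extract(page, {"title"}))
    print(extract(page, {"title", "p"}))
```

Adding tags to the rule makes the extractor select more text from the same page, which is the knob the four iterations vary.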

Page 21

System performance under query load

• Client instances were run concurrently on 14 machines, sending queries to the System.

• Types of queries: exact match on a specific attribute, exact match on any attribute, range on any attribute.

• Range query loads above 140 queries/sec failed.

• Tests were run with a load of 14 queries/sec.

Response time per request:

• Exact queries on a specific attribute: 20 ms.

• Exact queries on any attribute: 150 ms.

• Range queries on any attribute: 27 s.
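
The three query types can be read as follows, assuming the index holds (attribute, value, row key) entries. The Python sketch below illustrates only their semantics over a small in-memory list. The record layout, sample entries and function names are hypothetical, and the sketch says nothing about how the System executes these queries or why their response times differ.

```python
# Hypothetical index entries: (attribute, value, row_key).
INDEX = [
    ("author", "smith", "row1"),
    ("title", "cloud", "row1"),
    ("title", "smith", "row2"),
    ("year", "2009", "row2"),
    ("year", "2010", "row1"),
]


def exact_specific(index, attribute, value):
    """Exact match on one named attribute."""
    return sorted({k for a, v, k in index if a == attribute and v == value})


def exact_any(index, value):
    """Exact match on the value, whichever attribute holds it."""
    return sorted({k for a, v, k in index if v == value})


def range_any(index, low, high):
    """Values in the inclusive range [low, high], over any attribute."""
    return sorted({k for a, v, k in index if low <= v <= high})


if __name__ == "__main__":
    print(exact_specific(INDEX, "author", "smith"))
    print(exact_any(INDEX, "smith"))
    print(range_any(INDEX, "2009", "2010"))
```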

Page 22

References

[1] Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos and Nectarios Koziris: Distributed Indexing of Web Scale Datasets for the Cloud. In MDAC ’10, April 26, 2010, Raleigh, NC, USA.

[2] http://hadoop.apache.org/

[3] K. V. Shvachko: HDFS Scalability: The limits to growth. The USENIX Magazine, v35 i2, 2010, usenix.org.

[4] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.

[5] Ankur Khetrapal, Vinay Ganesh: HBase and Hypertable for large scale distributed storage systems, Dept. of Computer Science, Purdue University.

[6] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2008. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages.

[7] http://www.mongodb.org