A discussion on “Distributed Indexing of Web Scale Datasets for the Cloud” [1]
Speakers: Vasileios Komianos,
Georgios Tsoumanis,
Eleni Moustaka
Supervisor: Spyridon Sioutas
Ionian University, Dept. of Informatics, Postgraduate
For the course: Advanced Topics in Database Systems
The focus of this presentation is a distributed architecture for indexing large datasets, from now on called the System. Hadoop, MapReduce, HBase and NoSQL databases are terms used often in this presentation, as they are the keystone technologies enabling such tasks.
Why Cloud?
• Cost
• Device and Location Independence
• Virtualization
• Performance
• Scalability
• Infrastructure as a Service
• Platform as a Service
• Software as a Service
Why Web Scale?
• Wikipedia
• Amazon
• Internet Archive
Why Distributed?
• Huge volumes of data
• Computational problems
• Failure tolerance
• Scalability
What Hadoop[2] is
It is an open-source Java framework capable of distributed processing of large data sets, using a distributed file system called HDFS [3] and the MapReduce [4] programming model.
Hadoop Architecture
Hadoop
• HDFS: NameNode, DataNodes
• MapReduce: JobTracker, TaskTrackers
Usually the NameNode is also the JobTracker, and the DataNodes are also TaskTrackers.
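The MapReduce model named above can be illustrated with a minimal, single-process sketch. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and the word-count job is only the customary teaching example, assumed here for illustration. In a real cluster the map and reduce calls would be spread over the TaskTrackers.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit one (word, 1) pair per word of a single input record."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group mapper output by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: collapse all values of one key into a single result."""
    return word, sum(counts)


def run_job(records):
    """Run map, shuffle and reduce in sequence over {doc_id: text}."""
    mapped = [
        pair
        for doc_id, text in records.items()
        for pair in map_phase(doc_id, text)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())
```

Each mapper sees only one record and each reducer only one key, which is what lets the framework run them on different nodes.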
What HBase[5] is
An open-source distributed data store belonging to the well-known category of NoSQL databases. HBase can store large data sets that are structured, semi-structured or unstructured, and it also offers rapid query execution.
HBase Architecture
HBase: HMaster, Region Servers
HBase runs on top of Hadoop and is modelled after Google’s Bigtable [6]. ACIDity* is sacrificed to improve performance and scalability.
*ACID: Atomicity, Consistency, Isolation and Durability
HBase characteristics
• NoSQL
• Schema free
• Very large tables
• Scalable
• Sharding
• JSON enabled
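The "sharding" and "schema free" items can be made concrete with a toy sketch. HBase splits a table into row-key ranges (regions) served by Region Servers; the class below imitates only that routing idea in memory. `ShardedTable`, its methods, the split keys and the server names are all hypothetical and are not part of the HBase API.

```python
import bisect


class ShardedTable:
    """Toy table split into row-key ranges ("regions"), one server per region."""

    def __init__(self, split_keys, servers):
        # N split keys define N + 1 regions; each region starts at its split key.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = sorted(split_keys)
        self.servers = list(servers)
        self.regions = [dict() for _ in self.servers]

    def region_for(self, row_key):
        """Return the index of the region whose key range contains row_key."""
        return bisect.bisect_right(self.split_keys, row_key)

    def server_for(self, row_key):
        return self.servers[self.region_for(row_key)]

    def put(self, row_key, column, value):
        # Schema free: any row may carry any set of columns.
        region = self.regions[self.region_for(row_key)]
        region.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        return self.regions[self.region_for(row_key)].get(row_key)
```

Because rows are routed by key range, adding capacity amounts to splitting a range and assigning the new region to another server.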
NoSQL Paradigm: MongoDB [7]
> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({Presentation: "NoSQL databases"})
> db.test.find()
{ "_id" : ObjectId("4fbac827f119ef630e74638d"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac835f119ef630e74638e"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac85df119ef630e74638f"), "Presentation" : "NoSQL databases" }
>
MongoDB is an easy-to-use NoSQL database. It is free and supported by a large community, which makes it suitable for users with no previous NoSQL experience.
• NoSQL
• JSON
• Schema free
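What "schema free" means in the shell session above can be mimicked in a few lines of Python. This is a sketch, not the MongoDB driver: `Collection` is a hypothetical in-memory class, and a running integer stands in for MongoDB's generated `ObjectId`.

```python
import itertools


class Collection:
    """In-memory stand-in for a schema-free document collection."""

    def __init__(self):
        self._ids = itertools.count(1)  # stand-in for generated ObjectId values
        self._docs = []

    def insert(self, doc):
        """Store any dict as-is; no schema is declared or checked."""
        stored = {"_id": next(self._ids)}
        stored.update(doc)
        self._docs.append(stored)
        return stored["_id"]

    def find(self, query=None):
        """Return documents whose fields equal every key/value in query."""
        query = query or {}
        return [
            doc
            for doc in self._docs
            if all(key in doc and doc[key] == value for key, value in query.items())
        ]
```

As in the session, two documents with identical fields are kept apart by their generated `_id`, and a document with entirely different fields lands in the same collection.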
System Architecture
Datasets → Uploader (MapReduce task) → Content table → Indexer (MapReduce task) → Index table
Client API: Search, Get
The cluster consists of 1 master and 11 worker nodes, running 66 Mappers and 22 Reducers.
The dataset is composed of 23GB of structured data, 300GB of semi-structured data and 20GB of unstructured data.
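The data flow of the architecture above (content table, indexer, index table, client API) can be sketched in memory. This is only an illustration under assumptions: the slides do not describe the table layouts, so the inverted index keyed by (attribute, value) and the function names `upload`, `build_index`, `search` and `get` are hypothetical, and plain dicts stand in for the HBase tables.

```python
from collections import defaultdict


def upload(records):
    """Uploader step: store raw records in a content table keyed by row id."""
    return {row_id: dict(attrs) for row_id, attrs in records}


def build_index(content_table):
    """Indexer step: map every (attribute, value) pair to the rows holding it."""
    index = defaultdict(set)
    for row_id, attrs in content_table.items():
        # "Map": emit ((attribute, value), row_id) for each cell.
        for attr, value in attrs.items():
            # "Reduce": collect all row ids that share the same key.
            index[(attr, value)].add(row_id)
    return dict(index)


def search(index_table, attr, value):
    """Client API: return the ids of rows whose attribute equals value."""
    return sorted(index_table.get((attr, value), set()))


def get(content_table, row_id):
    """Client API: fetch the full record for a row id found by search."""
    return content_table.get(row_id)
```

A client would typically call `search` against the index table first and then `get` each returned row id from the content table.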
The experiment
The purpose was to test the System’s performance under various conditions, such as:
• several dataset sizes,
• different dataset types,
• a varying number of nodes,
• different index rules.
Index creation time
The TXT dataset is the most demanding to process during indexing.
[Figure: 5GB HTML dataset index creation time for different index rules. X-axis: iteration number (1: 7 indexed tags, 2: 14, 3: 19, 4: 27). Y-axis: time (min), 0 to 12.]
[Figure: 5GB HTML index size for different index rules. X-axis: iteration number (1: 7 tags indexed (table, li, p, b, i, u, title), 2: 14 tags, 3: 19, 4: 27). Y-axis: index size (GB), 0 to 1.4.]
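An index rule, as used in the two HTML experiments above, selects which tags contribute text to the index. The sketch below shows one possible way to apply such a rule with Python's standard `html.parser`. The class `TagTextExtractor` and the helper `extract_terms` are hypothetical, and the tokenisation (lower-casing, whitespace split) is an assumption, not the System's actual behaviour.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect (tag, word) pairs for text found inside the indexed tags."""

    def __init__(self, indexed_tags):
        super().__init__()
        self.indexed_tags = set(indexed_tags)
        self.open_tags = []  # stack of currently open tags
        self.terms = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching tag; tolerates unclosed tags such as <br>.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Text counts for every indexed tag that currently encloses it.
        for tag in sorted(self.indexed_tags.intersection(self.open_tags)):
            for word in data.split():
                self.terms.append((tag, word.lower()))


def extract_terms(html, indexed_tags):
    """Apply an index rule (a collection of tag names) to one HTML document."""
    parser = TagTextExtractor(indexed_tags)
    parser.feed(html)
    parser.close()
    return parser.terms
```

In this sketch, adding tags to the rule can only add entries to the result, never remove them, so a wider rule yields at least as many index terms as a narrower one.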
System performance under query load
• Client instances ran concurrently on 14 machines, sending queries to the System.
• Types of queries: exact match on a specific attribute, exact match on any attribute, and range query on any attribute.
• Range query loads above 140 queries/sec failed.
• Tests were run with a load of 14 queries/sec.
Response time per request:
• Exact queries on a specific attribute: 20 ms.
• Exact queries on any attribute: 150 ms.
• Range queries on any attribute: 27 secs.
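The three query types can be contrasted with a small sketch over an in-memory index keyed by (attribute, value). This layout is an assumption made for illustration and is not taken from the paper; the function names are hypothetical. It only shows why the amount of work may grow from one lookup, to one lookup per attribute, to a scan over many index entries.

```python
def exact_specific(index, attr, value):
    """Exact match on one named attribute: a single index lookup."""
    return sorted(index.get((attr, value), ()))


def exact_any(index, attrs, value):
    """Exact match on any attribute: one lookup per known attribute."""
    hits = set()
    for attr in attrs:
        hits.update(index.get((attr, value), ()))
    return sorted(hits)


def range_any(index, low, high):
    """Range on any attribute: scan every entry whose value is in [low, high]."""
    hits = set()
    for (attr, value), rows in index.items():
        if low <= value <= high:
            hits.update(rows)
    return sorted(hits)


# Hypothetical sample index: (attribute, value) -> set of row ids.
sample_index = {
    ("title", "hadoop"): {"r1"},
    ("body", "hadoop"): {"r2"},
    ("title", "hbase"): {"r3"},
    ("body", "mongodb"): {"r1"},
}
```

With `sample_index`, `exact_specific(sample_index, "title", "hadoop")` touches one entry, while `range_any(sample_index, "h", "i")` has to inspect all four.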
References
[1] Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos and Nectarios Koziris: Distributed Indexing of Web Scale Datasets for the Cloud. In MDAC ’10, April 26, 2010 Raleigh, NC, USA.
[2] http://hadoop.apache.org/
[3] KV Shvachko: HDFS Scalability: The limits to growth. The USENIX Magazine, v35 i2, 2010, usenix.org.
[4] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.
[5] Ankur Khetrapal, Vinay Ganesh: HBase and Hypertable for large scale distributed storage systems, Dept. of Computer Science, Purdue University
[6] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2008. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages.
[7] http://www.mongodb.org