Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by...

Apache Hadoop India Summit 2011

Searching Information Inside Hadoop Platform

Abinasha Karana, Director-Technology

Bizosys Technologies Pvt Ltd.

abinash@bizosys.com

www.bizosys.com

To search a large dataset inside HDFS and HBase, we at Bizosys started with Map-Reduce and Lucene/Solr.

What didn’t work for us: Map-Reduce

Results did not arrive within a mouse click.

What didn’t work for us: Lucene/Solr

It required vertical scaling, with manual sharding and subsequent resharding as data grew.

What we did

We built a new search engine for the Hadoop platform (HDFS and HBase).

In the next few slides you will hear about my learning from designing, developing and benchmarking a distributed, real-time search engine whose index is stored and served out of HBase.

Key Learning

1. Using SSD is a design decision.

2. Methods to reduce HBase table storage size

3. Serving a request without accessing ALL region servers

4. Methods to move processing near the data

5. Byte block caching to lower network and I/O trips to HBase

6. Configuration to balance network vs. CPU vs. I/O vs. memory

1 Using SSD is a design decision

SSD improved HSearch response time by 66% over SATA. However, SSD is costlier.

In HBase Table Schema Design we considered “Data Access Frequency”, “Data Size” and “Desired Response Time” for selective SSD deployment.

.. Our SSD Friendly Schema Design

Access pattern:
Keyword: Reads all for a query.
Document: Reads 10 docs / query.

Keyword + Document in 1 Table: SSD deployment is all or none.
Keyword + Document in 2 Tables: SSD deployment is only for the Keyword table.

2 Methods to reduce HBase table storage size

Storing a 4-byte cell requires more than 27 bytes in HBase.

[Figure: layout of one stored cell, field by field]
Key length: 4 bytes
Value length: 4 bytes
Row length: 2 bytes
Row bytes: variable
Family length: 1 byte
Family bytes: variable
Qualifier bytes: variable
Timestamp: 8 bytes
Key type: 1 byte
Value bytes: variable (4 bytes in this example)

.. to 1/3rd

• Stored large cell values by merging cells
• Reduced the Family name to 1 character
• Reduced the Qualifier name to 1 character
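
The arithmetic behind these three measures can be sketched as follows. This is a minimal illustration, assuming the standard HBase KeyValue layout with 20 bytes of fixed fields per cell (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type). The class name, method names and the sample row, family and qualifier lengths are hypothetical and not taken from HSearch; compression and HFile block overhead are ignored.

```java
/** Back-of-the-envelope sizing of HBase cells (hypothetical helper, not HSearch code). */
class KeyValueSizing {
    // Fixed fields per cell: key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1) = 20 bytes.
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    /** Serialized size in bytes of one cell with the given component lengths. */
    static long cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return (long) FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    /** Size of storing count values of valueLen bytes as count separate cells. */
    static long separateCells(int count, int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return count * cellSize(rowLen, familyLen, qualifierLen, valueLen);
    }

    /** Size of storing the same values merged into the value of a single cell. */
    static long mergedCell(int count, int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return cellSize(rowLen, familyLen, qualifierLen, count * valueLen);
    }

    public static void main(String[] args) {
        // One 4-byte value with a 1-byte row key, family and qualifier.
        System.out.println(cellSize(1, 1, 1, 4));
        // 1000 4-byte values, 8-byte row key, 6-character family and qualifier, one cell each.
        System.out.println(separateCells(1000, 8, 6, 6, 4));
        // The same values merged into one cell with 1-character family and qualifier names.
        System.out.println(mergedCell(1000, 8, 1, 1, 4));
    }
}
```

The sample lengths are arbitrary, so the printed figures only show the direction of the saving; they do not reproduce the 1/3rd result reported on the slide.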

3 Serving a request without accessing ALL region servers

Consider a 100-node HBase cluster in which a single search request needs to access every node.

Bad Design.. Clogged Network.. No scaling

3 And our solution…

[Figure: a single index table with Family A and Family B has row ranges 0-1 M, 1-2 M, 2-3 M, 3-4 M and 4-5 M spread across Machines 1 to 5, so a scan of “Family A” hits 5 machines. With the families stored as Table A (Machines 1 to 3) and Table B (Machines 3 to 5), a scan of Table A hits 3 machines.]

The index table was divided on column family into separate tables.
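
The effect on scan fan-out can be illustrated with a toy model. This is only a sketch: the machine numbers mirror the slide's diagram, while the class and method names are hypothetical and no HBase API is involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of scan fan-out (hypothetical; machine numbers follow the slide's diagram). */
class ScanFanOut {
    /** Number of distinct machines a full scan contacts, given the machine hosting each region. */
    static int machinesHit(List<Integer> regionHosts) {
        Set<Integer> distinct = new TreeSet<>(regionHosts);
        return distinct.size();
    }

    public static void main(String[] args) {
        // One index table holding both families: its five row ranges sit on machines 1 to 5.
        List<Integer> combinedTable = Arrays.asList(1, 2, 3, 4, 5);
        // After the split, Table A occupies machines 1 to 3 and Table B machines 3 to 5.
        List<Integer> tableA = Arrays.asList(1, 2, 3);
        List<Integer> tableB = Arrays.asList(3, 4, 5);

        System.out.println("Scan Family A in combined table: " + machinesHit(combinedTable));
        System.out.println("Scan Table A: " + machinesHit(tableA));
        System.out.println("Scan Table B: " + machinesHit(tableB));
    }
}
```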

4 Methods to move processing near the data

• Sent filtered Rows over network.
E.g. Matched rows for a keyword

    public class TermFilter implements Filter {
        // Keep a cell only when it matches; otherwise move on to the next row.
        public ReturnCode filterKeyValue(KeyValue kv) {
            boolean isMatched = isFound(kv);
            if (isMatched) return ReturnCode.INCLUDE;
            return ReturnCode.NEXT_ROW;
        }
        …
    }

• Sent relevant Fields of a Row over network.

• Sent relevant section of a Field over network.
E.g. Computing a best match section from within a document for a given query

    public class DocFilter implements Filter {
        // Replace the row's cells with only the piece the client needs.
        public void filterRow(List<KeyValue> kvL) {
            byte[] val = extractNeededPiece(kvL);
            kvL.clear();
            kvL.add(new KeyValue(row, fam, , val));
        }
        …
    }

5 Byte block caching to lower network and I/O trips to HBase

• Object caching: With a growing number of objects we encountered ‘Out of Memory’ exceptions.
• HBase commit: Frequent flushing to HBase introduced network and I/O latencies.

Converting objects to intermediate byte blocks increased record processing by 20x in 1 batch.
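
The idea of replacing per-record objects with one intermediate byte block can be sketched as follows. This is a minimal sketch, assuming a hypothetical fixed-width record of a 4-byte document id and a 2-byte weight; the class, its methods and the record layout are illustrative and are not HSearch's actual format.

```java
import java.nio.ByteBuffer;

/** Packs fixed-width records into one byte[] instead of one Java object per record (hypothetical layout). */
class ByteBlock {
    static final int RECORD_SIZE = 4 + 2; // 4-byte document id + 2-byte weight

    /** Encodes parallel arrays of ids and weights into a single byte block. */
    static byte[] pack(int[] docIds, short[] weights) {
        if (docIds.length != weights.length) {
            throw new IllegalArgumentException("docIds and weights must have the same length");
        }
        ByteBuffer block = ByteBuffer.allocate(docIds.length * RECORD_SIZE);
        for (int i = 0; i < docIds.length; i++) {
            block.putInt(docIds[i]).putShort(weights[i]);
        }
        return block.array();
    }

    /** Number of records held in a block. */
    static int count(byte[] block) {
        return block.length / RECORD_SIZE;
    }

    /** Reads the document id of record i directly from the block. */
    static int docIdAt(byte[] block, int i) {
        return ByteBuffer.wrap(block).getInt(i * RECORD_SIZE);
    }

    /** Reads the weight of record i directly from the block. */
    static short weightAt(byte[] block, int i) {
        return ByteBuffer.wrap(block).getShort(i * RECORD_SIZE + 4);
    }

    public static void main(String[] args) {
        byte[] block = pack(new int[] {101, 205, 309}, new short[] {7, 3, 9});
        // The whole block is one array, so it can be cached or written as a single value.
        System.out.println(count(block) + " records in " + block.length + " bytes");
        System.out.println("doc " + docIdAt(block, 1) + " weight " + weightAt(block, 1));
    }
}
```

A block like this costs one array header for all of its records, and it can be handed to the store in one write rather than one commit per object.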

6 Configuration to balance Network vs. CPU vs. I/O vs. Memory

[Figure: in a single machine, Disk I/O, Network, Memory and CPU are balanced against one another using Block Caching, IPC Caching, Compression and Aggressive GC.]

… and its settings

• Network: Increased IPC cache limits (hbase.client.scanner.caching)
• CPU: JVM aggressive heap ("-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+AggressiveHeap")
• I/O: LZO index compression (“Inbuilt oberhumer LZO” or “Intel IPP native LZO”)
• Memory: HBase block caching (hfile.block.cache.size) and overall memory allocation for data-node and region-server

.. and parallelized to multi-machines

• HTable.batch (Get, Put, Deletes)
• ParallelHTable (Scans)
• FUTURE: coprocessors (HBase 0.92 release)

Allocating appropriate resources: dfs.datanode.max.xcievers, hbase.regionserver.handler.count and dfs.datanode.handler.count

HSearch Benchmarks on AWS

• Amazon Large instance, 7.5 GB memory * 11 machines, with a single 7.5K SATA drive
• 100 million Wikipedia pages totalling 270 GB, completely indexed (including common stopwords)
• 10 million pages repeated 10 times (total indexing time is 5 hours)
• Search query response speed using a
  – regular word is 1.5 sec
  – common word such as “hill” found 1.6 million matches and sorted them in 7 seconds

www.sourceforge.net/bizosyshsearch

• Apache-licensed (2 versions released)

• Distributed, real-time search

• Supports XML documents with rich search syntax and various filtration criteria such as document type, field type.

References

• Initial performance reports: Bizosys HSearch, a NoSQL search engine, featured in Intel Cloud Builders Success Stories (July, 2010)
• HSearch is currently in use at http://www.10screens.com
• More on HSearch
  – http://www.bizosys.com/blog/
  – http://bizosyshsearch.sourceforge.net/
• More on SSD product technical specifications: http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf
