12
Performance Evaluation and Optimization of Hybrid Indexes in Hive 1 G.Sravanthi, 2 C.V.S.Satyamurty 1,2 Computer Science of Engineering, Information Technology 1,2 CVR College of Engineering, Hyderabad, India E-mail:[email protected], E-mail: [email protected] Abstract: Reducing risks & reduced costs for the business, Big data technologies are providing important and accurate analysis, which leads to concrete decision-making results in grater operational efficiencies. An infrastructure would be required to obtain all the features of bid data, for managing and processing huge volumes of indexed and non-indexed data in real-time and to protect data privacy and security. Many Vendors like Amazon, IBDM, Microsoft etc products in these technologies are available to handle big data are categorized into two Operational Big Data and Analytical Big Data. Hive a data warehouse tool for easy querying and analyzing processes structured data in Hadoop which resides on top of Hadoop to recapitulate Big Data. Hive supports Compact and Bitmap indexing, to improve the query performance. These indexes are used individually on Hive; we have implemented both the indexes on the same partition in Hive, and got good performance optimization. In this paper text data is used for analyzing the Simultaneous combo Indexes (Compact & Bitmap) on the partition had resulted in good performance Optimization. Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure, Big Data, HiveQL. 1. INTRODUCTION Enterprise Organizations are using data to avail better decisions making to enhance their performances. The decade of 21 st century showcased the fast changes in the availability of data and its efficacy of the Organization. This has also revolutionize the usage of data and brought the concept of Big Data [1] Big Data Availability of huge amount of data where storing, processing and mining or extracting is a difficult using traditional storage [2]. These concepts have been the reasons embraced Online Firms (Google, eBay, Facebook, LinkedIn etc). Big Data for Small and Big Companies There are many reasons that Why Big Data was first accepted by Online Firms. These Enterprise and startups go around the concept of using rapidly challenging data and facing the issues of integrating unstructured data with the existing techniques [3]. The following are the issues or challenges regarding Big-Data are as follows. Volume: Huge amount of data available made a challenge as it can be neither possible nor efficient to mange using traditional databases. International Journal of Research Volume 7, Issue IX, September/2018 ISSN NO:2236-6124 Page No:160

Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

Performance Evaluation and Optimization of Hybrid Indexes in Hive

1 G.Sravanthi, 2 C.V.S.Satyamurty

1,2 Computer Science of Engineering, Information Technology

1,2 CVR College of Engineering, Hyderabad, India

E-mail:[email protected], E-mail: [email protected]

Abstract: Reducing risks & reduced costs for the business, Big data technologies are providing important

and accurate analysis, which leads to concrete decision-making results in grater operational efficiencies. An

infrastructure would be required to obtain all the features of bid data, for managing and processing huge

volumes of indexed and non-indexed data in real-time and to protect data privacy and security. Many Vendors

like Amazon, IBDM, Microsoft etc products in these technologies are available to handle big data are

categorized into two Operational Big Data and Analytical Big Data. Hive a data warehouse tool for easy

querying and analyzing processes structured data in Hadoop which resides on top of Hadoop to recapitulate Big

Data. Hive supports Compact and Bitmap indexing, to improve the query performance. These indexes are used

individually on Hive; we have implemented both the indexes on the same partition in Hive, and got good

performance optimization. In this paper text data is used for analyzing the Simultaneous combo Indexes

(Compact & Bitmap) on the partition had resulted in good performance Optimization.

Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure, Big Data, HiveQL.

1. INTRODUCTION

Enterprise Organizations are using data to avail better decisions making to enhance their performances. The

decade of 21st century showcased the fast changes in the availability of data and its efficacy of the Organization.

This has also revolutionize the usage of data and brought the concept of Big Data [1]

Big Data

Availability of huge amount of data where storing, processing and mining or extracting is a difficult

using traditional storage [2]. These concepts have been the reasons embraced Online Firms (Google, eBay,

Facebook, LinkedIn etc).

Big Data for Small and Big Companies

There are many reasons that Why Big Data was first accepted by Online Firms. These Enterprise and

startups go around the concept of using rapidly challenging data and facing the issues of integrating unstructured

data with the existing techniques [3]. The following are the issues or challenges regarding Big-Data are as

follows.

Volume: Huge amount of data available made a challenge as it can be neither possible nor efficient to mange

using traditional databases.

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:160

Page 2: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

Variety: In comparison to earlier text and table format, current available versions are in the form of pictures,

videos and tweets etc.

Velocity: Increased usage of Online Space and that the data was available was rapidly changing and have to be

made available instantly and effectively [4].

Open Source Tool for Big Data Analytics

A distributed software solution Hadoop is a scalable fault tolerance for data storage, process, and

extracting are the main uses.

i. HDFS (which is storage)

ii. Map Reduce (Fig. 1)

Fig1: Hadoop HDFS, Map Reduce Logical View

Fig2: The Hadoop technology stack

2. Related Word

Hive

Data warehouse for OLAP workloads to handle and query is the Hive over vast volume of data in

distributed storage. The Hadoop Distributed File System (HDFS)[5] is the ecosystem which it maintains huge

volumes of data reliably from hardware faults [6].

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:161

Page 3: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

Fig3: Layered Architecture of Big Data System

The SQL-like relational data warehousing approach Hive is developed on top of Hadoop and a massive

computation at a bandwidth in Hadoop enabling data streams called MapReduce is a high-level programming

model.

HiveQL

It is a SQL-like query language, is used to query over Hive. Hive does not support SQL Commands

(Update, Delete, and Insert at Row-Level) and transactions also. Hadoop Client Hive is to function with rare

update and batch-mode data insertions are the characteristics of the Hive as a underlying system. Hive an OLAP

data For Online Transaction Processing (OLTP) features with big data, one should consider a NoSQL database

[7].

Data in Hive

Hive Data Types

Hive common relational databases data types as well as the three collection types: STRUCT, MAP, and

ARRAY.

Hive File Formats

Hive can process data coming from different data sources using Extract, Transform and Load (ETL)

tools for reading data from different file format and their customization.

Index Construction in Hive

CREATE INDEX index_name

ON TABLE base_table_name (col_name, ...)

AS 'index.handler.class.name'

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:162

Page 4: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

WITH DEFERRED REBUILD

[IDXPROPERTIES (property_name=property_value, ...)]

[IN TABLE index_table_name]

[PARTITIONED BY (col_name, ...)]

[[ROW FORMAT ...] STORED AS ...| STORED BY ...]

[LOCATION hdfs_path]

[TBLPROPERTIES (...)]

[COMMENT "index comment"]

There are some parameters and keywords used in the command listing:

1. index_name: The index name given by the user used to access the index by the user itself. This name is used to

refer to the index in ALTER INDEX and DROP INDEX commands.

2. base_table_name (col_name,...): The table over which the index is to be created using the desired columns

listed.

3. 'index.handler.class.name': this specifies the type of the index, which could be bitmap or compact values.

4. index_table_name: The index name given by Hive (the default name) or the user, used to access the index as a

table by the user or Hive. For example the user can see the content of an index using this parameter. Any other

behavior of the index as a table can be addressed using this name.

5. WITH DEFERRED REBUILD: This means the index starts empty. Index will be populated using:

ALTER INDEX index_name ON table_name [PARTITION (...)] REBUILD

6. IDXPROPERTIES: This gives the properties of the index. For example: 'index_creator' = 'Mahsa' can be

considered a property for an index.

7. IN TABLE index_table_name: This is used when the user wishes to build an index in a different table from

an already built index (if any), to keep them separate.

Hive Architecture

Fig4: Apache Hive Architecture

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:163

Page 5: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

Major components of the Apache Hive are:

1. Metastore

2. Driver

3. Compiler

4. Optimizer

5. Executor CLI, UI, and Thrift Server

Hive supports three kinds of indexes, Compact Index, Aggregate Index, and Bitmap index. These are

tables in Hive containing every index dimension and array of location of associated Index. Number of index

dimensions increases table becomes very huge which occupies large amount of disk space.

The aim of creating INDEX is to improve the performance of data retrieval. For Instance, let us say you

are querying with filter condition WHERE column_name = 100, without index will load entire table or partition

to process records and with index on column_name would load part of file to process the records.

Compact Index on Hive Table

Pair Indexed Column’s value and its bolckid of Index Column are stored in Compact indexing.

CREATE COMPACT INDEX:

CREATE INDEX index_name

ON TABLE base_table_name (col_name, ...)

AS index_type

[WITH DEFERRED REBUILD]

[IDXPROPERTIES (property_name=property_value, ...)]

[IN TABLE index_table_name]

[ [ ROW FORMAT ...] STORED AS ...

| STORED BY ... ]

[LOCATION hdfs_path]

[TBLPROPERTIES (...)]

[COMMENT "index comment"];

Example

hive> CREATE INDEX index_students ON TABLE students(id)

>AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'

> WITH DEFERRED REBUILD;

OK

Time taken: 0.493 seconds

ALTER COMPACT INDEX

ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD;

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:164

Page 6: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

Example

hive> ALTER INDEX index_students ON students REBUILD;

Query ID = cloudera_20171111093030_6a37b92b-bae8-4fd1-91bb-d13e9d411513

Total jobs = 1

...

Total MapReduce CPU Time Spent: 4 seconds 180 msec

OK

DROP COMPACT INDEX

DROP INDEX [IF EXISTS] index_name ON table_name;

Example

hive> DROP INDEX IF EXISTS index_students ON students;

OK

Time taken: 0.27 seconds

Bitmap Index on Hive Table

Combining indexed column value and list of rows are stored as bitmap in Bitmap indexing.

CREATE BITMAP INDEX:

CREATE INDEX olympic_index_bitmap

ON TABLE olympic (age)

AS 'BITMAP'

WITH DEFERRED REBUILD;

ALTER INDEX olympic_index_bitmap on olympic REBUILD;

Indexing Advantages

1. Indexes improve the query performance.

2. Multiple Indexes can be created on the same table.

3. Any type of indexing can be created on the data.

4. Depending on data Bitmap indexes are faster than Compact indexes and vice versa.

3. Preliminaries and Definitions

HDFS: Hadoop Distributed File System

YARN: Yet Another Resource Negotiator

DN: Data Node

NN: Name Node

SN: Secondary Name Node

RM: Resource Manager

NM: Node Manager

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:165

Page 7: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

CLI: Command Line Interface

UI: User Interface

OLTP: Online Transaction Processing

OLAP: Online Analytical Processing

4. System Model

In general the indexes (Compact Index and Bitmap Index) are applied as a separate entity for performance

optimization. In the proposed model both the indexes are applied together and tested on numerous data values

which resulted in more performance optimization.

Fig 5: applying both Indexes

5. Performance Analysis

Table1: Hive Query statistics without indexing (all the results are in seconds)

Rows/Tables Table 1 Table 2 Table 3 Table 4

50 0.628 0.1 0.085 0.167

100 0.14 0.17 0.087 0.17

150 0.137 0.07 0.0204 0.098

500 0.139 0.181 0.093 0.109

1000 0.251 0.074 0.159 0.108

5000 0.161 0.125 0.089 0.196

6000 0.143 0.225 0.089 0.117

10000 0.162 0.092 0.17 0.097

11000 0.145 0.104 0.085 0.07

11500 0.14 0.098 0.086 0.176

Hive

Compact Bitmap

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:166

Page 8: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

Chart1: Hive Query statistics without indexing (all the results are in seconds)

Table2: Hive Query statistics with Bitmap Indexing

Rows/Tables Table 1 Table 2 Table 3 Table 4

50 0.129 0.935 0.085 0.086

100 0.157 0.104 0.095 0.182

150 0.104 0.09 0.181 0.096

500 0.139 0.098 0.154 0.084

1000 0.158 0.097 0.196 0.089

5000 0.123 0.175 0.173 0.096

6000 0.113 0.175 0.081 0.124

10000 0.144 0.108 0.082 0.08

11000 0.169 0.192 0.191 0.089

11500 0.121 0.096 0.095 0.185

Chart2: Hive Query statistics with Bitmap Indexing

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:167

Page 9: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

Table3: Hive Query with Compact Indexing

Rows/Tables Table 1 Table 2 Table 3 Table 4

50 0.117 0.092 0.215 0.097

100 0.121 0.265 0.09 0.077

150 0.106 0.085 0.105 0.072

500 0.117 0.09 0.17 0.161

1000 0.101 0.161 0.081 0.082

5000 0.184 0.173 0.17 0.075

6000 0.135 0.144 0.099 0.085

10000 0.18 0.178 0.163 0.175

11000 0.164 0.11 0.11 0.068

11500 0.212 0.092 0.09 0.144

Chart3: Hive Query with Compact Indexing

Table4: Hive Query with Both Bitmap and Compact Indexing together

Rows/Tables Table 1 Table 2 Table 3 Table 4

50 0.098 0.101 0.097 0.098

100 0.134 0.092 0.099 0.173

150 0.112 0.082 0.107 0.071

500 0.114 0.17 0.091 0.083

1000 0.106 0.08 0.173 0.083

5000 0.094 0.084 0.083 0.079

6000 0.242 0.116 0.105 0.094

10000 0.198 0.1 0.077 0.107

11000 0.185 0.09 0.148 0.174

11500 0.104 0.185 0.202 0.099

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:168

Page 10: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

Note (Line Chart Legend):

Series 1: Table 1

Series 2: Table 2

Series 3: Table 3

Series 4: Table 4

X – Axis Row in 1000 Y-Axis Time in Seconds

Chart4: Hive Query with Both Bitmap and Compact Indexing together

Table5: Average Hive Query with no index, Bitmap, Compact and Both Bitmap and Compact Indexing

together

Rows/Index No Index Bitmap Compact Both

50 0.245 0.30875 0.13025 0.0985

100 0.14175 0.1345 0.13825 0.1245

150 0.08135 0.11775 0.092 0.093

500 0.1305 0.11875 0.1345 0.1145

1000 0.148 0.135 0.10625 0.1105

5000 0.14275 0.14175 0.1505 0.085

6000 0.1435 0.12325 0.11575 0.13925

10000 0.13025 0.1035 0.174 0.1205

11000 0.101 0.16025 0.113 0.14925

11500 0.125 0.12425 0.1345 0.1475

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:169

Page 11: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

Chat5: Average Hive Query with no index, Bitmap, Compact and Both Bitmap

and Compact Indexing together

Chart6: (Bar chart) Average Hive Query with no index, Bitmap, Compact and Both Bitmap and Compact

Indexing together

Note (Bar Chart Legend):

Series 1: No Index

Series 2: Bitmap Index

Series 3: Compact Index

Series 4: Both Index

X – Count of Values Y-Axis Time in Seconds

6. Conclusion and Future Scope

Hence in we conclude that Individually Compact and Bitmap indexes have a good performance in data

analysis. In our implementation it has proved that applying both the indexes in Hive Partition had given the best

performance in analyzing single dimensional data. Further these have to be analyzed on audio, video, and image

type data to be analyzed with multi dimensional index on apache spark environment.

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:170

Page 12: Performance Evaluation and Optimization of Hybrid Indexes ...ijrpublisher.com/gallery/19-september-482.pdf · Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure,

REFERENCES

[1] M. A. Beyer and D. Laney, “The importance of ‟big data‟: A definition,” Gartner, Tech. Rep., 2012..

[2] X. Wu, X. Zhu, G. Q. Wu, et al., “Data mining with big data,” IEEE Trans. on Knowledge and Data Engineering, vol.

26, no. 1, pp. 97-107, January 2014.Rajaraman and J. D. Ullman, “Mining of massive datasets,” Cambridge University Press,

2012.

[3] Z. Zheng, J. Zhu, M. R. Lyu. “Service-generated Big Data and Big Data-as-a-Service: An Overview,” in Proc. IEEE

BigData, pp. 403-410, October 2013. A . Bellogín, I. Cantador, F. Díez, et al., “An empirical comparison of social,

collaborative filtering, and hybrid recommenders,” ACM Trans. on Intelligent Systems and Technology, vol. 4, no. 1, pp. 1-37,

January 2013.

[4] W. Zeng, M. S. Shang, Q. M. Zhang, et al., “Can Dissimilar Users Contribute to Accuracy and Diversity of Personalized

Recommendation?,” International Journal of Modern Physics C, vol. 21, no. 10, pp. 1217-1227, June 2010.

[5] Apache Hadoop [Online]. Available: http://hadoop.apache.org/

[6] Chansler, R., Kuang, H., Radia, S., Shvachko, K. ”The Hadoop Distributed File System”, in Proc. IEEE Conf. Mass

Storage Systems and Technologies (MSST), Incline Village, NV, 2010, pp. 1 – 10

[7] Apache Hbase [Online]. Available: http://hbase.apache.org/

[8] A.Abouzeid, K.Bajda-Pawlikowski, D. Abadi, A. Silberschatz,and A. Rasin. Hadoopdb: an architectural hybrid of

mapreducen and dbms technologies for analytical workloads. Proceedings of the VLDB Endowment, 2(1):922–933, 2009.

[9] J. Dittrich, J.-A. Quian´e-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a yellow elephant run

like a cheetah (without it even noticing). Proceedings of the VLDB Endowment,3(1-2):515–529, 2010.

[10] http://dwgeek.com/hive-different-file-formats-text-sequence-rc-avro-orc-parquet-file.html/

[11] https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/

[12] http://web.cse.ohiostate.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf

[13] https://www.semantikoz.com/blog/orc-intelligent-big-data-file-format-hadoop-hive/

[14] A. Floratou, U. F. Minhas, and F. Ozcan. Sql-on-hadoop: Full circle back to shared-nothing database architectures.

Proceedings of the VLDB Endowment, 7(12), 2014.

[15] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang,S. Antony, H. Liu, and R. Murthy. Hive-a petabyte

scale data warehouse using hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages 996–

1005. IEEE,2010.

International Journal of Research

Volume 7, Issue IX, September/2018

ISSN NO:2236-6124

Page No:171