Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Performance Evaluation and Optimization of Hybrid Indexes in Hive
1 G.Sravanthi, 2 C.V.S.Satyamurty
1,2 Computer Science of Engineering, Information Technology
1,2 CVR College of Engineering, Hyderabad, India
E-mail:[email protected], E-mail: [email protected]
Abstract: Reducing risks & reduced costs for the business, Big data technologies are providing important
and accurate analysis, which leads to concrete decision-making results in grater operational efficiencies. An
infrastructure would be required to obtain all the features of bid data, for managing and processing huge
volumes of indexed and non-indexed data in real-time and to protect data privacy and security. Many Vendors
like Amazon, IBDM, Microsoft etc products in these technologies are available to handle big data are
categorized into two Operational Big Data and Analytical Big Data. Hive a data warehouse tool for easy
querying and analyzing processes structured data in Hadoop which resides on top of Hadoop to recapitulate Big
Data. Hive supports Compact and Bitmap indexing, to improve the query performance. These indexes are used
individually on Hive; we have implemented both the indexes on the same partition in Hive, and got good
performance optimization. In this paper text data is used for analyzing the Simultaneous combo Indexes
(Compact & Bitmap) on the partition had resulted in good performance Optimization.
Keywords: Hive, Hadoop, Indexing, Compact Indexing, Hadoop infrastructure, Big Data, HiveQL.
1. INTRODUCTION
Enterprise Organizations are using data to avail better decisions making to enhance their performances. The
decade of 21st century showcased the fast changes in the availability of data and its efficacy of the Organization.
This has also revolutionize the usage of data and brought the concept of Big Data [1]
Big Data
Availability of huge amount of data where storing, processing and mining or extracting is a difficult
using traditional storage [2]. These concepts have been the reasons embraced Online Firms (Google, eBay,
Facebook, LinkedIn etc).
Big Data for Small and Big Companies
There are many reasons that Why Big Data was first accepted by Online Firms. These Enterprise and
startups go around the concept of using rapidly challenging data and facing the issues of integrating unstructured
data with the existing techniques [3]. The following are the issues or challenges regarding Big-Data are as
follows.
Volume: Huge amount of data available made a challenge as it can be neither possible nor efficient to mange
using traditional databases.
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:160
Variety: In comparison to earlier text and table format, current available versions are in the form of pictures,
videos and tweets etc.
Velocity: Increased usage of Online Space and that the data was available was rapidly changing and have to be
made available instantly and effectively [4].
Open Source Tool for Big Data Analytics
A distributed software solution Hadoop is a scalable fault tolerance for data storage, process, and
extracting are the main uses.
i. HDFS (which is storage)
ii. Map Reduce (Fig. 1)
Fig1: Hadoop HDFS, Map Reduce Logical View
Fig2: The Hadoop technology stack
2. Related Word
Hive
Data warehouse for OLAP workloads to handle and query is the Hive over vast volume of data in
distributed storage. The Hadoop Distributed File System (HDFS)[5] is the ecosystem which it maintains huge
volumes of data reliably from hardware faults [6].
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:161
Fig3: Layered Architecture of Big Data System
The SQL-like relational data warehousing approach Hive is developed on top of Hadoop and a massive
computation at a bandwidth in Hadoop enabling data streams called MapReduce is a high-level programming
model.
HiveQL
It is a SQL-like query language, is used to query over Hive. Hive does not support SQL Commands
(Update, Delete, and Insert at Row-Level) and transactions also. Hadoop Client Hive is to function with rare
update and batch-mode data insertions are the characteristics of the Hive as a underlying system. Hive an OLAP
data For Online Transaction Processing (OLTP) features with big data, one should consider a NoSQL database
[7].
Data in Hive
Hive Data Types
Hive common relational databases data types as well as the three collection types: STRUCT, MAP, and
ARRAY.
Hive File Formats
Hive can process data coming from different data sources using Extract, Transform and Load (ETL)
tools for reading data from different file format and their customization.
Index Construction in Hive
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:162
WITH DEFERRED REBUILD
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[[ROW FORMAT ...] STORED AS ...| STORED BY ...]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
There are some parameters and keywords used in the command listing:
1. index_name: The index name given by the user used to access the index by the user itself. This name is used to
refer to the index in ALTER INDEX and DROP INDEX commands.
2. base_table_name (col_name,...): The table over which the index is to be created using the desired columns
listed.
3. 'index.handler.class.name': this specifies the type of the index, which could be bitmap or compact values.
4. index_table_name: The index name given by Hive (the default name) or the user, used to access the index as a
table by the user or Hive. For example the user can see the content of an index using this parameter. Any other
behavior of the index as a table can be addressed using this name.
5. WITH DEFERRED REBUILD: This means the index starts empty. Index will be populated using:
ALTER INDEX index_name ON table_name [PARTITION (...)] REBUILD
6. IDXPROPERTIES: This gives the properties of the index. For example: 'index_creator' = 'Mahsa' can be
considered a property for an index.
7. IN TABLE index_table_name: This is used when the user wishes to build an index in a different table from
an already built index (if any), to keep them separate.
Hive Architecture
Fig4: Apache Hive Architecture
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:163
Major components of the Apache Hive are:
1. Metastore
2. Driver
3. Compiler
4. Optimizer
5. Executor CLI, UI, and Thrift Server
Hive supports three kinds of indexes, Compact Index, Aggregate Index, and Bitmap index. These are
tables in Hive containing every index dimension and array of location of associated Index. Number of index
dimensions increases table becomes very huge which occupies large amount of disk space.
The aim of creating INDEX is to improve the performance of data retrieval. For Instance, let us say you
are querying with filter condition WHERE column_name = 100, without index will load entire table or partition
to process records and with index on column_name would load part of file to process the records.
Compact Index on Hive Table
Pair Indexed Column’s value and its bolckid of Index Column are stored in Compact indexing.
CREATE COMPACT INDEX:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS index_type
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[ [ ROW FORMAT ...] STORED AS ...
| STORED BY ... ]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"];
Example
hive> CREATE INDEX index_students ON TABLE students(id)
>AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
> WITH DEFERRED REBUILD;
OK
Time taken: 0.493 seconds
ALTER COMPACT INDEX
ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD;
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:164
Example
hive> ALTER INDEX index_students ON students REBUILD;
Query ID = cloudera_20171111093030_6a37b92b-bae8-4fd1-91bb-d13e9d411513
Total jobs = 1
...
Total MapReduce CPU Time Spent: 4 seconds 180 msec
OK
DROP COMPACT INDEX
DROP INDEX [IF EXISTS] index_name ON table_name;
Example
hive> DROP INDEX IF EXISTS index_students ON students;
OK
Time taken: 0.27 seconds
Bitmap Index on Hive Table
Combining indexed column value and list of rows are stored as bitmap in Bitmap indexing.
CREATE BITMAP INDEX:
CREATE INDEX olympic_index_bitmap
ON TABLE olympic (age)
AS 'BITMAP'
WITH DEFERRED REBUILD;
ALTER INDEX olympic_index_bitmap on olympic REBUILD;
Indexing Advantages
1. Indexes improve the query performance.
2. Multiple Indexes can be created on the same table.
3. Any type of indexing can be created on the data.
4. Depending on data Bitmap indexes are faster than Compact indexes and vice versa.
3. Preliminaries and Definitions
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
DN: Data Node
NN: Name Node
SN: Secondary Name Node
RM: Resource Manager
NM: Node Manager
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:165
CLI: Command Line Interface
UI: User Interface
OLTP: Online Transaction Processing
OLAP: Online Analytical Processing
4. System Model
In general the indexes (Compact Index and Bitmap Index) are applied as a separate entity for performance
optimization. In the proposed model both the indexes are applied together and tested on numerous data values
which resulted in more performance optimization.
Fig 5: applying both Indexes
5. Performance Analysis
Table1: Hive Query statistics without indexing (all the results are in seconds)
Rows/Tables Table 1 Table 2 Table 3 Table 4
50 0.628 0.1 0.085 0.167
100 0.14 0.17 0.087 0.17
150 0.137 0.07 0.0204 0.098
500 0.139 0.181 0.093 0.109
1000 0.251 0.074 0.159 0.108
5000 0.161 0.125 0.089 0.196
6000 0.143 0.225 0.089 0.117
10000 0.162 0.092 0.17 0.097
11000 0.145 0.104 0.085 0.07
11500 0.14 0.098 0.086 0.176
Hive
Compact Bitmap
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:166
Chart1: Hive Query statistics without indexing (all the results are in seconds)
Table2: Hive Query statistics with Bitmap Indexing
Rows/Tables Table 1 Table 2 Table 3 Table 4
50 0.129 0.935 0.085 0.086
100 0.157 0.104 0.095 0.182
150 0.104 0.09 0.181 0.096
500 0.139 0.098 0.154 0.084
1000 0.158 0.097 0.196 0.089
5000 0.123 0.175 0.173 0.096
6000 0.113 0.175 0.081 0.124
10000 0.144 0.108 0.082 0.08
11000 0.169 0.192 0.191 0.089
11500 0.121 0.096 0.095 0.185
Chart2: Hive Query statistics with Bitmap Indexing
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:167
Table3: Hive Query with Compact Indexing
Rows/Tables Table 1 Table 2 Table 3 Table 4
50 0.117 0.092 0.215 0.097
100 0.121 0.265 0.09 0.077
150 0.106 0.085 0.105 0.072
500 0.117 0.09 0.17 0.161
1000 0.101 0.161 0.081 0.082
5000 0.184 0.173 0.17 0.075
6000 0.135 0.144 0.099 0.085
10000 0.18 0.178 0.163 0.175
11000 0.164 0.11 0.11 0.068
11500 0.212 0.092 0.09 0.144
Chart3: Hive Query with Compact Indexing
Table4: Hive Query with Both Bitmap and Compact Indexing together
Rows/Tables Table 1 Table 2 Table 3 Table 4
50 0.098 0.101 0.097 0.098
100 0.134 0.092 0.099 0.173
150 0.112 0.082 0.107 0.071
500 0.114 0.17 0.091 0.083
1000 0.106 0.08 0.173 0.083
5000 0.094 0.084 0.083 0.079
6000 0.242 0.116 0.105 0.094
10000 0.198 0.1 0.077 0.107
11000 0.185 0.09 0.148 0.174
11500 0.104 0.185 0.202 0.099
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:168
Note (Line Chart Legend):
Series 1: Table 1
Series 2: Table 2
Series 3: Table 3
Series 4: Table 4
X – Axis Row in 1000 Y-Axis Time in Seconds
Chart4: Hive Query with Both Bitmap and Compact Indexing together
Table5: Average Hive Query with no index, Bitmap, Compact and Both Bitmap and Compact Indexing
together
Rows/Index No Index Bitmap Compact Both
50 0.245 0.30875 0.13025 0.0985
100 0.14175 0.1345 0.13825 0.1245
150 0.08135 0.11775 0.092 0.093
500 0.1305 0.11875 0.1345 0.1145
1000 0.148 0.135 0.10625 0.1105
5000 0.14275 0.14175 0.1505 0.085
6000 0.1435 0.12325 0.11575 0.13925
10000 0.13025 0.1035 0.174 0.1205
11000 0.101 0.16025 0.113 0.14925
11500 0.125 0.12425 0.1345 0.1475
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:169
Chat5: Average Hive Query with no index, Bitmap, Compact and Both Bitmap
and Compact Indexing together
Chart6: (Bar chart) Average Hive Query with no index, Bitmap, Compact and Both Bitmap and Compact
Indexing together
Note (Bar Chart Legend):
Series 1: No Index
Series 2: Bitmap Index
Series 3: Compact Index
Series 4: Both Index
X – Count of Values Y-Axis Time in Seconds
6. Conclusion and Future Scope
Hence in we conclude that Individually Compact and Bitmap indexes have a good performance in data
analysis. In our implementation it has proved that applying both the indexes in Hive Partition had given the best
performance in analyzing single dimensional data. Further these have to be analyzed on audio, video, and image
type data to be analyzed with multi dimensional index on apache spark environment.
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:170
REFERENCES
[1] M. A. Beyer and D. Laney, “The importance of ‟big data‟: A definition,” Gartner, Tech. Rep., 2012..
[2] X. Wu, X. Zhu, G. Q. Wu, et al., “Data mining with big data,” IEEE Trans. on Knowledge and Data Engineering, vol.
26, no. 1, pp. 97-107, January 2014.Rajaraman and J. D. Ullman, “Mining of massive datasets,” Cambridge University Press,
2012.
[3] Z. Zheng, J. Zhu, M. R. Lyu. “Service-generated Big Data and Big Data-as-a-Service: An Overview,” in Proc. IEEE
BigData, pp. 403-410, October 2013. A . Bellogín, I. Cantador, F. Díez, et al., “An empirical comparison of social,
collaborative filtering, and hybrid recommenders,” ACM Trans. on Intelligent Systems and Technology, vol. 4, no. 1, pp. 1-37,
January 2013.
[4] W. Zeng, M. S. Shang, Q. M. Zhang, et al., “Can Dissimilar Users Contribute to Accuracy and Diversity of Personalized
Recommendation?,” International Journal of Modern Physics C, vol. 21, no. 10, pp. 1217-1227, June 2010.
[5] Apache Hadoop [Online]. Available: http://hadoop.apache.org/
[6] Chansler, R., Kuang, H., Radia, S., Shvachko, K. ”The Hadoop Distributed File System”, in Proc. IEEE Conf. Mass
Storage Systems and Technologies (MSST), Incline Village, NV, 2010, pp. 1 – 10
[7] Apache Hbase [Online]. Available: http://hbase.apache.org/
[8] A.Abouzeid, K.Bajda-Pawlikowski, D. Abadi, A. Silberschatz,and A. Rasin. Hadoopdb: an architectural hybrid of
mapreducen and dbms technologies for analytical workloads. Proceedings of the VLDB Endowment, 2(1):922–933, 2009.
[9] J. Dittrich, J.-A. Quian´e-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a yellow elephant run
like a cheetah (without it even noticing). Proceedings of the VLDB Endowment,3(1-2):515–529, 2010.
[10] http://dwgeek.com/hive-different-file-formats-text-sequence-rc-avro-orc-parquet-file.html/
[11] https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/
[12] http://web.cse.ohiostate.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
[13] https://www.semantikoz.com/blog/orc-intelligent-big-data-file-format-hadoop-hive/
[14] A. Floratou, U. F. Minhas, and F. Ozcan. Sql-on-hadoop: Full circle back to shared-nothing database architectures.
Proceedings of the VLDB Endowment, 7(12), 2014.
[15] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang,S. Antony, H. Liu, and R. Murthy. Hive-a petabyte
scale data warehouse using hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages 996–
1005. IEEE,2010.
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO:2236-6124
Page No:171