Hive Tutorials - Hive Getting Started

Apache Hive Tutorials Hive Getting Started

Prashant Kumar Pandey

http://hive.apache.org/

http://www.thriveschool.blogspot.in/








In my previous post, we saw how to execute MapReduce jobs using Java. Java is the most flexible and powerful way to do all MapReduce tasks, but it requires a lot of time and effort. Much of the data analytics process is repetitive, so there is an opportunity for a high-level tool that accomplishes those things easily and hides all the complexity inside. That’s where Hive comes in.

Hive provides a familiar model for those who know SQL and allows them to think and work from a database perspective. When commands and queries are submitted to Hive, they go to the driver. The driver compiles, optimizes and executes them as a series of MapReduce jobs.

It seems obvious that the driver would generate Java MapReduce code internally, but that is not the case. Hive has generic Mapper and Reducer modules that operate based on information in an XML file representing the job plan.

When we create a table, the table schema and other system metadata are stored in a separate metadata store. This metastore is a traditional relational database, usually MySQL.

Hive gives a quick start to those who are familiar with SQL, and gives everyone an easy, high-level tool for data analysis.

Let’s get started with Hive.

If you do not have the Hortonworks HDP Sandbox set up, or you are new to the HDP Sandbox, I recommend going through at least the posts below.

1. Getting a portable hadoop environment

2. Using HDFS file system.

You need to log in to your VirtualBox VM as the root user with the password hadoop. It is not advisable to work as root, so I created a new user for working with Hive. In Linux, you can use the useradd user_name command to create a new user. After creating the new user, you must set a password for it using the passwd user_name command. Once the new user is created, log off as root and log in using the credentials you just created.

Follow the screen below and the explanation given here.

The first command I executed is hive. This command starts the Hive command line interface (CLI), and your prompt changes to hive>. Once you see this prompt, you are in the Hive CLI and ready to execute Hive commands.





http://www.thriveschool.blogspot.in/2013/10/geting-portable-hadoop-environment.html

http://www.thriveschool.blogspot.in/2013/11/using-hdfs-file-system.html


If you are familiar with relational databases like Oracle, MySQL or MS SQL Server, you must be aware of the concepts of database and schema. In Hive, database and schema are synonymous, and both are just a namespace: a way of organizing tables into a logical group. This grouping is valuable on large clusters where multiple people work as a team, because it avoids table name conflicts.

The next command I executed in the above screenshot is a Hive CLI command called set. In Hive, the set command is used to set or display variables. We will talk about it in more detail later. For now, I have set the configuration variable hive.cli.print.current.db to true. Once this variable is set to true, the Hive prompt also displays the database you are currently working in. You must have noticed that after the set command is executed, the prompt changes from hive> to hive (default)>. In this case, we are working in the default database, which is now displayed as part of the prompt.
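As a minimal sketch, the set command described above would be typed at the hive> prompt like this. The comments describe the effect as stated in the text, not captured output.

```sql
-- Ask the CLI to include the current database in its prompt.
set hive.cli.print.current.db=true;
-- From here on the prompt should read: hive (default)>
```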

The next command demonstrates that we can also use the set command to display the value of a variable. In this case, I displayed the value of hive.metastore.warehouse.dir. This is another Hive configuration variable, which holds the directory location where Hive will create all my databases and tables. We will demonstrate it in detail further down.
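A rough sketch of displaying a variable: giving set a variable name without a value prints its current setting. The value mentioned in the comment is the one from the sandbox used in this post; other installations will likely differ.

```sql
-- With no "=value" part, set only prints the variable.
set hive.metastore.warehouse.dir;
-- On the HDP sandbox used in this post the location is /apps/hive/warehouse.
```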

When we start the Hive CLI using the hive command, it looks for a file named .hiverc in your home directory. If the .hiverc file is found, the CLI executes all the commands placed in this file. Yes, you are right: you can place your set hive.cli.print.current.db=true; command in this file so that every time you start the CLI, it shows your current database in the prompt.
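For illustration, a .hiverc file along these lines could look like the sketch below. It assumes the file sits in the home directory of the user who starts the CLI, and it contains only the one setting discussed so far.

```sql
-- Contents of ~/.hiverc: the CLI runs these commands at every startup.
set hive.cli.print.current.db=true;
```

Each command goes on its own line and ends with a semicolon, exactly as it would be typed at the hive> prompt.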

But I don’t want to use the default database, so let’s create a new database using the create database database_name; command, as shown in the screenshot below.

You can use the describe database database_name; command to describe your database. You can see in the screen below that it shows the URI hdfs://sandbox:8020/apps/hive/warehouse/pkdb.db.

You must have noticed that /apps/hive/warehouse is the location we saw as the output of the set command in the previous screen. So my database pkdb (automatically suffixed with .db) is created under this directory in the HDFS file system.
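Put together, creating and describing the database might be typed as in this sketch. The name pkdb is the one used in this post; the location in the comment is the URI quoted above, and on another cluster the host and path would be different.

```sql
-- Create a new database (a namespace for tables).
create database pkdb;

-- Show its details, including where it lives in HDFS.
describe database pkdb;
-- In this post the reported location is
-- hdfs://sandbox:8020/apps/hive/warehouse/pkdb.db
```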

With the next command, use database_name;, I changed my current database from default to the newly created pkdb database. Hence the Hive prompt also changed accordingly. Now I can place the use pkdb; command in my .hiverc file so that I am always using my own database instead of default.
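A short sketch of switching databases, assuming the pkdb database already exists and hive.cli.print.current.db is set to true so the prompt reflects the change:

```sql
-- Make pkdb the current database for the rest of the session.
use pkdb;
-- The prompt should now read: hive (pkdb)>
```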

Let’s go to the /apps/hive/warehouse directory in the Hadoop file system to check what was created there. In the screen below, I have just listed the contents of this directory. I can see that pkdb.db is created as a directory under this location.
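One way to list this directory without leaving the Hive CLI is its built-in dfs command, sketched below. This is an assumption about how to do it; the listing could equally be produced from the Linux shell with the hadoop fs -ls command.

```sql
-- List the warehouse directory from inside the Hive CLI.
dfs -ls /apps/hive/warehouse;
-- pkdb.db should appear in the listing as a directory.
```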

Great, which means that, as far as HDFS is concerned, a Hive database is nothing more than a directory.

Let’s create a table in our newly created database. The method for creating a table is almost the same as in other databases: we can do it using create table table_name (column_name column_data_type); as shown in the screen below. The table is created, and I fired a select statement on it. I do not get any records back because there is no data in the table yet, but isn’t it simple? If you already know SQL, it’s just a matter of days for you to learn Hive. Keep reading.
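As a sketch of this step, the commands below create a one-column table and query it. The table name employee and the column name are made up for illustration; they are not necessarily the ones used in the original screenshot.

```sql
-- Work inside our own database.
use pkdb;

-- Hypothetical table with a single string column.
create table employee (name string);

-- The table is empty, so this returns no rows.
select * from employee;
```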








Now that I have a table, I want to go back to the HDFS file system and check what was created in my database for the new table. You will be surprised to see that it’s again a directory.
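To check this from inside the Hive CLI, the dfs command can list the database directory, as in the sketch below. The path assumes the warehouse location and the pkdb database used in this post.

```sql
-- List what Hive created under the pkdb database directory.
dfs -ls /apps/hive/warehouse/pkdb.db;
-- Each table created in pkdb with default settings should show up as a sub-directory named after the table.
```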

Great, now you have learned how to get into Hive. Let’s drop the table and database we created.
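A possible cleanup sequence is sketched below. It assumes the database is pkdb and uses a hypothetical table name, employee; substitute whatever table you created. The table is dropped first because, by default, Hive refuses to drop a database that still contains tables.

```sql
-- Drop the table while pkdb is the current database.
use pkdb;
drop table if exists employee;

-- Switch away from pkdb before dropping it.
use default;
drop database if exists pkdb;
```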

One last thing: I mentioned in the beginning that in Hive, database and schema are synonymous. That means you can use create schema schema_name; instead of create database database_name; and the result of both commands is the same.
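To see this for yourself, a quick experiment along these lines should do. The name pkschema is a made-up example.

```sql
-- Create a "schema"; Hive treats it as a database.
create schema pkschema;

-- pkschema should be listed together with the other databases.
show databases;

-- Clean up; drop schema works the same way as drop database.
drop schema pkschema;
```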

We will catch up again on Hive in more detail in future posts.

Keep reading…..Keep learning…..Keep growing.








