Apache Hive Tutorials: Hive Getting Started
Prashant Kumar Pandey
Learn more with us at http://www.thriveschool.blogspot.in/
In my previous post, we saw how to execute MapReduce jobs using Java. Java is the most
flexible and powerful way to write MapReduce tasks, but it requires a lot of time and
effort. Much of the work in data analytics is repetitive, which creates an opportunity for
a high-level tool that accomplishes those tasks easily while hiding the complexity inside.
That’s where Hive comes in.
Hive provides a familiar model for those who know SQL and allows them to think and work from
a database perspective. When commands and queries are submitted to Hive, they go to the driver.
The driver compiles, optimizes and executes them as a series of MapReduce jobs.
It may seem obvious that the driver generates Java MapReduce code internally, but that is not
the case. Hive has generic Mapper and Reducer modules that operate based on information in an
XML file representing the job plan.
When we create a table, the table schema and other system metadata are stored in a separate
metadata store. This metastore is a traditional relational database, usually MySQL.
Hive gives a quick start to those who are familiar with SQL, and gives everyone a high-level,
easy-to-use tool for data analysis.
Let’s get started with Hive.
If you do not have the Hortonworks HDP Sandbox set up, or you are new to the HDP Sandbox, I
recommend going through at least the posts below.
1. Getting a portable hadoop environment
2. Using HDFS file system.
You need to log in to your sandbox virtual machine as the root user with the password hadoop. It
is not advisable to work as root, so I created a new user for working with Hive. In Linux you can
use the useradd user_name command to create a new user. After creating the user, you must set a
password for it using the passwd user_name command. Once the new user is created, log off as root
and log in using the credentials you just created.
Follow the screen below and the explanation given here.
The first command I executed is hive. This command starts the Hive command line interface (CLI),
and your prompt changes to hive>. Once you see this prompt, you are in the Hive CLI and
ready to execute Hive commands.
If you are familiar with other relational databases like Oracle, MySQL or MS SQL Server, you
must be aware of the concepts of database and schema. In Hive, database and schema are
synonymous, and both are really just a namespace: a way of organizing tables into a logical
group. This grouping is valuable in large clusters where multiple people work in a team,
because it avoids table name conflicts.
The next command I executed in the above screenshot is a Hive CLI command called set. In
Hive, the set command is used to set or display variables. We will talk about it in more detail
later. For now, I have set the configuration variable hive.cli.print.current.db to true. Once this
variable is set to true, the Hive prompt also displays the database you are currently working in.
You must have noticed that after the set command is executed, the prompt changes from hive> to
hive (default)>. In this case, we are working in the default database, which is now displayed as
part of the prompt.
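As a minimal sketch, this is the command as it would be typed at the hive> prompt. The comment line is only an annotation and can be left out.

```sql
-- Show the current database as part of the Hive CLI prompt
set hive.cli.print.current.db=true;
```

After this, the prompt should read hive (default)> for the rest of the session.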
The next command demonstrates that we can also use the set command to display the value of a
variable. In this case, I displayed the value of the variable hive.metastore.warehouse.dir. This is
another Hive configuration variable; it holds the directory location under which Hive creates all
my databases and tables. We will demonstrate it in detail further down.
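A short sketch of displaying a variable: when set is given a variable name without a value, it prints the current setting instead of changing it.

```sql
-- No "=value" part, so this only displays the setting
set hive.metastore.warehouse.dir;
```

On the sandbox used in this post, the value reported should be /apps/hive/warehouse; other installations may be configured with a different location.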
When we start the Hive CLI using the hive command, it looks for a file named .hiverc in your home
directory. If the .hiverc file is found, the CLI executes all the commands placed in this file. So
yes, you can place your set hive.cli.print.current.db=true; command in this file, and every time
you start the CLI it will show your current database in the prompt.
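For illustration, a .hiverc file containing just that one setting could look like the sketch below. The file is plain text with one Hive CLI command per statement, and the comment line is optional.

```sql
-- Contents of .hiverc in your home directory; run by the Hive CLI at startup
set hive.cli.print.current.db=true;
```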
But I don’t want to use the default database, so let’s create a new database using the create
database database_name; command as shown in the screenshot below.
You can use the describe database database_name; command to describe your database. You can
see in the screen below that it shows the URI
hdfs://sandbox:8020/apps/hive/warehouse/pkdb.db.
You must have noticed that /apps/hive/warehouse is the location we saw as the output of the
set command in the previous screen. So my database pkdb (automatically suffixed with .db) is
created under this directory in HDFS.
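The two commands discussed above, written out for the pkdb database used in this post, would be roughly:

```sql
-- Create the database, then show its details, including its HDFS location
create database pkdb;
describe database pkdb;
```

The location reported by describe depends on your cluster; the hdfs://sandbox:8020 prefix shown in this post is specific to the HDP sandbox.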
With the next command, use database_name;, I changed my current database from default to the
newly created pkdb database, and the Hive prompt changed accordingly. Now I can place the use
pkdb; command in my .hiverc file so that I always work in my own database instead of default.
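A small sketch of switching databases, assuming the pkdb database already exists:

```sql
-- Make pkdb the current database for this session
use pkdb;
```

With hive.cli.print.current.db set to true, the prompt should now read hive (pkdb)>. The same line can be added to .hiverc, after the set command, to make the switch automatic at startup.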
Let’s go to the /apps/hive/warehouse directory in the Hadoop file system to check what was
created there. In the screen below, I have listed the contents of this directory. I
can see that pkdb.db was created as a directory under this location.
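One way to produce such a listing without leaving the Hive CLI is its dfs command, sketched below. This is an assumption about convenience only; the screenshot may equally have been taken from the Linux shell using the hadoop fs command.

```sql
-- List the warehouse directory from inside the Hive CLI
dfs -ls /apps/hive/warehouse;
```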
Great. This means that in Hive, a database is nothing more than a directory in HDFS.
Let’s create a table in our newly created database. Creating a table works much as it does in
other databases: we can use create table table_name (column_name
column_data_type); as shown in the screen below. The table is created, and I ran a select
statement on it. I do not get any records back because there is no data in the table yet,
but isn’t it simple? If you already know SQL, it’s just a matter of days for you to learn Hive.
Keep reading.
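As a sketch of these two steps: the table name employee and its columns are hypothetical, chosen only for illustration, since the table used in the screenshot is not named in the text.

```sql
-- Hypothetical table: employee, with an integer id and a string name
create table employee (id int, name string);

-- The table is empty, so this returns no rows
select * from employee;
```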
Now that I have a table, I want to go back to HDFS and check what was created in
my database for the new table. You may be surprised to see that it is again a directory.
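To check this from inside the Hive CLI, a listing of the database directory might look like the sketch below. It assumes the warehouse location shown earlier in this post.

```sql
-- List the pkdb database directory; each table appears as a subdirectory
dfs -ls /apps/hive/warehouse/pkdb.db;
```

If the table were named employee (a hypothetical name), you would expect to see a directory called employee in the output.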
Great, now you have learned how to get started with Hive. Let’s drop the table and the database we created.
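A sketch of the cleanup follows. The table name employee is hypothetical; substitute the name of the table you created. The table is dropped first because, by default, Hive refuses to drop a database that still contains tables.

```sql
-- Drop the table first (employee is a hypothetical table name)
drop table employee;

-- Switch away from pkdb, then drop the now-empty database
use default;
drop database pkdb;
```

Alternatively, drop database pkdb cascade; removes the database together with any tables left in it.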
One last thing: I mentioned at the beginning that in Hive, database and schema are synonymous.
That means you can use create schema schema_name; instead of create database
database_name; and the result of both commands is the same.
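To see the equivalence for yourself, a quick sketch using a hypothetical schema name, pkschema:

```sql
-- pkschema is a hypothetical name used only for this illustration
create schema pkschema;

-- The schema is visible through the database commands as well
describe database pkschema;

-- Clean up
drop schema pkschema;
```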
We will catch up again on Hive in more detail in a future post.
Keep reading... Keep learning... Keep growing.