Analysis of historical movie data by BHADRA


COMPUTER SCIENCE AND ENGINEERING

ANALYSIS OF HISTORICAL MOVIE DATA BY USING HADOOP SYSTEM

INTERNAL GUIDE: T. CHANDRA SHEKAR REDDY

G. VEERABHADRA (13R21A05C8)

INDEX

• Abstract
• Requirements
• Dataflow Diagram
• Methodology
• Screenshots
• Future Extension
• Conclusion
• References

ABSTRACT

A recommendation system learns a person's tastes and automatically finds new, desirable content for them, based on patterns in their likes and their ratings of different items. In this paper, we propose a recommendation system built on the Hadoop framework for the large amount of data available on the web in the form of ratings, reviews, opinions, complaints, remarks, feedback, and comments about any item (products, events, individuals, and services).

REQUIREMENTS

• Hadoop 2.x
• MySQL
• HDFS
• Hive
• Pig
• Hue
• JDK 1.6

Dataflow Diagram

MS Excel (datasets in CSV format)
→ Import into Cloudera home
→ Create database in MySQL
→ Load the data into MySQL
→ Load the data into Hive using Sqoop
→ Load the data into Hue

HDFS

Hadoop Distributed File System (HDFS): HDFS is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks.

An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data.
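The partitioning described for HDFS can be illustrated with a small sketch. It assumes the usual Hadoop 2.x defaults of 128 MB blocks and a replication factor of 3. The function names `count_blocks` and `place_replicas` and the host names are hypothetical, and the round-robin placement is a simplification (real HDFS placement is rack-aware).

```python
import math

# Assumed Hadoop 2.x defaults: 128 MB blocks, replication factor 3.
BLOCK_SIZE = 128 * 1024 * 1024
REPLICATION = 3


def count_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of HDFS-style blocks needed to hold file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    return math.ceil(file_size / block_size)


def place_replicas(num_blocks, hosts, replication=REPLICATION):
    """Hypothetical round-robin placement of block replicas on hosts.

    Block i is copied to `replication` consecutive hosts. Real HDFS uses
    a rack-aware policy; this only illustrates how data is spread out.
    """
    if len(hosts) < replication:
        raise ValueError("need at least as many hosts as replicas")
    return [
        [hosts[(i + r) % len(hosts)] for r in range(replication)]
        for i in range(num_blocks)
    ]


# A 300 MB file spread over four hypothetical hosts.
blocks = count_blocks(300 * 1024 * 1024)
layout = place_replicas(blocks, ["host1", "host2", "host3", "host4"])
for index, replica_hosts in enumerate(layout):
    print("block", index, "->", replica_hosts)
```

Because every block lives on several hosts, a computation can be scheduled on any host that already holds the block it needs, which is what "close to their data" refers to.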

HDFS Architecture:

HIVE

• Hive is a data warehousing framework in Hadoop where data is stored in the form of tables (structured format). Hive runs on top of HDFS and MapReduce.

• The back-end storage for Hive is HDFS, and the execution model is MapReduce.

• Hive provides an SQL-like language called HiveQL (HQL). HQL is very similar to SQL.

• Hive is designed for scalability and ease of use.

Hive primitive data types:

• tinyint (1 byte)
• smallint (2 bytes)
• int (4 bytes)
• bigint (8 bytes)
• float (4 bytes)
• double (8 bytes)
• string (max size 2 GB)
• varchar (Hive 0.12.0 supports 1 to 65535 characters)
• boolean (true/false)

HIVE complex data types:

SQOOP

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

COMMAND

Copy the file from Windows to Cloudera.

For creating the database:
mysql> create database name;

For using the database:
mysql> use name;

COMMAND

For creating the table:
mysql> create table tablename(….);

COMMAND

To import the data sets into MySQL, load the file with the following command:
mysql> load data local infile 'path of the file' into table tablename fields terminated by ',' enclosed by '"' lines terminated by '\r\n';
mysql> exit;
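The delimiter options in the load command (fields terminated by a comma, enclosed by double quotes, lines terminated by \r\n) can be checked on a sample before loading. The sketch below uses Python's standard csv module as a stand-in for MySQL's parser. The sample rows and the function name `parse_rows` are hypothetical placeholders, not data from the project.

```python
import csv
import io


def parse_rows(text):
    """Split CSV text the way the load command is configured to:
    comma-separated fields, optionally enclosed in double quotes.
    The csv reader accepts \\r\\n line endings on its own."""
    # newline="" keeps the \r\n endings intact for the csv module.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=",", quotechar='"')
    return list(reader)


# Hypothetical sample in the Windows-style format exported from Excel.
sample = '"Movie A","2015","7.5"\r\n"Movie B, Part 2","2016","8.1"\r\n'

rows = parse_rows(sample)
for row in rows:
    print(row)
```

The second row shows why the enclosing quotes matter: the comma inside the quoted title stays part of the field instead of starting a new column.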

COMMAND

For importing the data from MySQL to Hive, the following command is used:
sqoop import --connect jdbc:mysql://localhost/databasename --username root --password cloudera --table tablename --fields-terminated-by ',' --hive-import -m 1

To log in to Hue: username: Cloudera, password: Cloudera. Then go to the Hive editor.

On the left side, select the database; on the right side, run analytical queries on the tables created. Once the result is displayed, select a chart type, and repeat the same process for each of the respective years.
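When the Sqoop import has to be repeated for several tables, the command can be assembled programmatically. This is a minimal sketch that only builds the argument list and does not run Sqoop. The function name `build_sqoop_import` and its parameters are hypothetical; the flags are the ones used in the import command for this project.

```python
import shlex


def build_sqoop_import(database, table, username, password,
                       host="localhost", field_sep=",", mappers=1):
    """Return the sqoop import command as a list of arguments."""
    return [
        "sqoop", "import",
        "--connect", "jdbc:mysql://{}/{}".format(host, database),
        "--username", username,
        "--password", password,
        "--table", table,
        "--fields-terminated-by", field_sep,
        "--hive-import",          # load the imported data into Hive
        "-m", str(mappers),       # number of parallel map tasks
    ]


# Placeholder names, matching the placeholders used in the slides.
cmd = build_sqoop_import("databasename", "tablename", "root", "cloudera")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the command as a list avoids quoting mistakes if it is later passed to `subprocess.run` on a machine where Sqoop is installed.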

Representing bar graphs between title and rating

Representing bar graphs between Budget_in_crores and collection

Representing bar graphs between Year and collection
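The kind of analytical query behind these charts can be sketched as follows. The sketch uses Python's built-in sqlite3 module as a stand-in for the Hive editor, so it is not the project's actual setup. The table name `movies`, the lower-case column names, and all rows are hypothetical placeholders; the two SELECT statements use only basic SQL that HiveQL also accepts.

```python
import sqlite3

# Hypothetical placeholder rows: title, year, rating, budget_in_crores, collection.
SAMPLE_ROWS = [
    ("Movie A", 2014, 7.5, 40.0, 120.0),
    ("Movie B", 2014, 6.1, 25.0, 60.0),
    ("Movie C", 2015, 8.2, 90.0, 300.0),
    ("Movie D", 2015, 5.4, 15.0, 20.0),
]


def load_sample():
    """Create an in-memory table standing in for the Hive table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movies (title TEXT, year INTEGER, rating REAL, "
        "budget_in_crores REAL, collection REAL)"
    )
    conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def best_rated(conn, limit):
    """Titles with the highest ratings (title vs. rating chart)."""
    query = "SELECT title, rating FROM movies ORDER BY rating DESC LIMIT ?"
    return conn.execute(query, (limit,)).fetchall()


def collection_by_year(conn):
    """Total collection per year (year vs. collection chart)."""
    query = (
        "SELECT year, SUM(collection) AS total_collection "
        "FROM movies GROUP BY year ORDER BY year"
    )
    return conn.execute(query).fetchall()


conn = load_sample()
print(best_rated(conn, 2))
print(collection_by_year(conn))
```

In Hue, the same SELECT statements would be typed into the Hive editor, and the result set would then be rendered as a bar graph.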

FUTURE EXTENSION

Big Data is clearly still in its early stages, and there is much more to be discovered. The technology brings business benefits when it is leveraged across domains such as Big Data, Business Intelligence, and Analytics. These business benefits are:

• Speed and accelerated performance: good query performance for improved decision making, faster data load processes for low data latency, and accelerated memory planning capabilities.

• New business insights: self-service BI and more flexible modeling capabilities.

• Faster business processes.

CONCLUSION

The availability of Big Data, low-cost commodity hardware, and new information management and analytic software has produced a unique moment in the history of data analysis. The convergence of these trends means that, for the first time, we have the capabilities required to analyze very large data sets quickly and cost-effectively. These capabilities are neither theoretical nor trivial. They represent a genuine leap forward and a clear opportunity to realize enormous gains in efficiency, productivity, revenue, and profitability. The Age of Big Data is here, and these are truly revolutionary times if business and technology professionals continue to work together and deliver on the promise. The promises of Big Data include innovation, growth, and long-term sustainability.

From the results, we can analyze the movies and produce reports such as the best rated, highest budget, and highest collection within a click.

REFERENCES

• https://www.tutorialspoint.com/
• http://hadooptutorials.co.in/tutorials/hadoop/internals-of-hdfs-file-read-operations.html
• http://www.hadooptpoint.com/hadoop-hive-architecture/
• http://downloads.vmware.com/d/info/desktop_downloads/vmware_workstation/7_0
• http://www.cloudera.com/
• Hadoop: The Definitive Guide, Tom White
• Big Data Analytics, Wiley

Screenshot (Implementation):

Gantt Chart (definition): A Gantt chart is a chart in which a series of horizontal lines shows the amount of work done or production completed in certain periods of time in relation to the amount planned for those periods.

Future Work: In the next phase, we will analyze the datasets loaded into Hive using Hue or the R tool.

Conclusion: In this project, we loaded a large set of datasets into HDFS using Sqoop and Hive. The movie data can then be easily analyzed using Hue.