CIS 520: Software Engineering
Movie Data Analysis using HiveQL
Submitted to: Dr. Jongwook Woo
Kumari Parul Bisen, Krutik Shah, Manvi Chandra
Table of Contents
Movies
Project Description
Hadoop, Hive, Power View
Cloudberry Explorer for Azure Blob Storage
Flowchart
Relation to SDLC
Hive Queries for data analysis
Output and Visualization on graphs
Dashboard
What is the Movie Dataset?
We extracted movie data from http://www.the-numbers.com/.
The-Numbers has tracked over 20,000 movies.
This data analysis is based on MPAA (Motion Picture Association of America) ratings, rankings, genre, gross profit, and tickets sold.
Project Description
We analyze the movie data using HiveQL.
The results are exported to Excel sheets.
The analyzed data is visualized using Power View in MS Excel.
Hadoop
Hadoop is an open-source framework for distributed storage and processing of very large datasets.
A Hadoop cluster is a special type of computational cluster built to store and analyze large volumes of unstructured data.
Hadoop clusters are gaining popularity because they speed up data analysis applications and are highly scalable.
Hadoop clusters are also fault tolerant: data is replicated across nodes, so the cluster keeps working when a node fails.
Hive
Hive is a data warehouse system for Hadoop. It supports querying and data analysis using HiveQL. Hive lets users project structure onto large unstructured datasets. Hive can read both structured and unstructured data, including text files whose fields are delimited by specific characters.
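The idea of projecting structure onto delimited text can be sketched in a few lines of Python. This is only an illustration of the concept, not how Hive is implemented: the column names, the converters, and the sample row are assumptions made up for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical schema: a column name plus a converter.
# The schema is applied only when a line is read, not when it is stored.
SCHEMA: List[Tuple[str, Callable[[str], object]]] = [
    ("title", str),
    ("genre", str),
    ("mpaa_rating", str),
    ("tickets_sold", int),
]


def parse_line(line: str, delimiter: str = "\t") -> Dict[str, object]:
    """Project SCHEMA onto one raw delimited line of text."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields, got {len(fields)}")
    return {name: convert(value) for (name, convert), value in zip(SCHEMA, fields)}


# Made-up sample row, not taken from the project's dataset.
row = parse_line("Example Movie\tDrama\tPG-13\t12345\n")
print(row)
```

The raw file stays plain text; changing the delimiter or the schema only changes how the same bytes are interpreted at read time.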
Power View
Power View is an Excel add-in that lets users collect, store, model, and analyze large volumes of data in Excel.
Power View provides intuitive visualization of data from Power Pivot models.
Power View works like a visualization layer on top of Excel.
Cloudberry Explorer for Blob Storage
It is leveraged by Microsoft Azure Storage Analytics.
It is available in two versions: Freeware and Pro.
We used this tool to upload data from the local machine to Azure Blob storage.
Flowchart
Download data from the data source
Format the file as txt
Upload the files with Cloudberry Explorer to Microsoft Azure Blob Storage
Use HiveQL to create external tables
Use query results and Power View to analyze data
Dashboard visualization
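The "format the file as txt" step could look roughly like the Python sketch below. It assumes the downloaded data is comma-separated with a header row, which the slides do not state. The function name `csv_to_tsv` and the sample text are hypothetical.

```python
import csv
import io


def csv_to_tsv(csv_text: str) -> str:
    """Convert comma-separated text to tab-delimited text, dropping the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header so it is not read later as a data row
    lines = []
    for record in reader:
        if not record:  # ignore blank lines
            continue
        # Replace stray tabs inside a field so they cannot be mistaken for delimiters.
        cleaned = [field.replace("\t", " ").strip() for field in record]
        lines.append("\t".join(cleaned))
    return "\n".join(lines) + ("\n" if lines else "")


# Made-up sample input, not taken from the project's dataset.
sample = 'title,genre,tickets_sold\n"Example, The",Drama,12345\n'
tsv = csv_to_tsv(sample)
print(tsv, end="")
```

A tab delimiter is a common choice here because movie titles often contain commas, and a table defined over delimited text needs a separator that never appears inside a field.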
Relation with SDLC
Determining the scope, time estimation, and expected output
Gathering data from The-Numbers.com and analyzing it
Designing: acquiring the necessary software for execution, i.e., ODBC, HDInsight, Microsoft Azure
Implementing: developing programs, preparing documents
Testing and maintaining
Transfer data from Local to HDInsight
Hive Queries
Recommendation based on the Analysis
We use a recommendation technique called content-based filtering to identify the most popular movies.
Content-based filtering uses a set of discrete characteristics of an item to recommend additional items with similar properties.
To find the most popular movies in our dataset, we consider rank, gross revenue earned, and the number of tickets sold.
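A minimal Python sketch of this idea follows. It is not the project's actual method: the `Movie` fields, the attribute-match similarity, the tie-breaking order, and the catalog entries are all assumptions chosen for illustration. Candidates are scored by how many attributes (genre, MPAA rating) they share with a liked movie, and ties are broken by tickets sold, gross revenue, and rank.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Movie:
    title: str
    genre: str
    mpaa_rating: str
    rank: int          # lower is better
    gross: int         # gross revenue, arbitrary units
    tickets_sold: int


def similarity(a: Movie, b: Movie) -> float:
    """Fraction of the compared attributes (genre, MPAA rating) that match."""
    matches = (a.genre == b.genre) + (a.mpaa_rating == b.mpaa_rating)
    return matches / 2


def recommend(liked: Movie, catalog: List[Movie], top_n: int = 2) -> List[Movie]:
    """Return the most similar movies, preferring more popular ones on ties."""
    candidates = [m for m in catalog if m != liked and similarity(liked, m) > 0]
    candidates.sort(
        key=lambda m: (similarity(liked, m), m.tickets_sold, m.gross, -m.rank),
        reverse=True,
    )
    return candidates[:top_n]


# Placeholder catalog with made-up values, not taken from the project's dataset.
CATALOG = [
    Movie("Movie A", "Action", "PG-13", 3, 300, 30),
    Movie("Movie B", "Action", "PG-13", 4, 200, 20),
    Movie("Movie C", "Action", "R", 1, 500, 50),
    Movie("Movie D", "Comedy", "PG", 2, 400, 40),
]

picks = recommend(CATALOG[0], CATALOG)
print([m.title for m in picks])
```

Similarity comes first in the sort key so that a less popular movie with matching genre and rating still outranks a more popular one that matches on genre alone.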
Output and Visualizations
Conclusion
We analyzed the movie data using HiveQL.
We exported the analyzed data to Excel and represented it using Power View.
We visualized the results in a dashboard.
THANK YOU 😊