
Big Data Analysis System Interface: A Technology Tutorial

Adel Awd EL-Agawany

Abstract: This paper presents a desktop graphical user interface for managing big data analysis systems. It helps new researchers build a multi-node cluster for analysing big data and manage the whole cluster from the master node only. First, we present an introduction to big data, covering definitions, the challenges it poses, the motivation for building such an interface, and the functionality of the system. Next, we present the problem statement, which focuses on the reasons for building the interface. Following that, we present the proposed solution, describing the characteristics of the big data analysis system interface and the details of each component. In addition, we compare the situation before the interface with the situation after it. Finally, we present the conclusion and references.


1. Introduction: Big data is a new concept in computer science that has attracted great interest from many researchers. First, we have to know the definition of big data: "Big data is where the data volume, acquisition velocity, or data representation limits the ability to perform effective analysis using traditional relational approaches or requires the use of significant horizontal scaling for efficient processing". One of the most important research questions in big data is how to analyse it, and this is not an easy job because of many challenges. These challenges arise mainly from the need to extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and/or analysis. In addition, building a big data system is very complex, and dealing with it is hard because it is a new research area, especially for new researchers, so building and managing such systems is not easy. Research in this new scientific field is difficult not only for the previous reasons but also because of the limited knowledge about it, so researchers always need the available tools, techniques and methodologies to help in their research; however, dealing with such tools without background experience or usability support is hard and makes some researchers lose much time. This paper therefore introduces an easy interface that helps in dealing with a big data analysis system based on the open-source Apache HADOOP framework. The interface makes building a big data analysis system easy: the researcher can build a multi-node cluster for analysing big data and manage this cluster by adding new nodes, deleting nodes, running Apache HADOOP to apply a map/reduce job on the cluster of machines and stopping it after finishing; it also allows using the Apache HADOOP web interfaces to monitor running jobs.

2. Problem statement: As mentioned above, a big data analysis system is very complex to build and manage because it uses huge amounts of data of different types, and managing such a complex system from command prompts is hard and time consuming and requires wide knowledge of the commands and administration of different operating systems.

3. Proposed solution: We propose an easy graphical user interface that builds a multi-node cluster using the open-source Apache HADOOP framework to extract value from big data and manage such systems easily. The system offers the following functionality: adding a new node, removing a node, running HADOOP, running a map/reduce job, using the Apache HADOOP web interfaces to monitor running jobs and tasks and to browse the file system, and stopping HADOOP. We describe each of them below:

3.1. Adding new node: With this option the user can add a new node to the cluster and configure it to be a working node and part of the system. The user just connects the machine (with SSH installed on it) to the cluster and then writes the machine IP in the shown window; the system then takes responsibility for configuring the new machine to act as a working node. These configurations are divided into two major parts:

3.1.1. Machine configuration: Configuration of the machine itself, consisting of the following steps (a sketch of how these steps might be scripted over SSH follows the list):
- Add a new HADOOP user.
- Generate public and private keys for secure SSH access.
- Install Java.
- Disable IPv6.
- Initiate the new slave environment.
- Change the host name to a representative one.
- Add HADOOP to the new slave's file system.
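The paper does not show the implementation behind this option, so the following is only a minimal sketch of how a desktop tool might drive these steps over SSH from the master. The root login, Ubuntu-style commands, Java package name and installation paths are assumptions for illustration; the actual commands the interface runs may differ.

import java.io.IOException;

// Minimal sketch: run the machine-configuration steps on a new slave over SSH.
// Assumes password-less root SSH access to the new node; all names are illustrative.
public class SlaveMachineSetup {

    private final String slaveIp;   // IP entered by the user in the "Add node" window

    public SlaveMachineSetup(String slaveIp) {
        this.slaveIp = slaveIp;
    }

    // Run one remote command over SSH and wait for it to finish.
    private int runRemote(String command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("ssh", "root@" + slaveIp, command);
        pb.inheritIO();                      // show the remote output in the tool's console
        return pb.start().waitFor();
    }

    public void configure() throws IOException, InterruptedException {
        runRemote("addgroup hadoop; adduser --ingroup hadoop --disabled-password --gecos '' hduser"); // new HADOOP user
        runRemote("su - hduser -c \"ssh-keygen -t rsa -P '' -f /home/hduser/.ssh/id_rsa\"");          // key pair for secure access
        runRemote("apt-get install -y openjdk-7-jdk");                                 // install Java (assumed package)
        runRemote("sysctl -w net.ipv6.conf.all.disable_ipv6=1");                       // disable IPv6
        runRemote("hostname slave-" + slaveIp.replace('.', '-'));                      // representative host name
        // Add HADOOP to the new slave's file system by copying it from the master.
        new ProcessBuilder("scp", "-r", "/usr/local/hadoop", "root@" + slaveIp + ":/usr/local/")
                .inheritIO().start().waitFor();
    }
}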

3.1.2. HADOOP configuration: Configuration of the HADOOP framework itself, consisting of the following steps (a sketch of the corresponding property settings follows the list):
- Configure the HADOOP home directory.
- Configure the HADOOP environment (hadoop-env.sh).
- Configure HADOOP HDFS (hdfs-site.xml).
- Configure HADOOP Map/Reduce (mapred-site.xml).
- Configure the HADOOP core site (core-site.xml).
- Format HDFS.
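The paper does not reproduce the configuration files it writes, but in Hadoop 1.x these steps normally boil down to a handful of properties spread across core-site.xml, hdfs-site.xml and mapred-site.xml on every node. The sketch below expresses the equivalent settings through Hadoop's own Configuration API; the host name "master", the port numbers and the replication factor are assumptions, not values taken from the paper.

import org.apache.hadoop.conf.Configuration;

// Sketch of the cluster-wide settings the interface would have to write
// (normally into core-site.xml, hdfs-site.xml and mapred-site.xml).
// "master", the ports and the replication factor are illustrative assumptions.
public class ClusterConfigSketch {

    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://master:54310");    // core-site.xml: HDFS namenode address
        conf.set("mapred.job.tracker", "master:54311");        // mapred-site.xml: jobtracker address
        conf.setInt("dfs.replication", 3);                     // hdfs-site.xml: block replication factor
        conf.set("hadoop.tmp.dir", "/app/hadoop/tmp");         // core-site.xml: working directory
        return conf;
    }
}

Formatting HDFS is then typically done once from the master, before the cluster is started for the first time, with the command bin/hadoop namenode -format.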

3.2. Removing node: When the user needs to remove a working node from the cluster, he just clicks "Remove node"; the system then stops HADOOP if it is running and asks him to enter the IP of the node to be removed. After that, the system deletes this IP from the HADOOP slaves file on the master node, and with that the node is removed from the cluster.
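In Hadoop 1.x the worker list lives in the plain-text conf/slaves file on the master, one host or IP per line, so removing a node can be as simple as deleting its line. The following is only a sketch of that step, assuming the usual /usr/local/hadoop installation path; the paper does not show its actual code.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: drop one IP from Hadoop's conf/slaves file on the master node.
// The installation path is an assumption for illustration.
public class RemoveSlave {

    public static void remove(String ipToRemove) throws IOException {
        Path slaves = Paths.get("/usr/local/hadoop/conf/slaves");
        List<String> remaining = Files.readAllLines(slaves).stream()
                .filter(line -> !line.trim().equals(ipToRemove))   // keep every other worker
                .collect(Collectors.toList());
        Files.write(slaves, remaining);                            // rewrite the file without the node
    }
}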

3.3. Run HADOOP: Running HADOOP means making all nodes active for work. Here "all nodes" means activating all of the following:
- Namenode: the master daemon that manages the data inside HDFS.
- Secondary namenode: the node that takes snapshots of the work in progress, to be used in case of failure.
- Job tracker: the master daemon that manages the job as a whole, divides it into small tasks and then assigns these tasks to tasktrackers to accomplish them.
- All tasktrackers: the slave daemons to which the master assigns tasks to accomplish.
- All datanodes: the slave daemons to which the master assigns data to work on.

The window below presents the options described above.

Figure 1: Home window

Now we need to click Run HADOOP to activate all nodes.

Figure 2: Run HADOOP
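The paper does not say what the Run HADOOP button executes behind the scenes, but in Hadoop 1.x all of the daemons listed in section 3.3 are normally started from the master with bin/start-all.sh (and stopped with bin/stop-all.sh, which would correspond to the Stop HADOOP action in section 3.8). A minimal sketch of how the button handler might call these scripts, assuming the usual /usr/local/hadoop installation path, is:

import java.io.IOException;

// Sketch of the "Run HADOOP" and "Stop HADOOP" actions: start or stop every daemon
// in the cluster from the master. The installation path is an assumption;
// Hadoop 1.x ships start-all.sh / stop-all.sh in its bin/ directory.
public class HadoopControl {

    private static final String HADOOP_HOME = "/usr/local/hadoop";   // assumed install location

    public static int startAll() throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(HADOOP_HOME + "/bin/start-all.sh");
        pb.inheritIO();                    // stream daemon start-up messages to the GUI console
        return pb.start().waitFor();       // non-zero exit code signals a start-up problem
    }

    public static int stopAll() throws IOException, InterruptedException {
        return new ProcessBuilder(HADOOP_HOME + "/bin/stop-all.sh").inheritIO().start().waitFor();
    }
}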


HADOOP now prepares the nodes to become active (this takes a few seconds).

Figure 3: HADOOP preparation

Now HADOOP is in the running state.

Figure 4: HADOOP activates all nodes for work


3.4. Run Map/Reduce job: Running a Map/Reduce job means running the application logic that you intend to execute. In our case we are going to compute the frequency of every word in a huge collection of text files. After clicking Run application you will get a new window for browsing to the jar file to be executed.

Figure 5: Run application

The system then asks the user to enter the name of the file in which the output of the analysis will be stored.

Figure 6: Enter output file
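The paper does not include the source of the jar it submits, but the word-frequency job it describes is the classic Hadoop word count. A compact sketch of such a job, written against the Hadoop 1.x mapreduce API, might look as follows (class and path names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of the kind of word-frequency job the interface submits to the cluster.
public class WordCount {

    // Emits (word, 1) for every word in every input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");    // Hadoop 1.x style job creation
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output name entered in Figure 6
        job.waitForCompletion(true);
    }
}

From the command line the equivalent submission would be bin/hadoop jar wordcount.jar WordCount <input dir> <output dir>; the interface presumably performs the same submission with the jar and output name chosen in Figures 5 and 6.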


3.5. Monitor jobs from the job tracker: Apache HADOOP provides a web interface for monitoring jobs. This interface allows the user to see jobs whether they are completed, in progress or failed.

Figure 7: Call job tracker for monitoring
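The paper does not state the addresses behind these monitoring buttons, but Hadoop 1.x serves its web pages on well-known default ports: 50030 for the jobtracker, 50060 for each tasktracker and 50070 for the namenode, which also hosts the file system browser used in sections 3.6 and 3.7. A sketch of how the interface might open them, with "master" as an assumed host name, is:

import java.awt.Desktop;
import java.io.IOException;
import java.net.URI;

// Sketch: open the standard Hadoop 1.x monitoring pages in the default browser.
// The host name "master" is an assumption; the ports are the Hadoop 1.x defaults.
public class WebMonitor {

    private static final String MASTER = "master";

    public static void openJobTracker() throws IOException {
        browse("http://" + MASTER + ":50030/jobtracker.jsp");     // jobs: completed, running, failed
    }

    public static void openTaskTracker(String slaveHost) throws IOException {
        browse("http://" + slaveHost + ":50060/tasktracker.jsp"); // tasks on one slave
    }

    public static void openNamenode() throws IOException {
        browse("http://" + MASTER + ":50070/");                   // HDFS status and file system browser
    }

    private static void browse(String url) throws IOException {
        Desktop.getDesktop().browse(URI.create(url));
    }
}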


After clicking on the jobtracker interface you will see the following web page, which shows all jobs applied previously and in progress, and you will also see all the active nodes working to accomplish the job in progress.

Figure 8: Job tracker interface


3.6. Monitor tasks from the task trackers: After monitoring the job tracker to see the job progress and the number of working nodes, we can see the task trackers and their progress on the tasks assigned to them.

Figure 9: Call tasktracker


Now we can see one of the tasktrackers while it applies its tasks, whether those tasks are completed or still running.

Figure 10: Tasktracker interface


3.7. Browse the file system from the namenode: Here you can see the namenode interface and browse the file system by clicking on "namenode interface", as follows:

Figure 11: Call namenode interface


Now you can see the namenode interface, browse the file system, see all the data in HDFS, and also see the live nodes used to run the job.

Figure 12: Namenode interface


After clicking on the browse file system link and following some steps, you will see the content of the file system, including the output file that you named earlier to hold the output data.

Figure 13: File system content


Now you can click on the link whose content you want to see, for example the file called "adel_Awd_El-agawany", and you can download it to the local host.

Figure 14: Show the file content
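Downloading the result does not have to go through the browser; the same output file can be copied to the local machine with Hadoop's FileSystem API (or, equivalently, bin/hadoop fs -get). A sketch, reusing the assumed cluster settings from section 3.1.2 and illustrative file paths, is:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: copy the job's output file from HDFS to the local disk.
// The HDFS address and both paths are illustrative assumptions.
public class FetchOutput {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://master:54310");                  // same namenode address as in 3.1.2
        FileSystem hdfs = FileSystem.get(conf);
        hdfs.copyToLocalFile(new Path("/user/hduser/output/part-r-00000"),   // reducer output in HDFS
                             new Path("/home/hduser/wordcount-result.txt")); // local destination
    }
}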


3.8. Stop HADOOP: After completing the job, you may need to stop HADOOP and deactivate all nodes (this takes a few seconds).

Figure 15: Stop HADOOP

Now HADOOP is stopped.

Figure 16: HADOOP is stopped


4. Comparison: After using the above interface to build and manage a big data analysis system, instead of doing it manually through the command prompt and other error-prone techniques, the user feels a huge difference and saves a great deal of time.

5. Conclusion: The era of big data is very important and useful for research and contains a huge amount of knowledge to be explored. Getting value from big data is a very hard job because of many challenges, including complexity. This paper tackled one of the most important big data analysis challenges, usability, by introducing a graphical user interface for users of big data analysis systems. This interface helps users easily build and manage a big data analysis system by adding new nodes to the cluster, removing nodes, running HADOOP to apply a batch job, and also monitoring job progress and task progress.

6. References:
http://www.michael-noll.com/tutorials/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
http://mysolvedproblem.blogspot.com/2012/05/installing-hadoop-on-ubuntu-linux-on.html
http://www.datasciencecentral.com/profile/YonggangWen
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6842585