
White Paper Machine Learning in the Enterprise Hadoop

 

©2014 Alethe Labs All rights reserved. Alethe Labs and the Alethe Labs logo are trademarks or registered trademarks of Alethe Labs. All other trademarks are the property of their respective companies. Information is subject to change without notice.

1. Abstract

Over the past few years, organizations have been storing huge data sets and using this Big Data as a competitive advantage. The challenge lies in the fact that a human has to intervene in every scenario of processing and analyzing Big Data. The ideal case is when machines can collaborate with and help humans in decision-making. This white paper examines the role of machine learning in the most popular Big Data platform, Hadoop. It is one of a series of white papers we plan to publish on the role of machine learning with Hadoop in enhancing business processes.

2. Key Words

Hadoop, Big Data, Machine Learning

3. Introduction

Organizations today generate huge amounts of data. This data includes both unstructured and structured data stored in silos of databases and archive solutions. Managing and analyzing this data with legacy systems is challenging and sometimes impractical. Companies today focus on generating business benefits from their huge data sets. Hadoop is unparalleled as a high-efficiency Big Data platform for storing, processing, and analyzing data in a cost-effective model. The challenge lies in acquiring a knowledge base from the experts: understanding how the data sets are processed, the data interpretation techniques, and how to create value that empowers business decisions. We can either create separate processes for this knowledge base, or we can code the learning technique into the machine itself, i.e. Machine Learning.

4. Apache Hadoop & Big Data

Apache Hadoop is an open source distributed software platform, written in Java, for storing and processing data across multiple servers. Hadoop implements a computational paradigm named Map/Reduce, in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Using Hadoop, you can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster. Following is the Hadoop physical architecture with the master and slave nodes, and the Hadoop logical architecture with Map/Reduce.
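The Map/Reduce split described above can be sketched in plain Java. This is a minimal, single-machine illustration of the paradigm using a word count, not the actual Hadoop API; a real Hadoop job would instead extend Hadoop's Mapper and Reducer classes and run distributed across the cluster:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceSketch {
    // Map phase: each input line is turned into (word, 1) pairs.
    // In Hadoop, many mappers run this step in parallel on different nodes.
    static List<SimpleEntry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> new SimpleEntry<>(w, 1))
                     .collect(Collectors.toList());
    }

    // Reduce phase: pairs are grouped by key (the word) and the
    // counts for each key are summed.
    static Map<String, Integer> reduce(List<SimpleEntry<String, Integer>> pairs) {
        return pairs.stream().collect(
            Collectors.groupingBy(SimpleEntry::getKey,
                                  Collectors.summingInt(SimpleEntry::getValue)));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("big data on hadoop",
                                           "hadoop stores big data");
        List<SimpleEntry<String, Integer>> mapped = input.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.toList());
        Map<String, Integer> counts = reduce(mapped);
        System.out.println(counts.get("hadoop")); // 2
        System.out.println(counts.get("big"));    // 2
    }
}
```

Because each map call depends only on its own input line, the map work can be spread over any number of nodes; the framework then shuffles the pairs by key so each reducer sees all counts for its words.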


In addition to Map/Reduce, the following components are useful in developing machine learning for Big Data systems powered by Hadoop.

5. Machine Learning

Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data. It gives computers the power to learn from data without being explicitly programmed. Computers enabled with machine learning improve their performance by learning from training data and previous outcomes. Collaboration between man and machine improves machine learning outputs faster and gives more accurate results. Human intuition is another key aspect in making the machine give better results every time a new data set is used or a new query is asked.

6. Apache Mahout

Apache Mahout is a library of machine learning algorithms implemented on top of Apache Hadoop using the Map/Reduce paradigm. Once Big Data is available on the Hadoop Distributed File System (HDFS), Mahout provides the tools needed to automatically extract meaningful information from those Big Data sets. Mahout currently addresses three use cases:

a. Recommendation Mining: Mines user behavior and recommends items users are likely to prefer.

b. Clustering: Takes items and puts them into naturally occurring groups, e.g. text documents or audio files.

c. Classification: Learns from the current categorization and assigns unlabeled items to the best-matching group.
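As an illustration of the clustering use case above, the following is a minimal single-machine k-means sketch in plain Java over one-dimensional points. This is a conceptual sketch only, not Mahout code; Mahout implements the same idea as distributed Map/Reduce jobs over data in HDFS:

```java
public class KMeansSketch {
    // One k-means iteration over 1-D points: assign each point to its
    // nearest centroid, then recompute each centroid as the mean of
    // the points assigned to it.
    static double[] step(double[] points, double[] centroids) {
        double[] sum = new double[centroids.length];
        int[] count = new int[centroids.length];
        for (double p : points) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++)
                if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best]))
                    best = c;
            sum[best] += p;
            count[best]++;
        }
        double[] updated = new double[centroids.length];
        for (int c = 0; c < centroids.length; c++)
            updated[c] = count[c] > 0 ? sum[c] / count[c] : centroids[c];
        return updated;
    }

    public static void main(String[] args) {
        // Two naturally occurring groups: values near 1 and values near 9.
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.3, 8.7};
        double[] centroids = {0.0, 5.0};
        for (int i = 0; i < 10; i++)
            centroids = step(points, centroids);
        System.out.printf("%.1f %.1f%n", centroids[0], centroids[1]); // 1.0 9.0
    }
}
```

The assignment step is embarrassingly parallel (each point is handled independently), which is exactly why clustering maps well onto the Map/Reduce paradigm at Big Data scale.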

7. Machine Learning & Big Data

Basic analytical tools for business intelligence work on a limited set of operations, e.g. sum, count, and average. Their capability is also limited to that of the user. Traditional techniques are not well suited to Big Data because the data is too large for a human to test every hypothesis, and the analysis cannot be limited to traditional calculations. Machine learning, on the other hand, is not limited by human capacity: the more data is fed into it, the better the insights.

8. Conclusion

Apache Hadoop provides a highly scalable and cost-effective model for Big Data. Using machine learning techniques such as the Mahout library keeps Hadoop consistently powerful, delivering better results as the data grows.

9. References

- http://hadoop.apache.org
- https://mahout.apache.org
- http://en.wikipedia.org/wiki/Machine_learning