
Big Data Infrastructure: A Survey

Jaime Salvador1, Zoila Ruiz1, and Jose Garcia-Rodriguez2(B)

1 Universidad Central del Ecuador, Ciudadela Universitaria, Quito, Ecuador {jsalvador,zruiz}@uce.edu.ec

2 Universidad de Alicante, Ap. 99, 03080 Alicante, Spain [email protected]

Abstract. In recent years, the volume of information has grown faster than ever before, moving from small datasets to huge volumes of information. This data growth has forced researchers to look for new alternatives to process and store data, since traditional techniques are limited by the size and structure of the information. At the same time, the parallel computing power of new processors has gradually increased, from single-processor architectures to multiple processors, cores and threads. This has enabled machine learning techniques to take advantage of the parallel processing capabilities offered by new architectures on large volumes of data. The present paper reviews, and proposes a classification of, the hardware infrastructures used in machine learning works that apply parallel approaches to large volumes of data.

Keywords: Machine learning · Big data · Hadoop · MapReduce · GPU

1 Introduction

Machine Learning tries to imitate human intelligence using machines. Machine learning algorithms (supervised and unsupervised) use data to train a model. Depending on the problem modeled, the data may be of the order of gigabytes. For this reason, a storage system optimized for large volumes of data (Big Data) is indispensable.

In recent years, the term Big Data has become very popular because of the large amount of information available from many different sources. The amount of data is increasing very fast, and the technology to manipulate large volumes of information is key when the purpose is the extraction of relevant information. The complexity of the data demands the creation of new architectures that optimize the computation time and the resources necessary to extract valuable knowledge from the data [44].

Traditional techniques are oriented to processing information in clusters. With the evolution of the graphics processing unit (GPU), alternatives appeared to take full advantage of the multiprocessing capacity of this type of architecture. Graphics processors are in transition from processors specialized in graphics acceleration to general-purpose engines for high-performance computing [7]. This type
© Springer International Publishing AG 2017
J.M. Ferrández Vicente et al. (Eds.): IWINAC 2017, Part II, LNCS 10338, pp. 249–258, 2017.
DOI: 10.1007/978-3-319-59773-7_26


of systems is known as GPGPU (General-Purpose computing on Graphics Processing Units). In 2008, the Khronos Group introduced the OpenCL (Open Computing Language) standard, a model for parallel programming. Its main competitor is NVIDIA CUDA (Compute Unified Device Architecture). CUDA devices increase the performance of a system due to the high degree of parallelism they are able to control [27].

This document reviews platforms, languages, and many other features of the most popular machine learning frameworks, and offers a classification for those who want to get started in this field. In addition, an exhaustive review of works using machine learning techniques to deal with Big Data is carried out, including their most relevant features.

The remainder of this document is organized as follows. Section 2 describes machine learning techniques and the most popular platforms, followed by an overview of Big Data processing. Section 3 presents a summary table with several classification criteria. Finally, in Sect. 4 some conclusions and opportunities for future work are presented.

2 Machine Learning for Big Data

This section reviews Machine Learning and Big Data processing platform concepts. The first part presents a classification of machine learning techniques and then summarizes some platforms oriented to the implementation of the algorithms.

2.1 Machine Learning

The main goal of machine learning is to create systems that learn or extract knowledge from data and use that knowledge to predict future situations, states or trends [29].

Machine Learning techniques can be grouped into the following categories:

– Classification algorithms. A system learns from a set of sample data (labeled data), from which a set of classification rules, called a model, is built. These rules are used to classify new information [34].

– Clustering algorithms. These form groups (called clusters) of instances that share common characteristics, without any prior knowledge [34].

– Recommendation algorithms. These predict patterns of preferences and use those preferences to make recommendations to users [36].

– Dimensionality reduction algorithms. These reduce the number of variables (attributes) of a dataset without affecting the predictive capacity of the model [14].
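The clustering category above can be illustrated with a minimal k-means sketch. This is a toy, pure-Python version for intuition only; the platforms surveyed below (Mahout, MLlib, FlinkML) implement distributed variants of this kind of algorithm:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means on 2-D points: assign each point to its nearest
    centroid, then recompute each centroid as the cluster mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster empties out
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids

# Two well-separated blobs: one centroid settles near each blob mean.
data = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids = sorted(kmeans(data, 2))
print(centroids)
```

On large datasets the assignment step is what the Big Data platforms parallelize: each node computes assignments for its partition of the points, and only the partial sums travel over the network.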

Scalability is an important aspect to consider in a learning method. This capacity is defined as the ability of a system to adapt to increasing demands in terms of information processing. To support this process on large volumes of data, the platforms incorporate different forms of scaling [43]:


– Horizontal scaling involves distributing the process over several nodes (computers).

– Vertical scaling involves adding more resources to a single node (computer).
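The two forms of scaling can be contrasted with a small sketch: the same per-record function is fanned out over a pool of workers on a single node (vertical scaling), whereas distributing that map over several networked machines would be horizontal scaling. This stand-alone illustration is not the API of any platform above; it uses `multiprocessing.dummy` (threads) so it runs anywhere, and swapping in `multiprocessing.Pool` would use one OS process per core:

```python
from multiprocessing.dummy import Pool  # thread pool; multiprocessing.Pool uses real processes

def tokens(line):
    """Per-record work that each worker performs independently."""
    return len(line.split())

lines = ["big data needs parallel processing",
         "machine learning at scale"] * 1000

# Vertical scaling: the same map, fanned out over 4 workers on one node.
with Pool(4) as pool:
    counts = pool.map(tokens, lines)

print(sum(counts))
```

Because each record is processed independently, the same `map` could be partitioned across cluster nodes without changing `tokens`, which is exactly the property the MapReduce-style platforms below exploit.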

Machine Learning Libraries. In this section we describe several libraries that implement some of the algorithms included in the previous classification. These libraries are considered from the point of view of integration into new systems, not as end-user tools.

Apache Mahout1 provides implementations of many of the most popular machine learning algorithms, including clustering, classification, and many others. To scale across a cluster, it uses Apache Hadoop [2].

MLlib2 is part of the Apache Spark project and implements several machine learning algorithms. Among the groups of algorithms it implements are classification, regression, clustering and dimensionality reduction [30].

FlinkML3 is the Flink machine learning library. Its goal is to provide scalable machine learning algorithms, along with tools that help the design of machine learning systems [2].

Table 1 describes the libraries presented above. For each one, the supported programming languages are detailed:

Table 1. Machine learning tool by language

Tool           Java  Scala  Python  R
Apache Mahout  x
Spark MLlib    x     x      x       x
FlinkML        x     x

As shown above, the most common language is Java, which shows a tendency when implementing systems that integrate machine learning techniques.

Table 2 describes each library and its support for scaling when working with large volumes of data:

Table 2. Machine learning tool scaling

Tool           H+V  Multicore  GPU  Cluster
Apache Mahout  x    x               x
Spark MLlib    x    x          x    x
FlinkML        x    x               x

1 http://mahout.apache.org/.
2 http://spark.apache.org/mllib/.
3 https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/index.html.


In the table we can see that the scaling techniques are combined with each other to obtain platforms with better performance.

2.2 Platforms for Big Data Processing

Big Data is defined as a large and complex collection of data that is difficult to process with a relational database system. The typical size of the data in this type of problem is of the order of terabytes or petabytes, and it is in constant growth [23].

There are many platforms to work with Big Data; among the most popular we can mention Hadoop, Spark (two open source projects from the Apache Foundation) and MapReduce.

MapReduce [9] is a programming model oriented to the processing of large volumes of data [16]. Problems addressed using MapReduce should be separable into small tasks that can be processed in parallel [26]. This programming model is adopted by several specific implementations (platforms) such as Apache Hadoop and Spark.
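The model can be sketched in a few lines: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The following is an in-process illustration of the concept using word count (the canonical example), not Hadoop's actual API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Emit (key, value) pairs: one (word, 1) per token.
    return [(w, 1) for w in record.lower().split()]

def shuffle(pairs):
    # Group values by key, as the framework does between the two phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(key, values):
    # Aggregate all values emitted for one key.
    return key, sum(values)

records = ["big data big clusters", "data in clusters"]
mapped = chain.from_iterable(map_phase(r) for r in records)
result = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'clusters': 2, 'in': 1}
```

In a real deployment, `records` would be partitions of a distributed file, map and reduce tasks would run on different nodes, and the shuffle would move data over the network; only the three user-defined functions keep this shape.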

Apache Hadoop4 is an open source project sponsored by the Apache Software Foundation. It allows distributed processing of large volumes of data in a cluster [21]. It consists of three fundamental components [19]: HDFS (storage component), YARN (resource planning component) and MapReduce.

Apache Spark5 is a cluster computing system based on the MapReduce concept. Spark supports interactive computing, and its main objective is to provide high performance by keeping resources and computations in memory [29].

H2O6 is an open source framework that provides libraries for parallel processing, information analysis and machine learning, along with data processing and evaluation tools [29].

Apache Storm7 is an open source system for real-time distributed computing. An application created with Storm is designed as a directed acyclic graph (DAG) topology.

Apache Flink8 is an open source platform for batch and stream processing.

Table 3 compares each platform with the type of scaling supported and the storage system used. As can be seen, all the platforms support all varieties of storage. However, in large implementations, local storage (or a pseudo-cluster) is not an option to consider.

4 http://hadoop.apache.org/.
5 http://spark.apache.org/.
6 http://www.h2o.ai/h2o/.
7 http://storm.apache.org/.
8 https://flink.apache.org/.


Table 3. Platform scaling

                   Scaling                     Storage
Tool           HV  Multicore  GPU   Cluster   IM(a)  Local  DFS(b)
Apache Hadoop  x   x                x                x      x
Apache Spark   x   x          x     x         x      x      x
H2O            x   x          x(c)  x         x      x      x
Apache Storm   x   x                x         x      x      x
Apache Flink   x   x                x         x      x      x

(a) In-memory.
(b) Distributed File System.
(c) Through Deep Water (http://www.h2o.ai/deep-water/), in beta state.

3 Machine Learning for Big Data Review

This section proposes a classification of research works that use Big Data platforms to apply machine learning techniques, based on three criteria: language (programming language supported), scaling (scalability) and storage (supported storage type).

The combination of these three factors allows the proper selection of a platform to integrate machine learning techniques with Big Data. Scaling is related to the type of storage: in the case of horizontal scaling, a distributed file system must be used.

Table 4 shows the classification based on the above mentioned criteria.

3.1 Language

The supported programming language is an important aspect when selecting a platform. Taking into account that this work is oriented to platforms that are not end-user tools (tools like Weka9, RapidMiner10, etc. are not considered), the following programming languages are considered: Java, Scala, Python and R.

The most common language in this type of implementation is Java. However, over time, layers of software have appeared that abstract access to certain technologies. With these new technologies, it is possible to use the platforms described in this document from different programming languages.

3.2 Scaling

Scaling focuses on the possibility of including more processing nodes (horizontal scaling) or including parallel processing within the same node (vertical scaling), for example by using graphics cards. This document considers horizontal, vertical, cluster and GPU scaling.

9 http://www.cs.waikato.ac.nz/ml/weka/.

10 https://rapidminer.com/.


Table 4. Language-scaling-storage (LSS) classification

Article(a)                    Language             Scaling             Storage
Al-Jarrah et al. [1]                               V                   DFS, IM
Aridhi and Mephu [2]          Scala, Java          Cluster             DFS, IM
Armbrust et al. [3]           Scala, Java, Python  Cluster             Local, DFS, IM
Bertolucci et al. [4]         Java                 Cores, Cluster      DFS, IM
Borthakur [5]                 Java                 Cluster             Local, DFS
Castillo et al. [6]                                Cluster
Catanzaro et al. [7]                               Cluster, GPU        DFS, IM
Crawford et al. [8]           Scala, Java, Python                      Local, IM
Nagina and Dhingra [10]       Java                 Cluster             Local, DFS
Fan and Bifet [11]            Java                 Cluster, GPU        DFS
Gandomi and Haider [12]                            Vertical
Ghemawat et al. [13]                               Cluster             DFS
Hafez et al. [15]             Scala, Java, Python  Cluster             DFS, IM
Hashem et al. [16]            Java                 H, V, Cluster, GPU  DFS, IM
He et al. [17]                                     V, Cluster          DFS, IM
Hodge et al. [18]             Scala, Java          Cluster             DFS
Issa and Figueira [20]        Java, Python         H, Cluster          DFS, IM
Jackson et al. [21]           Java, Python         Cluster             DFS
Jain and Bhatnagar [22]       Java                 Cluster             DFS
Jiang et al. [23]             Java, Python         Cluster, GPU        DFS, IM
Kacfah Emani et al. [24]      Scala, Java          Cluster             DFS, IM
Kamishima and Motoyoshi [25]
Kiran et al. [26]             Java                 Cluster             DFS, IM
Kraska et al. [28]            Scala                Cluster             DFS, IM
Landset et al. [29]           Scala, Java                              DFS, IM
Meng et al. [30]              Scala, Java          Cluster             DFS
Modha and Spangler [31]                            Cluster             DFS
Naimur Rahman et al. [32]     Java                 Cluster             DFS
Namiot [33]                   Scala, Java          Cluster             DFS
Spangenberg et al. [35]                            Cluster             DFS
Paakkonen [37]                Scala, Java, Python  Cluster             DFS
Ramírez-Gallego et al. [38]   Scala                Core, Cluster       DFS
Saecker and Markl [39]                             H, V, Cluster, GPU  DFS
Salloum et al. [40]           Scala, Java          Cluster, Parallel   DFS
Saraladevi et al. [41]        Java                 Cluster             DFS
Seminario and Wilson [42]     Java                 Cluster             Local, IM, DFS
Singh and Reddy [43]          Scala, Java                              Local, IM, DFS
Singh and Kaur [44]           Java                 H, V, Cluster       DFS
Walunj and Sadafale [45]      Java                                     Local, DFS
Zaharia et al. [46]           Scala, Java          Cluster             DFS

(a) Some articles address only one of the criteria mentioned above, but all articles are included in the table.


As can be seen in Table 4, and in the tables above, all the platforms scale horizontally by using a DFS, which in most cases corresponds to Hadoop. However, platforms that keep data in memory (like Spark) are becoming more popular.

3.3 Storage

Depending on the amount of information and the strategy used to process it, it is possible to decide on the type of storage: local, DFS or in-memory. Each type involves implementing the hardware infrastructure to be used. For example, in the case of a cluster implementation, it is necessary to have adequate equipment to constitute the cluster nodes.

For proofs of concept, it is acceptable to use a pseudo-cluster that simulates a cluster environment on a single machine.

4 Conclusions

In this document, a brief review of machine learning techniques was carried out, oriented to the processing of large volumes of data. Then, some of the platforms that implement several of the techniques described in the document were analyzed. Most of the reviewed articles use the techniques, languages and scaling described in this document.

In general, we can conclude that all these techniques are scalable in one way or another. It is possible to start from a local architecture (or pseudo-cluster) for a proof-of-concept test and progressively scale to more complex architectures like a cluster, evidencing the need for distributed storage (DFS). The programming language should not be a factor when selecting a platform, since nowadays there are interfaces that allow the use of the mentioned platforms in a growing variety of languages.

Future work may be proposed to perform a similar study considering the time factor, which determines two forms of processing: batch and online processing. Finally, streaming processing could be incorporated into the study, since it is a fundamental part of some of the platforms described in this document.

Acknowledgements. This work has been funded by the Spanish Government TIN2016-76515-R grant for the COMBAHO project, supported with Feder funds.

References

1. Al-Jarrah, O.Y., Yoo, P.D., Muhaidat, S., Karagiannidis, G.K., Taha, K.: Efficient machine learning for big data: a review. Big Data Res. 2(3), 87–93 (2015)

2. Aridhi, S., Mephu, E.: Big graph mining: frameworks and techniques. Big Data Res. 6, 1–10 (2016)


3. Armbrust, M., Ghodsi, A., Zaharia, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J.: Spark SQL: relational data processing in Spark. In: Proceedings of the ACM SIGMOD International Conference on Management of Data - SIGMOD 2015, pp. 1383–1394 (2015)

4. Bertolucci, M., Carlini, E., Dazzi, P., Lulli, A., Ricci, L.: Static and dynamic big data partitioning on Apache Spark. Adv. Parallel Comput. 27, 489–498 (2016)

5. Borthakur, D.: HDFS architecture guide. Hadoop Apache Project, 1–13 (2008). https://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf

6. Castillo, S.J.L., del Castillo, J.R.F., Sotos, L.G.: Algorithms of machine learning for K-clustering. In: Demazeau, Y., et al. (eds.) Trends in Practical Applications of Agents and Multiagent Systems. AISC, vol. 71, pp. 443–452. Springer, Heidelberg (2010)

7. Catanzaro, B., Sundaram, N., Keutzer, K.: Fast support vector machine training and classification on graphics processors. In: Machine Learning, pp. 104–111 (2008)

8. Crawford, M., Khoshgoftaar, T.M., Prusa, J.D., Richter, A.N., Al Najada, H.: Survey of review spam detection using machine learning techniques. J. Big Data 2(1), 23 (2015)

9. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of 6th Symposium on Operating Systems Design and Implementation, pp. 137–149 (2004)

10. Nagina, Dhingra, S.: Scheduling algorithms in big data: a survey. Int. J. Eng. Comput. Sci. 5(8), 17737–17743 (2016)

11. Fan, W., Bifet, A.: Mining big data: current status, and forecast to the future. ACM SIGKDD Explor. Newslett. 14(2), 1–5 (2013)

12. Gandomi, A., Haider, M.: Beyond the hype: big data concepts, methods, and analytics. Int. J. Inf. Manag. 35(2), 137–144 (2015)

13. Ghemawat, S., Gobioff, H., Leung, S.: Google file system (2003)
14. Guller, M.: Big Data Analytics with Spark (2015). ISBN 9781484209653
15. Hafez, M.M., Shehab, M.E., El Fakharany, E., Abdel Ghfar Hegazy, A.E.F.: Effective selection of machine learning algorithms for big data analytics using Apache Spark. In: Hassanien, A.E., Shaalan, K., Gaber, T., Azar, A.T., Tolba, M.F. (eds.) AISI 2016. AISC, vol. 533, pp. 692–704. Springer, Cham (2017). doi:10.1007/978-3-319-48308-5_66

16. Hashem, I.A.T., Anuar, N.B., Gani, A., Yaqoob, I., Xia, F., Khan, S.U.: MapReduce: review and open challenges. Scientometrics 109(1), 1–34 (2016)

17. He, Q., Li, N., Luo, W.J., Shi, Z.Z.: A survey of machine learning for big data processing. Moshi Shibie yu Rengong Zhineng/Pattern Recogn. Artif. Intell. 27(4), 327–336 (2014)

18. Hodge, V.J., O'Keefe, S., Austin, J.: Hadoop neural network for parallel and distributed feature selection. Neural Netw. 78, 24–35 (2016)

19. Holmes, A.: Hadoop in Practice. Manning, 2nd edn. (2015). ISBN 9781617292224
20. Issa, J., Figueira, S.: Hadoop and memcached: performance and power characterization and analysis. J. Cloud Comput.: Adv. Syst. Appl. 1(1), 10 (2012)
21. Jackson, J.C., Vijayakumar, V., Quadir, M.A., Bharathi, C.: Survey on programming models and environments for cluster, cloud, and grid computing that defends big data. Procedia Comput. Sci. 50, 517–523 (2015)

22. Jain, A., Bhatnagar, V.: Crime data analysis using Pig with Hadoop. Phys. Procedia 78(December 2015), 571–578 (2016)

23. Jiang, H., Chen, Y., Qiao, Z., Weng, T.H., Li, K.C.: Scaling up MapReduce-based big data processing on multi-GPU systems. Cluster Comput. 18(1), 369–383 (2015)


24. Kacfah Emani, C., Cullot, N., Nicolle, C.: Understandable big data: a survey. Comput. Sci. Rev. 17, 70–81 (2015)

25. Kamishima, T., Motoyoshi, F.: Learning from cluster examples. Mach. Learn. 53(3), 199–233 (2003)

26. Kiran, M., Kumar, A., Mukherjee, S., Ravi Prakash, G.: Verification and validation of MapReduce program model for parallel support vector machine algorithm on Hadoop cluster. Int. Conf. Adv. Comput. Commun. Syst. (ICACCS) 4(3), 317–325 (2013)

27. Kirk, D., Hwu, W.-M.W.: Programming Massively Parallel Processors: A Hands-on Approach (2010). ISBN 0123814723

28. Kraska, T., Talwalkar, A., Duchi, J., Griffith, R., Franklin, M., Jordan, M.: MLbase: a distributed machine-learning system. In: 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013) (2013)

29. Landset, S., Khoshgoftaar, T.M., Richter, A.N., Hasanin, T.: A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J. Big Data 2(1), 24 (2015)

30. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M., Talwalkar, A.: MLlib: machine learning in Apache Spark. J. Mach. Learn. Res. 17, 1–7 (2016)

31. Modha, D.S., Spangler, W.S.: Feature weighting in k-means clustering. Mach. Learn. 52(3), 217–237 (2003)

32. Naimur Rahman, M., Esmailpour, A., Zhao, J.: Machine learning with big data: an efficient electricity generation forecasting system. Big Data Res. 5, 9–15 (2016)

33. Namiot, D.: On big data stream processing. Int. J. Open Inf. Technol. 3(8), 48–51 (2015)

34. Nguyen, T.T.T., Armitage, G.: A survey of techniques for internet traffic classification using machine learning. IEEE Commun. Surveys Tutorials 10(4) (2008)

35. Spangenberg, N., Roth, M., Franczyk, B.: Evaluating new approaches of big data analytics frameworks. In: Abramowicz, W. (ed.) BIS 2015. LNBIP, vol. 208, pp. 28–37. Springer, Cham (2015). doi:10.1007/978-3-319-19027-3_3

36. Owen, S., Anil, R., Dunning, T., Friedman, E.: Mahout in Action. Manning (2012).ISBN 9781935182689

37. Paakkonen, P.: Feasibility analysis of AsterixDB and Spark streaming with Cassandra for stream-based processing. J. Big Data 3(1), 6 (2016)

38. Ramírez-Gallego, S., Mouriño-Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Benítez, J.M., Alonso-Betanzos, A., Herrera, F.: Un Framework de Selección de Características basado en la Teoría de la Información para Big Data sobre Apache Spark

39. Saecker, M., Markl, V.: Big data analytics on modern hardware architectures: a technology survey. In: Aufaure, M.-A., Zimányi, E. (eds.) Business Intelligence. LNBIP, vol. 138, pp. 125–149. Springer, Heidelberg (2013). doi:10.1007/978-3-642-36318-4_6

40. Salloum, S., Dautov, R., Chen, X., Peng, P.X., Huang, J.Z.: Big data analytics on Apache Spark. Int. J. Data Sci. Anal. 1(3), 145–164 (2016)

41. Saraladevi, B., Pazhaniraja, N., Paul, P.V., Basha, M.S.S., Dhavachelvan, P.: Big data and Hadoop - a study in security perspective. Procedia Comput. Sci. 50, 596–601 (2015)

42. Seminario, C.E., Wilson, D.C.: Case study evaluation of Mahout as a recommender platform. CEUR Workshop Proc. 910(September 2012), 45–50 (2012)


43. Singh, D., Reddy, C.K.: A survey on platforms for big data analytics. J. Big Data 2(1), 8 (2015)

44. Singh, R., Kaur, P.J.: Analyzing performance of Apache Tez and MapReduce with Hadoop multinode cluster on Amazon cloud. J. Big Data 3(1), 19 (2016)

45. Walunj, S.G., Sadafale, K.: An online recommendation system for e-commerce based on Apache Mahout framework. In: Proceedings of the Annual Conference on Computers and People Research, pp. 153–158 (2013)

46. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud 2010, Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, p. 10 (2010)