
International Journal of Big Data (ISSN 2326-442X), Vol. 2, No. 4, 2015

BDOA: BIG DATA OPEN ARCHITECTURE

Version 1.0 Liang-Jie Zhang and Huan Chen (eds.)

Services Society (S2)

E-mail: [email protected]

Abstract The initiative of drafting the Big Data Open Architecture (BDOA) started in late 2014, and the Services Society (S2) formally announced the Body of Knowledge of Big Data (BoK) project at the IEEE SERVICES 2015 conference. Many volunteers from all over the world joined the BoK project and contributed comments and suggestions. By the end of 2015, Big Data technologies were no longer brand new to academic researchers or to the industrial world, but had gradually matured. In the last five years, almost every technology vendor and startup has talked about Big Data. Consequently, both the scientific and the industrial worlds expect a Big Data open architecture.

The paper proposes the Big Data Open Architecture (BDOA). We hope that the detailed definition and explanation of the high-level architecture building blocks can help researchers, engineers, teachers, and students not only gain an overview of BDOA but also develop deeper insight into Big Data. In the paper, we first give an overview of the open architecture. Secondly, we illustrate the high-level Architecture Building Blocks (ABBs) of each level of the open architecture, followed by BDOA-based engineering practices. Finally, we summarize the paper and shed some light on future research directions.

Keywords: Big Data, open architecture, reference architecture, BoK

__________________________________________________________________________________________________________________

1. INTRODUCTION

The BDOA project started in late 2014. Its main purposes are to assist researchers, engineers, students, teachers and anyone who works in Big Data related areas in capturing a view on Big Data knowledge areas in terms of solutions, hot topics, difficult issues, and future research directions. The paper introduces the Big Data Open Architecture (BDOA), as well as its background and many related key open source solutions. We hope that readers not only can understand the full scope of Big Data but also can take our engineering practices as a reference.

With the power of Service-Oriented Architecture (SOA), the cloud computing open architecture CCOA (L.-J. Zhang, 2009) and services computing techniques (L.-J. Zhang, 2007) have been widely developed and deployed. It should be noted that cloud computing is, at its essence, the underlying infrastructure of any Big Data system development. In the recent five years, the term 'Big Data' has gained rapid popularity, and it is increasingly applied in diverse working fields, not limited to the Information Technology area.

'Big Data' has one key 'V' - 'value', that is, the value density of the data. Although people have been putting effort into developing various technologies to discover the potential value of massive data, it is still not easy to extract useful information and knowledge straightforwardly. Because Big Data covers an extremely wide range of knowledge, including traditional information technologies, cloud computing techniques, business operation skills, privacy protection, and many other areas, it is difficult even for a Big Data expert to define the whole scope of Big Data. It is even harder for those who did not originally work in the IT area to participate in Big Data research. As a result, the Services Society (S2) proposes BDOA to allow global experts and engineers to share their knowledge about Big Data from various viewpoints.


The paper is organized as follows. The next section proposes an open architecture of Big Data and illustrates the high-level Architecture Building Blocks (ABBs) and the corresponding open source solutions. Section 3 presents two case studies of BDOA. Section 4 concludes the paper and sheds some light on future research directions.

2. BIG DATA OPEN ARCHITECTURE

This section highlights the high-level Architecture Building Blocks (ABBs) of the Big Data Open Architecture.

In an IT solution, the value of all resources can be realized in terms of services, which can be exposed as external service offerings. As a developing concept, Big Data covers all aspects of supporting technologies, business models and applications. The supporting technologies include but are not limited to data-driven storage, networking, computing infrastructure and software-defined technologies. The business models contain innovations around three types of data sources: self-owned data, third-party data, and data marketplaces. Big Data applications comprise consumer-centric applications, enterprise-centric applications, and hybrid applications. An open architecture of Big Data is presented in Figure 1; we name it the Big Data Open Architecture (BDOA).

As shown in Figure 1, there are three types of enablement parts for a Big Data solution. Part A is the technology category, which includes Big Data Infrastructure (shown as Layer 1) and Big Data Processing Components (shown as Layer 2). Part C is related to business applications, which include Big Data Enabled Business Process (shown as Layer 4) and Big Data Interactions (shown as Layer 5). Part B, which bridges Parts A and C, is the Big Data Services layer (shown as Layer 3), which provides reusable services and open APIs for composing scenario-driven business processes.

From the perspective of a solution, Big Data possesses many data sources, such as user-generated data, device-generated data and program-generated data. The Big Data Sources & Integration layer (shown as Layer 6) is responsible for integrating all kinds of data sources, technology elements and business elements. The Big Data Security & Privacy layer (shown as Layer 7) deals with the complete lifecycle of security and privacy management, which covers modeling, enablement, monitoring and refinement. The solution's quality is enabled by the Big Data QoS layer (shown as Layer 8), which covers two aspects: functional service quality and non-functional quality. The Big Data Ecosystem Governance & Management layer (shown as Layer 9) refers to the consensus and agreements on organization structure, policy and rules, which are crucial for an organization to create and enable a Big Data solution.

Figure 1. Big Data Open Architecture

In the remainder of the paper, high-level ABBs are shown in underlined bold font and sub-ABBs in underlined italic font. ABB definitions start with no indent.

2.1 LAYER ONE: BIG DATA INFRASTRUCTURE

The infrastructure supports data collection, storage, networking, computing, processing, and management of Big Data. It includes sensors and sensor networks for data collection; distributed storage systems with massive capacity (or a capacity that can be scaled up with almost no hard limit); high-throughput data networking that connects distributed Big Data resources with broad bandwidth and low latency; large computing resources that can work on Big Data processing massively in parallel; and data management that manages Big Data collection, processing, transportation, networking, storage, computing, and so on.

Cloud: A computing environment that provides online access to a large pool of shared resources. It includes many physical computing servers and a distributed data storage space, and can be scaled up dynamically to support a large number of users (tenants).

Public Cloud: A publicly available cloud computing infrastructure that provides online access to a large pool of shared resources. It can conduct data computing, networking, storage, etc., through a pay-by-usage model.

Private Cloud: A cloud computing infrastructure that provides private online access to a large pool of shared networking, computing, and storage resources that are dynamically available for data processing and services.

Hybrid Cloud: A cloud computing infrastructure that combines resources from private and public clouds for better scalability, capacity, resource utilization, flexibility, availability, and service quality.

Virtualization is one of the essential capabilities to be provided by the cloud infrastructure on Layer 1. It has three main branches: full virtualization (FV), para-virtualization (PV) and containerization. Typical open-source FV, PV and containerization tools include KVM (KVM, 2015), Xen (Xen, 2015), OpenVZ (OpenVZ, 2015) and Docker (Docker, 2015). The main advantages of multi-level virtualization are economic efficiency and better resource isolation.

Docker (Docker, 2015) has gained popularity in the recent three years. With the help of the following tools, a new term, 'CaaS' (Container as a Service), has emerged. CaaS allows continuous delivery, which supports the methodology called 'DevOps' (Development and Operations). Several open source tools are of great importance to the realization of CaaS, including Marathon (Marathon, 2015), Mesos (Mesos, 2015), Kubernetes (Kubernetes, 2015), Shipyard (Shipyard, 2015), DockerUI (DockerUI, 2015), Fleet (Fleet, 2015) and Swarm (Swarm, 2015). CaaS is, at its essence, a PaaS platform, the major components of which belong to Layer 1.
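
As a hedged, minimal illustration of container-based enablement, the sketch below drives Docker from Python via the Docker SDK (docker-py) to launch and inspect containers programmatically; the image name and command are placeholders, not part of BDOA.

```python
# Minimal sketch: driving containerization from Python via the Docker SDK (docker-py).
# Assumes a local Docker daemon is running; the image and command are illustrative only.
import docker

client = docker.from_env()                      # connect to the local Docker daemon

# Run a short-lived container and capture its output (blocks until the container exits).
output = client.containers.run("alpine:3.18", "echo 'hello from a container'", remove=True)
print(output.decode().strip())

# List currently running containers, as a CaaS scheduler or dashboard might.
for c in client.containers.list():
    print(c.short_id, c.image.tags, c.status)
```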

Storage: A data storage that provides online access to a huge distributed storage space with a capacity that can be scaled up with almost no hard limit for Big Data processing.

Parallel Database System: A parallel database system is suited to storing massive structured data. It uses multiple CPUs and disks in parallel, aiming to improve overall performance. In this kind of system, data loading, index building and query evaluation are processed simultaneously. Most parallel database systems support traditional SQL queries.

NoSQL System: There is no explicit definition of 'NoSQL'; it is normally read as 'Not only SQL'. The family of NoSQL systems has grown rapidly in the last decade. It includes key/value databases, document-oriented databases, column-oriented databases, and graph databases.


Distributed File System (DFS): A file system in which data storage is distributed across multiple file systems for scalability, throughput, elimination of single points of failure, flexibility, precaution against hardware failures, and so on. Although a DFS can be used for storing any type of data, it is relatively more suitable for unstructured data. In the last couple of years, many DFS implementations have appeared; however, some shortcomings still remain in these systems. From the perspective of a solution, a DFS should maximize the features that a file system can offer, so people often enhance a DFS with additional features to improve usability. Typical features include low-latency access, support for small files, support for multi-tenant writing and editing, and multi-IDC file synchronization.

After this introduction to the main categories of storage, it should be noted that the storage engines of RDBMSs (Relational Database Management Systems) also belong to this layer. MySQL (MySQL, 2015) and PostgreSQL (PostgreSQL, 2015) are the most well-known open source database systems; SQL Server (SQLServer, 2015) and Oracle Database (Oracle Database, 2015) are the widely used commercial ones.

As an emerging category, 'NoSQL' databases (where 'NoSQL' can be read as 'No SQL' or 'Not only SQL') fit the Big Data era quite well, in particular the document-oriented database MongoDB (MongoDB, 2015), the XML database BaseX (BaseX, 2015) and Apache CouchDB (Apache CouchDB, 2015).

Some of the above data storage systems can be deployed in a distributed manner, for example Hadoop HDFS (Apache Hadoop, 2015), Apache HBase (Apache HBase, 2015) and Apache Cassandra (Apache Cassandra, 2015). In addition, the NoSQL family also contains key/value databases such as Redis (Redis, 2015) and Memcached (Memcached, 2015), alongside MongoDB (MongoDB, 2015).
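
As a small, hedged illustration of how the key/value and document-oriented stores named above are typically accessed, the sketch below uses the redis and pymongo client libraries; the host addresses, database and collection names are placeholders.

```python
# Minimal sketch: writing to a key/value store (Redis) and a document store (MongoDB).
# Assumes local default instances; names such as 'bdoa' and 'events' are illustrative.
import redis
from pymongo import MongoClient

# Key/value access: set and increment a counter.
kv = redis.Redis(host="localhost", port=6379)
kv.set("page:views", 1)
kv.incr("page:views")
print(kv.get("page:views"))            # b'2'

# Document-oriented access: insert and query a JSON-like record.
doc_db = MongoClient("mongodb://localhost:27017")["bdoa"]
doc_db.events.insert_one({"source": "sensor-42", "temperature": 21.5})
print(doc_db.events.find_one({"source": "sensor-42"}))
```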

NoSQL systems carry a large amount of redundancy. Some platforms, for example Apache Parquet (Apache Parquet, 2015), are devoted to optimizing the storage format so as to obtain compressed and easy-to-use data.

Recently, a new category named 'graph database' has emerged, which uses graph structures with nodes, edges and properties to store and represent data for semantic queries. Typical ones are Google Cayley (Cayley, 2015) and Neo4j (Neo4j, 2015).

The functionalities of databases do not have clear boundaries. We can use the CAP theorem (Brewer, E., 2012) to classify databases, as shown in Figure 2.

Figure 2. Classify DBs by CAP Theorem

Software-Defined Technologies: A technical paradigm that separates the hardware from the software that manages the infrastructure, turning its functionalities into services that can be programmed, composed, orchestrated, and utilized in various applications. The entire plane of a software-defined environment is captured in Figure 3.

We would like to emphasize software-defined technologies since they have become more mature in the last three years. OpenStack (Openstack, 2015) is open source software for creating private and public clouds. It can also facilitate the development of SDDC, SDS and SDN.

Figure 3. Software-Defined Environment

Software-Defined Environment (SDE): An environment in which all resources are based on software-defined technologies. Applications can fully utilize the resources in the environment with flexibility and extensibility for optimal results.

Software-Defined Data Center (SDDC): An IT infrastructure for IT-as-a-Service, in which all elements of the infrastructure, including networking, storage, CPU, and security, are delivered as a service, generally by means of virtualization, containerization, and separation of hardware from the software that manages the infrastructure. The SDDC can provide enhanced flexibility, extensibility, efficiency, and manageability for Big Data processing.

Software-Defined Networking (SDN): A data networking paradigm that separates control and data plane in data networking for flexibility, programmability, innovation, bandwidth utilization, etc. The logically centralized controller in SDN can provide a holistic view of the entire network for globally optimized network traffic control, which is especially valuable for Big Data applications.

SDN is a major part of CaaS. In the CaaS environment, typical SDN solutions include but are not limited to Pipework (Pipework, 2015), Weave (Weave, 2015), Flannel (Flannel, 2015) and SocketPlane (SocketPlane, 2015). These solutions allow a quick realization of SDN for CaaS.

Software-Defined Storage (SDS): A data storage paradigm that separates hardware from the software that manages the storage infrastructure, usually through a form of storage virtualization to enable flexible and efficient storage resource management, thin provisioning, replication, deduplication, snapshots, etc.

Content Delivery Network (CDN): A network of distributed servers deployed in multiple Internet Data Centers (IDCs). It is used to serve source contents to end-users with high performance and high availability. Generally, end-users obtain the corresponding contents from the servers that are relatively 'closer' to them.

Legacy Data System: Legacy data is of high historical value. The data is often stored on relatively old or cold devices, that is, legacy devices. By means of registering legacy devices, authorizing legacy devices, and developing algorithms for batch transferring and importing legacy data, the infrastructure layer should provide both hard and soft I/Os to access legacy data.

2.2 LAYER TWO: BIG DATA PROCESSING COMPONENTS

Acquisition: The first step of Big Data processing, which mainly deals with collection and transmission. Therefore, the knowledge area of Big Data acquisition includes two sub-areas, collection and transmission.

Collection: Big Data collection involves how raw data from different data generation environments is collected. Raw data can be collected through many methods: machine log files are automatically generated by data source systems such as web servers; sensing data is collected via wired or wireless sensor networks; web crawlers are used by search engines to crawl data on the Internet.
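
As a hedged illustration of the crawler-based collection method mentioned above, the sketch below fetches pages and extracts their outgoing links with the requests and BeautifulSoup libraries; the seed URL and page limit are illustrative only.

```python
# Minimal sketch of crawler-style data collection (breadth-first, tiny page budget).
# Assumes the requests and beautifulsoup4 packages; the seed URL is a placeholder.
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=10):
    seen, queue, pages = set(), deque([seed]), []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=5)
        pages.append((url, resp.text))                     # collected raw data
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))
    return pages

if __name__ == "__main__":
    for url, html in crawl("https://example.com"):
        print(url, len(html), "bytes of raw HTML collected")
```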

Transmission: Big Data transmission is about moving the collected raw data to a data storage infrastructure for further processing and analysis. It is a great challenge to move huge volumes of data from a data source into a data center, especially when this must be done quickly and efficiently. The main knowledge of Big Data transmission includes transmission mechanisms, communication protocols, data compression algorithms and data hubs.

Cleaning: The knowledge area of Big Data cleaning refers to processing data so as to enable effective data analysis. It includes four sub-knowledge-areas: policy and rules, integration, correction, and redundancy elimination.

Policy and Rules: At the macroeconomic level, Big Data has less to do with technology than with policy. As deliberate social actions, policies aim to create conditions that benefit certain groups or foster desired societal arrangements. A well-documented example of a policy-driven action to stimulate market growth is Open Government Data. There are also rules governing enterprise data.

Integration: Big Data integration deals with the combination of data from different sources and provides an integrated representation. Methods for data integration should meet the requirements of Big Data flows or search applications with high performance.

Correction: Big Data correction is responsible for identifying and fixing errors in Big Data. It is regarded as a major challenge due to the increasing volume, velocity and variety of data in many applications. Methods for Big Data correction should support inspection of data formats, completeness, rationality, and restrictions.

Redundancy Elimination: Big Data redundancy elimination is in charge of reducing the unnecessary data expense and defects that arise during transmission and storage, such as wasted storage space, data inconsistency, and reduced data reliability.
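
The sketch below gives a minimal, assumption-laden illustration of content-based redundancy elimination: records whose payloads hash to the same digest are stored only once. Real systems use chunk-level fingerprinting and distributed indexes; this shows only the core idea.

```python
# Minimal sketch of content-hash deduplication: identical payloads are stored once.
# The in-memory dict stands in for a real fingerprint index; records are illustrative.
import hashlib

def deduplicate(records):
    store, index = [], {}
    for payload in records:
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if digest not in index:           # first time this content is seen
            index[digest] = len(store)
            store.append(payload)
    return store

if __name__ == "__main__":
    raw = ["temp=21.5", "temp=21.5", "temp=22.0", "temp=21.5"]
    unique = deduplicate(raw)
    print(f"{len(raw)} records reduced to {len(unique)} after redundancy elimination")
```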

Modeling: The knowledge area of modeling includes six second-level knowledge areas: Big Data business understanding/insights, Big Data storage modeling, Big Data transaction modeling and versioning, modeling algorithms and evaluation, as well as machine learning and predictive models.

Big Data Business Understanding/Insights: This knowledge area is about understanding the business needs of Big Data, analyzing the factors affecting the business results, extracting the key business features and analysis dimensions, defining the scope of Big Data analysis, acquiring the meaning and characteristics of the data, and determining the goal of Big Data modeling.

Big Data Storage Modeling: This knowledge area deals with the modeling of structured, semi-structured and unstructured data, including storage models such as spreadsheets, documents, graphs, key-values, and so on, as well as analysis of the characteristics of and differences among the various data storage models.

Big Data Transaction Modeling and Versioning: This knowledge area involves the transaction modeling of Big Data, CAP theory, BASE principle, strong consistency and weak consistency principle, as well as data versioning management.

Modeling Algorithms and Evaluation: This knowledge area is about the selection of the appropriate algorithm to establish a data analysis model, the evaluation of the model, and the verification of the model, namely, whether the model meets the business objectives.

Machine Learning and Predictive Models: This knowledge area involves the main methods of machine learning and predictive models, such as the Decision Tree, the Extreme Learning Machine, and the Support Vector Machine.
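
As a hedged, minimal illustration of the predictive models named above, the sketch below trains a decision tree and a support vector machine with scikit-learn on a bundled toy dataset; the dataset and split parameters are illustrative, not part of BDOA.

```python
# Minimal sketch: training two of the predictive models mentioned above with scikit-learn.
# The iris toy dataset and the 70/30 split are illustrative choices only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("Decision Tree", DecisionTreeClassifier(max_depth=3)),
                    ("Support Vector Machine", SVC(kernel="rbf"))]:
    model.fit(X_train, y_train)                       # learn the predictive model
    print(name, "accuracy:", round(model.score(X_test, y_test), 3))
```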

Analytics: The area of analytics includes the following second-level knowledge areas: distributed computing principles, the batch processing model, stream computing architecture, Big Data indexing, the replication and recovery mechanism of Big Data, and interactive Big Data analytics.

Distributed Computing Principles: This knowledge area includes Big Data distributed computing principles, distributed resource management, scheduling mechanism and distributed algorithm.

Batch Processing Model: This part of the knowledge includes the computing model of large-scale data processing, as well as its application scenarios and limitations.

Stream Computing Architecture: The knowledge of stream computing architecture covers the design of an architecture for real-time stream data computing, real-time data storage models, real-time data processing methods and application scenarios.

Big Data Indexing: This part of the knowledge contains Big Data indexing algorithms, storage models and updating mechanisms.

Big Data Replication and Recovery Mechanism: This knowledge area involves the error recovery mechanism of Big Data, consistency management, and election algorithms.

Interactive Big Data Analytics: Interactive Big Data analytics are very flexible, intuitive and easy to control. The system conducts a question-and-answer, man-machine dialogue with the operating personnel.

This layer mainly takes care of preprocessing and processing of Big Data. It includes the technologies for online, near real-time and offline handling of data. The key technologies mainly come from the following open solutions: the R language (R Project, 2015), Apache Hadoop (Apache Hadoop, 2015), Apache Spark (Apache Spark, 2015), Apache Storm (Apache Storm, 2015), Apache Mahout (Mahout, 2015), Oryx (Cloudera, Oryx, 2015), Scikit-Learn (Scikit, 2015), Weka (Weka, 2015), Orange (Orange, 2015), RapidMiner (RapidMiner, 2015), KEEL (KEEL, 2015), SPMF (SPMF, 2015) and Rattle (Rattle, 2015).

The famous 'Hadoop' has been available for nearly a decade, during which a large number of distributed, reliable and scalable computing frameworks have been published and open-sourced. These newer systems are applied to different practical scenarios, though they are somewhat similar to the classical Hadoop cases. In spite of this, the differences can be identified mainly by two metrics: the data volume to be handled and the processing velocity.

However, the ecosystem of open source frameworks changes rapidly, so it is hard to claim which one is better or faster. Generally, the choice is determined by the practical scenario. For example, Spark offers Spark Streaming, while Apache Storm also has strong capabilities for handling streaming data. When selecting a platform (or technology), we choose the one whose properties and compatibility fit the task.
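
To make the batch-processing discussion concrete, the sketch below is a minimal word count in PySpark, the Python API of Apache Spark mentioned above; the local master setting and the input/output paths are placeholders.

```python
# Minimal sketch: a classic batch word count using PySpark (Apache Spark's Python API).
# 'local[*]' and the input/output paths are illustrative placeholders.
from pyspark import SparkContext

sc = SparkContext("local[*]", "BDOAWordCount")

counts = (sc.textFile("hdfs:///data/input.txt")        # read the raw text from storage
            .flatMap(lambda line: line.split())        # split each line into words
            .map(lambda word: (word, 1))               # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))          # aggregate counts per word

counts.saveAsTextFile("hdfs:///data/word_counts")      # persist results back to the DFS
sc.stop()
```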

2.3 LAYER THREE: BIG DATA SERVICES

This layer focuses on enabling the above-mentioned resources, such as storage, computing, analytics and visualization resources (the Model in MVC), and business logic (the Controller in MVC), to be turned into available on-demand services (the View in MVC) based on SOA and cloud computing techniques. At the same time, this layer focuses on providing facilities that deliver Big Data services from service providers to service consumers, thus boosting the development of an API economy.

Big Data Service Enablement: It includes four sub-ABBs: storage enablement service, computing enablement service, analytics enablement service and visualization enablement service.

Storage Enablement Service: It focuses on enabling Big Data storage resources as on-demand services for those who use the service to store huge amounts of multi-source heterogeneous data. Compared with building their own storage infrastructures, taking advantage of a Big Data storage service lowers the difficulty of Big Data storage; consequently, users do not need to do anything about it except pay for their usage.

Computing Enablement Service: It focuses on turning computing resources or cluster computing resources into Big Data computing services. The Big Data computing service mentioned here has obvious differences from, as well as relationships with, the concept of an Elastic Compute Service (ECS) in cloud computing: the latter always provides services as virtual machines, while the former provides a pure computing framework or other environments for Big Data analysis and parallel processing. It should be noted that the computing framework or environment can be built on top of an ECS.

Analytics Enablement Service: It focuses on establishing analytics services for Big Data. Based on Big Data storage resources/services and Big Data computing resources/services, the Big Data analytics service provides analytical algorithms for users to mine their Big Data, turning dense Big Data of low value density into high-value information. Because purposes differ, there may be components such as model and algorithm bases to enable various forms of analytical algorithm provisioning. In practice, such services are often open to the public, that is, end-users are allowed to use the data analytics services for free.

Visualization Enablement Service: It focuses on providing visualization services that render the useful information mined from Big Data and display it in a human-friendly way. The uses of the visualization service include intuitive visual insight, interactive visual analysis and others. A suitable interactive visualization engine may be required for this enablement.

Big Data Service Provisioning: It includes three sub-ABBs, service provisioning platform, API gateway and API marketplace.


Service Provisioning Platform: It focuses on delivering Big Data services from Big Data service providers to Big Data service consumers. It provides the capabilities of designing, implementing, securing, lifecycle-managing, publishing and engaging Big Data services. There are generally three models of Big Data service: the API model, the website model, and the hybrid model. Users can use the APIs of Big Data services to store, access, compute and analyze data, and can use the website to visually manage the Big Data services.

API Gateway: The API gateway is a type of middleware for Big Data service provisioning that is used to manage API traffic and to monitor and analyze API usage for Big Data services. It supports protocol transformation, data transformation, API composition and the delivery of operational dashboards covering API availability, response times, successes, and errors. It can react in real time to API call patterns, for example, detecting SLA breaches or trends in usage. It optimizes SLA definitions and monetization models, and it can also track key developers' organizations.
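
A minimal, assumption-laden sketch of the gateway role described above is shown below: a small Flask application that forwards calls to a backend Big Data service, meters per-client usage, and records response times. The backend URL, the 'X-Api-Key' header and the quota limit are hypothetical choices, not part of BDOA.

```python
# Minimal sketch of an API-gateway front end: usage metering, a crude quota check,
# and response-time logging in front of a backend Big Data service.
import time
from collections import Counter

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
BACKEND_URL = "http://analytics-service.internal:8080"   # placeholder backend
usage = Counter()                                         # per-API-key call counts

@app.route("/api/<path:endpoint>", methods=["GET"])
def proxy(endpoint):
    api_key = request.headers.get("X-Api-Key", "anonymous")
    if usage[api_key] >= 100:                             # toy quota / SLA check
        return jsonify(error="quota exceeded"), 429
    usage[api_key] += 1

    start = time.time()
    resp = requests.get(f"{BACKEND_URL}/{endpoint}", params=request.args, timeout=10)
    elapsed = time.time() - start
    app.logger.info("key=%s endpoint=%s status=%s time=%.3fs",
                    api_key, endpoint, resp.status_code, elapsed)
    return resp.content, resp.status_code, {"Content-Type": resp.headers.get("Content-Type", "application/json")}

if __name__ == "__main__":
    app.run(port=8000)
```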

API Marketplace: The API marketplace creates a store of all available APIs for both Big Data service providers and consumers. An API marketplace is similar to popular application stores: Big Data service providers can publish, price and sell their services through APIs, while Big Data service consumers can browse and search APIs by provider, tag, or name, subscribe to APIs and manage subscriptions on a per-application basis.

Big Data itself can be offered as a service, namely DaaS (Data as a Service). In this layer, not only can the data itself be provided; other resources such as storage, computing, analytics, virtualization and business logic can also be turned into on-demand services. Leveraging different types of services to form a services marketplace is aimed at boosting the development of the API economy. The key technologies can be mapped to open solutions such as WSO2 (WSO, 2015) and Mashape (Mashape, 2015).

2.4 LAYER FOUR: BIG DATA ENABLED BUSINESS PROCESS

Faced with economic globalization in the era of the Internet, more and more modern enterprises attach importance to business processes. They build their own business process models, execution models and deployment models, and mine the related real-time and historical Big Data (including logs, documents and social media) to optimize domain-specific business processes. Hence, there is an urgent demand for systematic and structured methods to manage, analyze, control and improve Big Data-enabled business processes, leading to better quality of products and services.

Big Data Enabled Scenarios: There are two types of business scenarios. The first is to leverage Big Data to enrich a business process so as to increase the efficiency and effectiveness of business operations; for example, applying Big Data analytics to marketing can transform regular marketing into digital marketing. The second is to leverage business processes to manage the lifecycle of Big Data in a specific business scenario based on the various laws, rules, and policies defined in Layer 9.

Big Data boosts business processes from different aspects and scenarios. Nowadays, traditional business integrates Internet or Big Data, creating several cross-domain business areas.

Big Data Enabled Business Process Architecture: Defining the architecture of Big Data-enabled business processes is an important step in determining how data-driven process management is organized and designed. Put simply, the Big Data-enabled business process architecture of an enterprise involves the types of processes it contains and the relationships among them.

Big Data Enabled Business Process Management: It includes business process life-cycle management, process variation management, compliance management, business intelligence and process knowledge management, as well as process data repository mining.

Business Process Life-Cycle Management: Although life-cycle management is not a new topic in business processes, what we highlight is the impact of Big Data on managing the entire life-cycle of a business process, including design, modeling, execution, and analysis. In the life-cycle of a Big Data-enabled business process, there are four core topics to be discussed, namely variation management, compliance management, knowledge management, and repository mining.

Process Variation Management: Process variation management is responsible for all changes to a process and its components, such as addition, deletion and update, as well as for their evolutionary information over time. It ensures that business processes can evolve in an agile and flexible manner.

Compliance Management: Compliance management contains a set of policies, processes, structures and tools. It partly covers internal control and risk management, and ultimately helps identify and manage risks so as to meet enterprise targets and goals. With the boost of Big Data, such management can have data tools, policies, processes and structures embedded, and can obtain both a detailed and a general view of the data risk-control process, thus reducing the risks and associated costs of pursuing business interests.

Business Intelligence and Knowledge Management: The primary goal of Big Data-enabled business processes is to enhance the efficiency and effectiveness of execution by means of business intelligence and knowledge work. This includes two parts: on the one hand, it enables the enterprise to dynamically adjust business strategies according to the execution of Big Data-enabled business processes; on the other hand, it can improve existing business processes according to the strategies derived from Big Data analytics. Business intelligence is a mature area and there are a number of open source BI tools, including Spagobi (Spagobi, 2015), Birt (Birt, 2015), KNIME (KNIME, 2015), and the R language (R Project, 2015).

Process Data Repository Mining: Process data repository mining is an interesting and promising topic that aims at extracting valuable information or knowledge to meet the requirements of domain-specific business, using data mining and machine learning techniques, so as to improve user experience and satisfaction.

2.5 LAYER FIVE: BIG DATA INTERACTIONS

The Big Data interaction layer initially imports Big Data from the outside world and finally delivers the analytic results. On this layer, the interacting roles can be human beings, computers and devices. People make use of the methodologies of Big Data interactions, obtain and analyze interaction requirements, design and implement the visualization of Big Data, and finally present the analytic results and data effects with the help of Big Data devices.

The Methodology of Big Data Interactions: The methodology of Big Data interactions systematically describes the versatile design approaches used in the interaction design. To be specific, it includes the target of interaction design, the acquisition of requirement, interaction design methods, process of interaction design, prototyping, design patterns, usability evaluation and related subjects of interaction design.

Target of Interaction Design: Big Data interaction can be interpreted and defined as target-oriented interaction design. Different interaction designs may have slightly different definitions, and therefore the sub-goals of interaction design can differ across systems. Nevertheless, the ultimate target of Big Data interaction, which usually remains unchanged, is to satisfy users' requirements as fully as possible and to help users achieve their goals. It is necessary to emphasize that interaction design can only partially meet these targets.

Interaction Design Process: It includes the knowledge of each design step: requirement analysis – interact with users and find out the goals, scope, definitions and functionality of the new system; design – convert the user requirements into concrete interaction requirements as well as concrete designs of user interfaces; assessment – search for defects in every single step of the product design and interaction design, aiming to improve the process of Big Data interaction. Interaction design methods include: 1) UCD (User-Centered Design) – always focusing on users' ultimate targets and requirements; UCD assumes users fully understand their own requirements and therefore designers have to completely follow users' targets; 2) ACD (Activity-Centered Design) – designers focus on concrete activities as well as tasks and then draw out the interactions that are consistent with the behaviors of the activities; 3) System Design – designers draw out the interactions following the details of each component in the system; 4) Genius Design – designers draw out the interactions according to their own experiences and inspirations.

Prototyping: Prototyping is the process of converting abstract design concepts into concrete descriptions. It includes the methodologies of paper-based prototyping and wireframe-based prototyping.

Interaction Design Patterns: Interaction design patterns are sets of feasible and reusable principles, templates and solutions to commonly occurring interaction problems within given contexts of interaction design. Frequently used design patterns include registration forms, progress indicators, search boxes, views of search results, scoring, reviewing, and help.

Usability Evaluation: Usability evaluation is used to evaluate whether the interaction design meets the design goals. ISO defines usability as “The extent to which a product can be used by specific users to achieve their goals with effectiveness, efficiency and satisfaction in a specific context of use.” Usability evaluation methods include usability testing, heuristic evaluation, cognitive walkthrough and action analysis.

Big Data Interaction Requirement: The analysis of Big Data interaction requirement is the process of problem analysis and problem definition. With the analysis on the interaction requirement, the contents of input/output data, and the form of input/output data will be defined. In general, the contents of requirements can be classified as functional requirement and non-functional requirement.

Functional Requirement: Functional requirements define the must-have behaviors and functionalities in the system. They include the contents of the functional requirements themselves and the corresponding methodology of requirement analysis. Functional requirements include but are not limited to functionality requirements, interface requirements and data requirements.

Non-Functional Requirement: Non-functional requirements describe those runtime requirements that are not in the category of functional requirements. They include the contents of the non-functional requirements themselves and the corresponding methodology of requirement analysis. Non-functional requirements include but are not limited to quality requirements, security requirements, cost requirements, documentation requirements, performance requirements, resource requirements, environment requirements as well as human and natural factors.

Big Data Interaction Device: The Big Data interaction device is the final and direct connector between the Big Data ecosystem and humans, and also the connector between the ecosystem and devices. Its main knowledge areas include hardware design, software design, and embedded-system design related topics. The forms of interaction are mainly input to devices or output from devices.

Interaction Hardware: Interaction devices include but are not limited to personal computers and their peripherals, mobile phones, wearable devices, augmented reality devices, virtual reality devices, image scanning devices, 3D information input devices, digital paper and other programmable clients.

Application of Interaction Devices: Big Data interaction devices can be categorized according to their applications, for example, sports and fitness, healthcare and wellness, security and prevention, gaming and lifestyle.

Big Data Visualization: Big Data visualization is the process of converting data sets into graphical views. The main goal of Big Data visualization is to communicate with end-users efficiently and clearly via selected graphics such as figures and tables.


Visualization Techniques: The basic concepts of Big Data visualization techniques include: 1) dataspace – a multi-dimensional information space constructed by multi-dimensional attributes and multiple elements; 2) data development – quantitative derivation and calculation of data sets; 3) data visualization – graphical presentation of Big Data sets; 4) data analysis – to view the data sets from different aspects and angles, by profiling multi-dimensional data via visualization tools and algorithms.
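
The tools named later in this section (D3js, ECharts, Gnuplot, the R language) are common choices for Big Data visualization; purely as a hedged, minimal illustration of the 'data visualization' step above, the sketch below renders a small multi-dimensional data set as a chart with Python's matplotlib. The data values are invented.

```python
# Minimal sketch: graphical presentation of a small multi-dimensional data set.
# matplotlib stands in here for the visualization engines named in the text;
# the sensor readings below are invented for illustration.
import matplotlib.pyplot as plt

hours = list(range(24))
temperature = [18 + 0.5 * h if h < 12 else 24 - 0.3 * (h - 12) for h in hours]
humidity = [60 - 0.8 * h if h < 12 else 50 + 0.4 * (h - 12) for h in hours]

fig, ax1 = plt.subplots()
ax1.plot(hours, temperature, label="temperature (°C)")
ax1.set_xlabel("hour of day")
ax1.set_ylabel("temperature (°C)")

ax2 = ax1.twinx()                       # second axis: another dimension of the data space
ax2.plot(hours, humidity, linestyle="--", label="humidity (%)")
ax2.set_ylabel("humidity (%)")

fig.suptitle("Device-generated data rendered as a graphical view")
fig.savefig("sensor_view.png")          # export for a report or dashboard
```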

Visualization Tooling: Visualization tooling mainly includes the study and development of four features of visualization tools: usability, real-time capability, expressiveness and compatibility.

This layer mainly includes four parts: methodology, requirement analysis, physical devices and visualization. It introduces several technologies that are not strongly or originally linked to the word 'data', for example, the ideas of interaction design. We believe this outer layer is a key innovation, since user experience is now considered one of the most important factors for mobile applications. Interestingly, visualization interfaces are now treated as the significant 'access point' of an application, including Big Data applications. We would like to highlight some useful open source tools that readers can use to establish their own BDOA-based interaction layer: D3js (D3js, 2015), ECharts (ECharts, 2015), Gnuplot (Gnuplot, 2015) and the R language (R project, 2015).

In addition to traditional visualization, there is a new trend named 'agile visualization', which tries to use computing resources to surpass the mind and cognitive faculties of the human brain. Agile visualization aims at crafting the presentation of data in an extremely short production cycle. Table 1 compares agile visualization with traditional visualization.

Metrics | Agile Visualization | Traditional Visualization
Installation | Quick; the installation package is small and sometimes there is no installation process | Slow; the installation package is large and software installation is required
Product form | One tool from one tool provider | A series of tools from different tool providers
Data source | Flexible types of inputs | Database, data warehouse, CUBE file
Modeling process | Not based on Cubes: extract data fields from data sources, then combine them flexibly | Based on Cubes: ETL from data sources, then generate OLAP Cubes for presentation
Performance | High, because it is based on a distributed architecture | Sometimes very slow
Flexibility | High; data fields can be added or deleted easily | Low, because building a Cube for OLAP is time-consuming and not very agile
Implementation cycle | Short, sometimes weeks | Long, sometimes months
Risk | Low | Medium

Table 1. Comparison between Traditional Visualization and Agile Visualization

2.6 LAYER SIX: BIG DATA SOURCES & INTEGRATION

Big Data has several, and sometimes unlimited, data sources, where the data can be structured, semi-structured or unstructured. Big Data sources provide the most fundamental inputs and elements for later production as well as business insights. The sources include but are not limited to user-generated data, device-generated data and program-generated data. The integration layer synthesizes and assembles all kinds of data sources, technology elements and business elements. According to the performance and period of data generation, Big Data can also be categorized as real-time stream data, near real-time data and offline data.

Big Data Sources: Data can originate from users, devices and programs.

User-Generated Data: User-generated data can be in the form of text written in e-mails, short messages, blogs, tweets, word-processing documents, and so on. Music, video and symbols are other kinds of user-generated data. However, the latter sort of unstructured data still lacks effective processing and data mining methods due to its huge quantity and sparse structure.

Device-Generated Data: Device-generated data can be in the form of sensor data, such as temperature, humidity, precipitation, light intensity, power, current, voltage, force, density level, speed, pressure, quantity of heat, shake, position, angle, displacement, distance and acceleration. It can also be in the form of encrypted or encapsulated data packages, such as the protocol data generated by smart devices. Digital devices often generate structured data, whereas analog devices often generate unstructured data. Semi-structured data can come from any kind of data source.

Program-Generated Data: Software/hardware programs can continuously generate data, most of which is structured, since it has to be interfaced to other APIs and I/Os. The types of programs include but are not limited to relational databases, data warehouses, runtime loggers and software test tools.

Although Big Data has the properties of 'Variety' and 'Volume', any type of data can be classified into the user-generated, device-generated or program-generated types. These three key types of data sources cover most applications.

In addition, as described above, the roles in the Big Data ecosystem sometimes change frequently and rapidly: a data consumer can also be a data provider/producer. Since the boundary between user and application is becoming obscure, in the future it may be hard to claim which is user-generated data and which is program-generated data. Nevertheless, the data are all generated from the three key types of the Big Data ecosystem.

Figure 4. Messaging System using Kafka as an Example

Figure 5. Data Aggregation in Big Data era

Big Data Integration: This vertical layer collects data from the technology-category layers of Big Data Infrastructure and Big Data Processing Components. It also collects data from the business-category layers of Big Data Services, Big Data Enabled Business Process and Big Data Interactions.

The Big Data integration layer enables data transmission, transformation, routing and protocol conversion between data providers and data consumers. Vast data sources provide distinct types of data formats, offered either in real time or incrementally. The data is then integrated into different types of storage, including structured, semi-structured and unstructured storage systems. Different layers connect to, transmit to, and retrieve data from the integration layer through qualified adapters, connectors, filters, APIs, and services. Data transmission can go through accelerators that help speed up the connection performance among interfaces. The mixed types of data stored in the integration layer can be indexed, materialized and aggregated, thus accelerating data searching and retrieval.

Data Warehouse (DW): The data warehouse is the central data repository for integrating data from massive data sources. It collects data from all channels of the system, processes and analyzes the data, and finally delivers the data as analytical reports obtained through interaction with service APIs. Aiming to support enterprise-level strategies and decisions, the DW provides guidance to business intelligence users, together with business process improvement and several monitoring services (cost/quality/time).

Unlike a modern Big Data SDDC, the DW is identified as a somewhat traditional technology, in which traditional statistical techniques are more likely to be applied.

This layer enables data transmission, transformation, routing and protocol conversion between data providers and data consumers. The bridge between data providers and consumers can be realized as a component called a 'messaging system'.

A typical messaging system helps deliver the logs of user-generated, device-generated and program-generated sub-systems to the data warehouse or the target data consumer. Well-known open source messaging systems include but are not limited to Apache Kafka (Apache Kafka, 2015), RabbitMQ (RabbitMQ, 2015), Apache ActiveMQ (Apache ActiveMQ, 2015), FastMQ (FastMQ, 2015) and ZeroMQ (ZeroMQ, 2015). The dataflow of Apache Kafka is shown in Figure 4. Sometimes these kinds of systems require reliable distributed coordination tools, such as Apache Zookeeper (Apache Zookeeper, 2015). In addition, there are tools helpful for data transfer and aggregation, including Apache Sqoop (Apache Sqoop, 2015), Apache Flume (Apache Flume, 2015) and Apache Chukwa (Apache Chukwa, 2015).
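
As a hedged, minimal sketch of the producer/consumer dataflow that Figure 4 depicts, the snippet below uses the kafka-python client to publish device-generated readings to a topic and read them back; the broker address and topic name are placeholders.

```python
# Minimal sketch of a Kafka-based messaging bridge between a data producer and consumer.
# The broker address 'localhost:9092' and topic 'device-readings' are illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: a device- or program-generated record is published to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("device-readings", {"sensor": "temp-01", "value": 21.5})
producer.flush()

# Consumer side: a downstream store (e.g., the data warehouse loader) reads the stream.
consumer = KafkaConsumer("device-readings",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         value_deserializer=lambda b: json.loads(b.decode("utf-8")),
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.topic, message.value)
```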

In addition to the messaging system, structured data and unstructured data can be aggregated to obtain value-added structured data for real-world applications. As shown in Figure 5, data from three types of sources can be involved and then searched, for example, document-oriented databases, Internet of Things (IoT) platforms and traditional RDBMS systems. Nowadays, these three types of sources often support applications directly, in the form of OLTP and OLAP. In the future, we believe these types of data can be integrated and aggregated into the next generation of databases.

2.7 LAYER SEVEN: BIG DATA SECURITY & PRIVACY

The issues of Security and Privacy (SP) are pervasive throughout the open architecture. A design that takes SP into account solves many issues, such as account breaches (as seen in 2014) and data leakage. The CIA security principle still holds in this new Big Data era and should include confidentiality, integrity, availability, authorization, authentication, and even non-repudiation. Encryption and decryption techniques from cryptography lay the foundation for Big Data SP. Viruses and their transmission mechanisms remain valid concerns in the present world. The SP layer is closely related to the layer of data management and governance.

Security Issues: Handling security issues in the Big Data era is much harder than ever before. In the Big Data circumstance, as depicted in Layer 6, the data comes from almost everywhere. A user can easily control user-generated data sources, but it is infeasible to control machine-generated data as well as program-generated data. Cloud vendors often play the roles of both Big Data generator and Big Data manager, which makes the data opaque to Big Data users and consumers, thus leading to security issues. In general, those issues can be categorized as data trustworthiness, data access control and privacy protection.

Data Trustworthiness: Message authentication and digital signature methods have played important roles in data protection at a coarse granularity. However, when handling data at a finer granularity, these traditional techniques can only partly verify completeness and can rarely establish trustworthiness. The knowledge of data trustworthiness mainly includes: the issues brought by data distortion during data transmission; the issues brought by man-made errors in the data measurement and data collection process; the issues brought by fraudulent data introduced by humans into the massive data pool; and the issues brought by historical data that is impossible to trace.
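
As a hedged illustration of the coarse-grained message-authentication technique mentioned above, the sketch below signs a record with an HMAC and verifies it on receipt using Python's standard hmac and hashlib modules; the shared key and record content are placeholders.

```python
# Minimal sketch: message authentication of a transmitted record via HMAC-SHA256.
# The shared secret and the record are placeholders; key distribution is out of scope.
import hashlib
import hmac

SECRET_KEY = b"shared-secret-key"        # illustrative only; never hard-code real keys

def sign(record: bytes) -> str:
    return hmac.new(SECRET_KEY, record, hashlib.sha256).hexdigest()

def verify(record: bytes, tag: str) -> bool:
    return hmac.compare_digest(sign(record), tag)   # constant-time comparison

record = b'{"sensor": "temp-01", "value": 21.5}'
tag = sign(record)

print(verify(record, tag))                                     # True: record is authentic
print(verify(b'{"sensor": "temp-01", "value": 99.9}', tag))    # False: record was tampered with
```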

Data Access Control: Completely fencing unauthenticated data consumers away from forbidden data domains is impractical, particularly in the Big Data circumstance.

Privacy Protection: Privacy protection is not simply the process of destroying or removing user identifiers. Typical issues of privacy protection include the anonymization/abstraction/deletion of user identifiers, the anonymization of user locations and the anonymization of user identity linkage.

Privacy Protection Technologies: Privacy protection techniques have developed rapidly in recent years, but they are still far from satisfactory due to the 4V features of Big Data. In general, the areas researchers work in include anonymization-based privacy preserving techniques, digital watermarking, data provenance, access control and role mining.

Anonymization Privacy Preserving: The knowledge body of anonymization-based privacy preserving techniques includes techniques used for data publishing, typically the well-known k-member clustering problem, and social network (SNS) privacy control. SNS privacy control techniques mainly manipulate the vertices and edges of graphs, where an edge represents the linkage between user identifiers. It is important to emphasize that even as anonymization techniques develop, private data can still be predicted from seemingly irrelevant user attributes.
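
The sketch below is a toy, assumption-laden illustration of the anonymization idea behind techniques such as k-member clustering: quasi-identifiers are generalized (ages into bands, ZIP codes truncated) and any group smaller than k is suppressed. Real anonymization algorithms are far more sophisticated; the records and the value of k are invented.

```python
# Toy sketch of k-anonymity-style generalization: coarsen quasi-identifiers, then
# suppress any equivalence class with fewer than K records. Data and K are illustrative.
from collections import defaultdict

K = 2
records = [
    {"age": 34, "zip": "94301", "diagnosis": "flu"},
    {"age": 36, "zip": "94305", "diagnosis": "cold"},
    {"age": 52, "zip": "10001", "diagnosis": "asthma"},
    {"age": 57, "zip": "10002", "diagnosis": "flu"},
    {"age": 23, "zip": "60601", "diagnosis": "cold"},
]

def generalize(r):
    # Age is coarsened to a 10-year band, ZIP truncated to its first three digits.
    low = (r["age"] // 10) * 10
    return {"age": f"{low}-{low + 9}", "zip": r["zip"][:3] + "**", "diagnosis": r["diagnosis"]}

groups = defaultdict(list)
for r in map(generalize, records):
    groups[(r["age"], r["zip"])].append(r)

# Keep only equivalence classes that contain at least K records; suppress the rest.
released = [r for group in groups.values() if len(group) >= K for r in group]
print(f"released {len(released)} of {len(records)} records under {K}-anonymity")
```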

Digital Watermarking: Digital watermarking can be used to identify the ownership, often copyright information, of a signal. The key to digital watermarking is the utilization of "noise" in noise-tolerant signals. However, in the era of Big Data, watermarking exposes more and more shortcomings, in particular when handling the vast "dynamic" data generated by in-memory, iteration-like MapReduce algorithms/systems.

Data Provenance: Sometimes, in the process of data analysis, results are presentable if and only if the source of the data can be identified, especially for Big Data from scientific research. Most results must be reproducible if the same inputs are available. Data provenance is not new; it has been deeply researched in database systems. Typical methods include but are not limited to annotation methods, schema transformation pathways, graph-based or query-based methods, and bit-vector-based methods.

Role Mining: Role-Based Access Control (RBAC) can be considered a strong candidate for replacing traditional access control systems in the Big Data era. In RBAC, a set of permissions is assigned to a set of users; therefore, a user-to-resource assignment is built intrinsically. A system user obtains the pre-defined permissions once his/her role is defined. RBAC relies on a user provisioning system. There are three approaches to role mining: top-down, bottom-up, and by example.

2.8 LAYER EIGHT: BIG DATA QOS

In the era of Big Data, Quality of Service (QoS) is becoming increasingly significant due to its ability to provide different priorities to different applications and users, as well as to guarantee a certain level of performance for a data flow. Big Data Quality of Service (QoS) includes the QoS model, QoS requirements, the service contract, service contract assurance techniques, and service contract exception handling.

QoS Model: The QoS model focuses on defining the quality of a Big Data infrastructure, service or application. A QoS model includes various specified metrics: availability, reliability, response time, scalability, capacity and others.

Availability: The proportion of time during which a Big Data infrastructure, service or application is in a functioning condition.

Reliability: The ability of a Big Data infrastructure, service or application to perform the required functions under specified conditions for a given period of time.

Response Time: The time a Big Data infrastructure, service and application takes to react to a given request.

Scalability: The ability of a Big Data infrastructure, service and application to handle a growing amount of work in a capable manner.

Capacity: The maximum volume of data or workload that a Big Data infrastructure, service or application can handle.
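As a simple illustration of how the first metrics above might be computed from monitoring logs (a sketch on our part, not part of BDOA), the snippet below derives availability and average response time from hypothetical probe records:

# Hypothetical probe log: (timestamp, request succeeded, response time in ms)
probes = [(1, True, 120), (2, True, 95), (3, False, None), (4, True, 110)]

availability = sum(ok for _, ok, _ in probes) / len(probes)   # proportion of functioning probes
latencies = [t for _, ok, t in probes if ok]
avg_response_time = sum(latencies) / len(latencies)           # milliseconds

print("availability=%.2f, response_time=%.1f ms" % (availability, avg_response_time))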

QoS Requirement: QoS requirement focuses on quantitatively specifying user requirements on Big Data QoS, including elicitation, specification, and management.

Elicitation: The discovery of the QoS requirements of Big Data service users.

Specification: The representation of the QoS requirements in a well-organized way.

Management: It is necessary to manage the QoS requirements and their changes, since requirements may change drastically over the Big Data lifecycle.

Service Contract: The QoS model represents the quality of various Big Data services, while QoS requirements enable users to express their specific expectations of Big Data services. Aiming to connect the Big Data provider and the Big Data user, a service contract is a contractual agreement on the quality of the Big Data service to be provided by a provider to a user. A service contract involves negotiation, QoS objectives, and representation.

Negotiation: Negotiation means that an agreement between the Big Data infrastructure, service or application provider and the user is established tentatively.

QoS Objective: QoS objective specifies the users’ requirements on different QoS properties.

Representation: Representation focuses on demonstrating the service contract in an easily understandable way.

Service Contract Assurance Techniques: Service contract assurance techniques refer to various techniques that can guarantee the quality of Big Data infrastructures, services or applications, including QoS monitoring, QoS prediction, and dynamic resource allocation.

QoS Monitoring: It monitors the quality of Big Data infrastructures, services and applications.

QoS Prediction: It predicts unknown QoS values from historical QoS observations.

Dynamic Resource Allocation: It reallocates Big Data related resources at runtime to optimize the service quality.
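A hedged sketch combining the last two techniques: the unknown next QoS value is predicted with a simple moving average over historical observations, and reallocation is triggered when the prediction breaches a service-level threshold. The window size and threshold are illustrative; real systems often use far more sophisticated predictors, such as collaborative filtering.

def predict_next(history, window=3):
    # Predict the next QoS value as the moving average of recent observations.
    recent = history[-window:]
    return sum(recent) / len(recent)

def allocation_decision(response_times_ms, slo_ms=200):
    # Scale out proactively when the predicted response time violates the SLO.
    return "scale_out" if predict_next(response_times_ms) > slo_ms else "keep"

history = [150, 180, 210, 240]        # hypothetical response times in ms
print(allocation_decision(history))   # "scale_out": the predicted 210 ms exceeds the 200 ms SLO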

Service Contract Exception Handling: When the service contract is breached, necessary actions should be taken in time. Service contract exception handling involves different techniques for handling such exceptions, including exception documentation, fault recovery, and abnormal behavior modeling.

Exception Documentation: It records detailed information about the different types of exceptions.

Fault Recovery: It attempts to recover the Big Data service after an exception occurs.

Metrics and Quality Models for Abnormal Behavior: Abnormal behavior modeling characterizes abnormal behaviors in order to improve the quality of Big Data infrastructures, services and applications.

Quality of Service (QoS), as a well-known concept, has been studied for a very long time and extended to various disciplines. Traditional metrics such as availability, reliability, response time, scalability and capacity can be directly applied to Big Data services.


A newer concern is dynamic resource allocation. Cloud computing enables more elastic computing capabilities for Big Data analysis and services, and users can now exploit on-demand services and resources more dynamically than ever before. This requires better QoS handling for those dynamic components.

2.9 LAYER NINE: BIG DATA ECOSYSTEM GOVERNANCE & MANAGEMENT

Big Data and complex analytics can drive decision-making across all sectors of government, industry, and academia. The Ecosystem Governance and Management layer aims to provide persistent and measurable business value for business partners in the ecosystem. It is built upon a mutually agreed and accepted consensus on definitions, classifications, the reference architecture, business and technology roadmaps, policies and management processes. Such a layer should facilitate service delivery that is independent of vendor, technology, and infrastructure; enable partners in the ecosystem to choose the best-fit infrastructure, platform, and analytical technologies and tools for their own processing and application of the gained intelligence; and obtain increased value from the data stream between the Big Data service provider and the stakeholders in a cohesive and secure manner.

It covers all aspects of effective life-cycle management, the evaluation of how well policies and processes work, and win-win collaboration among all partners in the Big Data ecosystem. Figure 6 presents an overview of this layer.

Figure 6. Breakdown of the ABBs

Metadata Management: Different organizations, such as government, industry, and academia, could potentially collaborate on the Big Data ecosystem to obtain mutual benefits. Therefore, it will be critical to reach consensus and agreement on fundamental definitions, contexts, and the scope of various terminologies across many disciplines, and to create a complete and effective metadata management strategy and overall system architecture. A fine metadata model with adequate system support will become a necessity. The metadata will cover the commonly accepted definition, classification, reference architecture (including security and privacy, inter-organization data policies and management processes), and business and technology roadmap for Big Data itself. In addition, metadata requires its own management strategy, as well as policies and processes for effective management.

Business Alignment: Adequate support from all the stakeholders is of great significance. Accordingly, the value proposition of the Big Data practice needs to be prepared and presented so that proper expectations can be set and achieved. Since many business processes will be highly affected by the intelligence obtained from Big Data, a re-alignment is necessary. Such an action requires executive sponsorship and possibly the re-grouping of many departments. Big Data practitioners need to communicate periodically with top corporate executives to ensure that the governance plan gets persistent and high-quality support from all the concerned parties. In addition, a business and technology roadmap for adopting and rolling out different Big Data technologies should be created, reviewed and accepted.

Execution: To guarantee the smooth transition and proper governance of the Big Data program, organizations need to continuously monitor the accuracy and effectiveness of their analytics, estimation and prediction, as well as security and privacy. In addition, the data and analytics applied to or derived from the Big Data program all have their own lifecycles, which means they may only be effective or valid at a certain time and place. Therefore, they should be closely monitored, with preventive or alternative actions taken when they become invalid. To continuously improve Big Data utilization and effectiveness, organizations need to use acceptable metrics to assess the effectiveness and maturity of the various Big Data related services, such as Big Data enabled business processes.

Inter-Organization Collaboration: As discussed before, the Big Data ecosystem is most likely to involve many other organizations. Anyone who wants to get the best bang for the buck in the global Big Data environment should profile the potential partners well. All the chosen partners are required to sign sufficiently detailed agreements on consequential business and monetary rewards or penalties. As a result, a win-win collaboration can be realized with various solution policies and processes.

Rules and Policies: It is crucial for the Big Data program to establish a set of rules and policies covering its various aspects, especially data privacy, inter-organizational protection, consumption and optimization, and the quality of the data and derived analytics. These rules and policies not only inherit and extend common, traditional data processing, but also serve as the foundation of Big Data analysis and application, thus bridging the gap between the two. They provide guidance for assessment, management and decision-making across all aspects of the Big Data program, including capacity, performance, security and monitoring. The value of Big Data could be challenged because of several serious concerns. For example, the privacy of personal information and decision-making power may be infringed. The growing technological capability to store and analyze data may lead to discrimination and a failure to preserve fairness in an economic and political environment. The data-driven "personalization" fostered in an increasingly fragmented society triggers concerns about democratic values, accountability, and social cohesion. Specific rules, policies and processes at the regulatory, professional, and organizational levels are needed to ensure that all the mentioned issues can be properly and fairly addressed.

Metrics and Dashboards: To facilitate the adoption and acceptance of Big Data technologies, a set of metrics needs to be clearly identified, reviewed and accepted by all stakeholders. The metrics should be based on the specification and scope of the business domain, the maturity assessment, as well as the business and technology roadmaps governing the overall Big Data program. They should also be aligned with the expectations of the sponsors.

Figure 7. Enterprise Cloud Drive Architecture mapping to the Open Architecture

Finally, as Big Data definitely introduces many business and operational changes, stakeholders at all levels should gain a better and even complete understanding of the rules and policies that directly reflect these changes. Stakeholders can then evaluate their impact on the business, for example, whether the data and the derived analytics are sufficient to support decision-making, and if not, what alternatives could be invoked. Unified dashboards presenting all or most of the above-mentioned information, with proper visualization support and corresponding actions, will greatly facilitate any Big Data program.

Since the volume of data has increased rapidly in the last ten years, the volume of enterprise data will soon reach the EB level. Under this circumstance, enterprises have to confront many complex issues, including serious information silos, data security and privacy, as well as demanding data governance. In fact, the Big Data ecosystem itself is very complicated. However, solving the related issues can be less difficult because the key controlling point is metadata management. There are some industrial metadata management systems, such as ModeShape (ModeShape, 2015) and Pentaho (Pentaho, 2015). Moreover, the well-known standard for metadata management is the Common Warehouse Metamodel (CWM, 2015).

3. CASE STUDY

3.1 BDOA-BASED CLOUD DRIVE

In this section, we show an industrial product whose architecture maps to BDOA; the architecture of the system is shown in Figure 7. The product is an enterprise cloud drive for storing working documents and log files, as well as for analyzing value-added data and many other types of data generated from daily enterprise scenarios.

On the one hand, the cloud drive realizes the overall architecture and each layer of BDOA. On the other hand, the cloud drive extends the open architecture with some additional components. These extensions still rely on the BDOA, but add components to satisfy Non-Functional Requirements (NFRs).

The cloud drive is named KDrive (Kingdee, 2015). As mentioned previously, its main purpose is to store document files. As shown in Figure 7, Layer 1 contains components of a hybrid cloud and a legacy data system. The purpose of the hybrid cloud is to adapt some legacy systems as required. Since relational databases are common to every cloud-based system, the cloud drive uses the parallel database system MySQL (MySQL, 2015). A NoSQL system (MongoDB, 2015) is used for storing the first-level (online) user behaviors for later analysis. The second-level (offline) user behaviors are transmitted to the distributed file system, Hadoop DFS (Apache Hadoop, 2015), for data mining. Document files are also stored in the DFS, because its built-in replication scheme provides better safety. Nowadays, virtualization is a must-have feature in every cloud. KDrive applies both system-level virtualization with KVM (KVM, 2015) and the lightweight container technique (Docker, 2015). As mentioned in BDOA, the main advantages of using multi-level virtualization are economic efficiency and better resource isolation. In industrial practice, we encourage users to apply such a scheme when realizing the BDOA.

Layer 2 builds the foundation of Big Data processing upon the infrastructure layer. On this layer, no physical components are directly exposed in the processing of data. System administrators establish and configure data processing tools on virtual machines. For example, data collection is mainly implemented by the open source tool Apache Flume (Apache Flume, 2015), with some extra implementation added for porting ERP systems. In this way, data from users, tools, and the machine environment are acquired. Data transmission and buffering are implemented by Kafka (Apache Kafka, 2015), which has extremely high I/O throughput and allows massive data flows from versatile data collection points. The open source tool Hive (Apache Hive, 2015) is a typical and frequently used tool for data transformation and cleaning. KDrive cleans the dirty data stored in HDFS (Apache Hadoop) and transforms them into structured data tables. Some OLAP work is also performed with Hive, in which the MapReduce operations (Apache Hadoop) are internally executed, allowing higher performance. Data modeling is sometimes an offline task. User behaviors are used to establish user models, while product information is used to build object models. With the help of user models and object models, recommendation can be carried out by analyzing the corresponding user data and object information. Besides, some analysis results include personas and visiting statistics.
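The cleaning and structuring step can be pictured with the Python sketch below. It stands in for the actual Hive jobs, which are not published; the log format and field names are made up. Malformed lines are dropped and the remaining lines become structured rows ready for loading into tables:

def clean_logs(raw_lines):
    # Expected (hypothetical) format: "<timestamp>\t<user_id>\t<action>\t<file_id>".
    rows = []
    for line in raw_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4 or not fields[0].isdigit():
            continue                                  # dirty record: skip it
        ts, user_id, action, file_id = fields
        rows.append({"ts": int(ts), "user": user_id,
                     "action": action, "file": file_id})
    return rows

raw = ["1446163200\tu42\tupload\tf7", "corrupted line", "1446163260\tu42\tshare\tf7"]
print(clean_logs(raw))    # two structured rows; the corrupted line is discarded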

Layer 3 implements various service APIs for "users", including enterprise users, public users and program-based users. All of the APIs are presented on a platform called the KDrive OpenAPI Market (Kingdee, 2015). According to the usage classification (i.e., enterprise use, private use, or programmatic use), the API Marketplace has a built-in corresponding feature called "credential control". A smart API gateway is implemented to enable this feature.

As mentioned before, one type of business scenario is to leverage Big Data to enrich a business process so as to increase the efficiency and effectiveness of business operations. Layer 4 serves this purpose. Using the cloud drive in the enterprise storage scenario helps exchange working documents more efficiently. Another benefit is the reduced cost of constructing duplicated storage clusters for business applications.

The primary goal of Big Data-enabled business processes is to enhance the efficiency and effectiveness of process execution by means of business intelligence and knowledge work. The enterprise-level cloud drive provides facilities for knowledge mining, for instance, around the activity of using the function called "team collaboration". After harvesting the frequency of different activities, we can understand how much vitality a group or an organization has by analyzing the corresponding behavioral data. Each "team collaboration" is intrinsically a process, telling where a file comes from and where it goes. A large number of collaborations create multiple processes. In these processes, we apply machine learning techniques to improve user experience, for instance, to automatically replicate frequently used files to a data node close to a frequent user. Moreover, we apply the trained models in these processes and use clustering methods to differentiate the distribution of the copies of files across different data nodes globally, which helps to reduce the overall cost.
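A hedged sketch of the clustering step: file access records are grouped by a toy two-dimensional feature (requesting region and access frequency) so that extra replicas can be placed near the dominant clusters. Scikit-learn is used here purely for illustration; the real feature encoding and model are not published.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features per access record: (requesting region id, access frequency)
accesses = np.array([[1, 40], [1, 35], [2, 5], [5, 60], [5, 55], [5, 70]])

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(accesses)
print(kmeans.labels_)            # cluster id assigned to each access record
print(kmeans.cluster_centers_)   # candidate regions for placing additional file copies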

Layer 5 is the layer that has direct access to users. As a result, interactions between the internal system and the outside world mostly happen in this layer. According to the interaction roles, the cloud drive sets up different access points for different roles. For example, for the system administrator and custodian, it provides a portal, while for the users it provides a website and mobile apps (both iOS and Android). For programs, the API is the only interface. KDrive is not only for storing files from PCs and servers; it also supports keeping files from embedded systems and IoT (Internet of Things) devices. Therefore, the interaction hardware includes PCs, mobile devices, IoT sensors and IoT gateways.

On Layer 6, KDrive stores three types of data: user files; log files generated from applications, the web and internal components; and program-generated files. Similar to Figure 4, Kafka is used for transmitting the messages; it also transmits small files from the source to the sink. At the source, Apache Flume and a dedicated SDK continuously collect different types of data, both structured and unstructured. At the sink end, the first-level databases, MySQL and MongoDB, are used to store real-time data. These data are then sent to the second-level database, HDFS. Finally, all sorts of data are collected and gathered into the data warehouses, Cassandra and HBase. With these three levels of databases, the cloud drive enables the integration of all kinds of data sources generated from enterprise scenarios.
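The transmission path can be sketched with the kafka-python client; this is an assumption on our part, since the actual KDrive collectors, broker addresses and topic names are not published:

import json
from kafka import KafkaProducer

# Hypothetical broker address and topic name.
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

event = {"ts": 1446163200, "user": "u42", "action": "upload", "file": "f7"}
producer.send("kdrive-user-behavior", event)   # first-level consumers persist this to MySQL/MongoDB
producer.flush()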

Security is always the most important concern in storage applications. KDrive applies data access control techniques and anonymization techniques to maintain security. Data access control is done through a multi-level gateway in front of the micro-services. For example, a user first has to log in to the account with a name and password, and then the backend exchanges the authentication information, including the app id, app key and token, with the servers. KDrive also records the visiting frequency to detect hacking behavior. Anonymization is implemented by a consistent resource pool technique. There is no plaintext storage of user information in any database. In addition, all transmissions are encrypted.
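A minimal sketch of the second step of that multi-level gateway, assuming an HMAC-signed token derived from the app id and app key; the actual KDrive scheme is not published:

import hmac, hashlib, time

def issue_token(app_id, app_key, ttl=3600):
    expires = int(time.time()) + ttl
    sig = hmac.new(app_key.encode(), ("%s:%d" % (app_id, expires)).encode(),
                   hashlib.sha256).hexdigest()
    return "%s:%d:%s" % (app_id, expires, sig)

def verify_token(token, app_key):
    app_id, expires, sig = token.split(":")
    expected = hmac.new(app_key.encode(), ("%s:%s" % (app_id, expires)).encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and int(expires) > time.time()

token = issue_token("demo-app", "secret-key")
print(verify_token(token, "secret-key"))   # True until the token expires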

In BDOA, Layer 8 defines the QoS models. KDrive measures availability, reliability and response time with its QoS monitoring interface. One important purpose of checking these parameters is to drive dynamic resource allocation. KDrive applies lightweight virtualization techniques for scaling: once the allocated resources approach the upper limit, it conducts an automatic scaling of storage capability and services. During the scaling process, the scalability is checked and verified against the service contract, assuring the acceptable user experience defined by the QoS contract.
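A hedged sketch of that scaling rule, with made-up thresholds: when the monitored storage utilization approaches the allocated limit, an additional storage unit is requested, but never beyond the capacity fixed in the service contract.

def scaling_action(used_gb, allocated_gb, contract_max_gb, threshold=0.85, step_gb=100):
    # Scale out when utilization nears the limit, without exceeding the contracted capacity.
    if used_gb / allocated_gb < threshold:
        return allocated_gb                      # utilization is healthy: keep the current allocation
    return min(allocated_gb + step_gb, contract_max_gb)

print(scaling_action(430, 500, 1000))   # 600: utilization of 86% triggers one scaling step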

Metadata management is a key factor in managing a tremendous Big Data system. It is a key starting point to define a reasonable and widely agreed metadata management policy with data users and inter-organizational partners. Once a common agreement is made, the other factors become much more executable. It is difficult to collect multiple data sources and store many types of data. Nevertheless, because KDrive defined metadata storage policies and data exchange policies at the very beginning, this good kick-start made the subsequent development processes and business negotiations less difficult and smoother.

3.2 AWS ARCHITECTURE

The Amazon Web Services (AWS) architecture overview with development and application hosting (T. Lococo, 2015) is shown in Figure 8. The technical stack with the 'AWS Global Infrastructure', 'Networking', 'Compute' and 'Storage' layers maps to Layer 1 of BDOA. The 'Database', 'Analytics' and 'Application Services' layers map to Layers 1, 2 and 3 of BDOA, respectively. AWS offers a number of API marketplaces, applications, development tools, and libraries to developers. In BDOA, the corresponding functionalities are provided in Layers 3 and 4. Moreover, BDOA possesses an extra outer layer, 'Big Data Interactions', which offers dedicated logic for rich interactions. In the near future, we would like to see IoT, VR, AR and many other interaction devices bring more emotional and engaging interactions to end users. The vertical layers named 'Security' and 'Operation' map to Layers 7 and 9 of BDOA, respectively. AWS does not highlight dedicated architecture layers for data integration and QoS control, but the corresponding functions are actually embedded into the other layers. The BDOA provides more detailed layers, which not only improve readers' understanding but also meet the potential referencing needs of a more finely segmented Big Data market in the future.

Figure 8. AWS Architecture Overview (T. Lococo, 2015)

4 CONCLUSION

The paper proposes the Big Data Open Architecture (BDOA). As mentioned at the beginning, we aim to offer readers not only the concepts but also the mapped open source solutions and some industrial use cases. Therefore, in the introduction of each layer, a detailed explanation of the high-level Architecture Building Blocks (ABBs) is given, followed by some open source tools that may support the realization of those ABBs. After describing the BDOA, we give a concrete scenario of an enterprise cloud drive that uses BDOA as a reference architecture.

As an overview, the paper cannot cover all aspects of Big Data, in particular the emerging technologies. We leave two open questions to readers: what is the trend of the Big Data ecosystem? What kinds of components should be added as high-level ABBs to the BDOA? We are also looking for the answers and expect that newly introduced technologies, knowledge and industrial practices of Big Data will be incorporated into subsequent versions of BDOA. Future work may include extending BDOA as a reference architecture and using BDOA as a reference framework for consulting.

5 ACKNOWLEDGMENT

The Services Society (S2) would like to thank all the volunteers who have contributed edits, comments and suggestions on the open architecture. The following authors deserve our great appreciation: Wu Chou partly drafted Layer One; Ke Ning and Xiangang Chen, Layer Two; Bo Hu, part of Layer Three; Yutao Ma, Layer Four; Zibin Zheng, Layer Eight; Min Luo, Layer Nine; and Bing Li, some contents of the ecosystem. We would also like to express our gratitude to Jiuyun Xu, Liang Chen and Xiwei Liu for their constructive comments on editing the paper.

6 REFERENCES

Services Society. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.servicessociety.org/en/introduction.html.

Baidu. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.baidu.com.

Google. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.google.com.

Openstack. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.openstack.org.

The R Project. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.r-project.org.

Apache Flume. (2015). Homepage. Retrieved Oct 30, 2015, from: http://flume.apache.org.

Apache Hadoop. (2015). Homepage. Retrieved Oct 30, 2015, from: http://hadoop.apache.org.

Apache Spark. (2015). Homepage. Retrieved Oct 30, 2015, from: http://spark.apache.org.

Apache Storm. (2015). Homepage. Retrieved Oct 30, 2015, from: http://storm.apache.org.

Apache Mahout. (2015). Homepage. Retrieved Oct 30, 2015, from: http://mahout.apache.org.

Cloudera Oryx. (2015). Homepage. Retrieved Oct 30, 2015, from: http://github.com/cloudera/oryx.

Scikit-Learn. (2015). Homepage. Retrieved Oct 30, 2015, from: http://scikit-learn.org/stable/.

Golearn. (2015). Homepage. Retrieved Oct 30, 2015, from: http://github.com/sjwhitworth/golearn.

Weka. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.cs.waikato.ac.nz/ml/weka/.


WSO2. (2015). Homepage. Retrieved Oct 30, 2015, from: http://wso2.com.

Mashape. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.mashape.com.

D3js. (2015). Homepage. Retrieved Oct 30, 2015, from: http://d3js.org.

ECharts. (2015). Homepage. Retrieved Oct 30, 2015, from: http://echarts.baidu.com.

SiSense. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.sisense.com.

Moojnn. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.moojnn.com.

Wolfram Alpha. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.wolframalpha.com.

GnuPlot. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.gnuplot.info.

ModeShape. (2015). Homepage. Retrieved Oct 30, 2015, from: http://modeshape.jboss.org.

Pentaho. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.pentaho.com.

CWM. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.omg.org/cwm/.

Apache HBase. (2015). Homepage. Retrieved Oct 30, 2015, from: https://hbase.apache.org.

PostgreSQL. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.postgresql.org.

MySQL. (2015). Homepage. Retrieved Oct 30, 2015, from: https://www.mysql.com.

SQL Server. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.microsoft.com/zh-cn/server-cloud/products/sql-server/default.aspx.

Oracle Database. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.oracle.com/cn/database/overview/index.html.

MongoDB. (2015). Homepage. Retrieved Oct 30, 2015, from: https://www.mongodb.org.

BaseX. (2015). Homepage. Retrieved Oct 30, 2015, from: http://basex.org.

Apache CouchDB. (2015). Homepage. Retrieved Oct 30, 2015, from: http://couchdb.apache.org.

Apache Cassandra. (2015). Homepage. Retrieved Oct 30, 2015, from: http://cassandra.apache.org.

Redis. (2015). Homepage. Retrieved Oct 30, 2015, from: http://redis.io.

Memcached. (2015). Homepage. Retrieved Oct 30, 2015, from: http://memcached.org.

Cayley. (2015). Homepage. Retrieved Oct 30, 2015, from: https://github.com/google/cayley.

Neo4j. (2015). Homepage. Retrieved Oct 30, 2015, from: http://neo4j.com.

Brewer, E. (2012). CAP twelve years later: How the "rules" have changed. Computer, 45(2), 23-29.

Apache Parquet. (2015). Homepage. Retrieved Oct 30, 2015, from: https://parquet.apache.org.

Docker. (2015). Homepage. Retrieved Oct 30, 2015, from: https://www.docker.com.

Marathon. (2015). Homepage. Retrieved Oct 30, 2015, from: https://mesosphere.github.io/marathon/.

Mesos. (2015). Homepage. Retrieved Oct 30, 2015, from: http://mesos.apache.org.

Apache Hive. (2015). Homepage. Retrieved Oct 30, 2015, from: https://hive.apache.org.

KVM. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.linux-kvm.org.

Xen. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.xenproject.org.

OpenVZ. (2015). Homepage. Retrieved Oct 30, 2015, from: https://openvz.org.

Kubernetes. (2015). Homepage. Retrieved Oct 30, 2015, from: http://kubernetes.io.

Shipyard. (2015). Homepage. Retrieved Oct 30, 2015, from: https://shipyard-project.com.

DockerUI. (2015). Homepage. Retrieved Oct 30, 2015, from: https://github.com/crosbymichael/dockerui.

Pipework. (2015). Homepage. Retrieved Oct 30, 2015, from: https://github.com/jpetazzo/pipework.

Weave. (2015). Homepage. Retrieved Oct 30, 2015, from: http://weave.works.


Flannel. (2015). Homepage. Retrieved Oct 30, 2015, from: https://github.com/coreos/flannel.

SocketPlane. (2015). Homepage. Retrieved Oct 30, 2015, from: http://socketplane.io.

Apache Kafka. (2015). Homepage. Retrieved Oct 30, 2015, from: http://kafka.apache.org.

RabbitMQ. (2015). Homepage. Retrieved Oct 30, 2015, from: https://www.rabbitmq.com.

Apache ActiveMQ. (2015). Homepage. Retrieved Oct 30, 2015, from: http://activemq.apache.org.

FastMQ. (2015). Homepage. Retrieved Oct 30, 2015, from: https://code.google.com/archive/p/fastmq/.

ZeroMQ. (2015). Homepage. Retrieved Oct 30, 2015, from: http://zeromq.org.

Swarm. (2015). Homepage. Retrieved Oct 30, 2015, from: https://docs.docker.com/swarm/.

Fleet. (2015). Homepage. Retrieved Oct 30, 2015, from: https://github.com/coreos/fleet.

Apache Zookeeper. (2015). Homepage. Retrieved Oct 30, 2015, from: https://zookeeper.apache.org.

ETCD. (2015). Homepage. Retrieved Oct 30, 2015, from: https://github.com/coreos/etcd.

Doozer. (2015). Homepage. Retrieved Oct 30, 2015, from: https://github.com/ha/doozer.

Spagobi. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.spagobi.org.

Birt. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.eclipse.org/birt/.

KNIME. (2015). Homepage. Retrieved Oct 30, 2015, from: https://www.knime.org.

Orange. (2015). Homepage. Retrieved Oct 30, 2015, from: http://orange.biolab.si.

RapidMiner. (2015). Homepage. Retrieved Oct 30, 2015, from: https://rapidminer.com.

KEEL. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.keel.es.

SPMF. (2015). Homepage. Retrieved Oct 30, 2015, from: http://www.philippe-fournier-viger.com/spmf/.

Rattle. (2015). Homepage. Retrieved Oct 30, 2015, from: http://rattle.togaware.com.

Apache Sqoop. (2015). Homepage. Retrieved Oct 30, 2015, from: http://sqoop.apache.org.

Apache Flume. (2015). Homepage. Retrieved Oct 30, 2015, from: https://flume.apache.org.

Apache Chukwa. (2015). Homepage. Retrieved Oct 30, 2015, from: https://chukwa.apache.org.

Kingdee. (2015). KDrive. Retrieved Oct 30, 2015, from http://pan.kingdee.com.

T. Lococo. (2015). AWS Architecture Overview with Development and Application Hosting. Retrieved Nov 1, 2015, from: https://cloudit4you.wordpress.com/tag/aws/.

L.-J. Zhang, J. Zhang, H. Cai. (2007). Services Computing. Beijing: Springer and Tsinghua Univ. Press.

Zhang, L.-J., Zhou, Q. (2009). CCOA: Cloud computing open architecture. In IEEE International Conference on Web Services (ICWS), 607-616.

Authors

Liang-Jie Zhang, a computer scientist, is a former Research Staff Member at the IBM Thomas J. Watson Research Center. He is currently Senior Vice President, Chief Scientist and Director of Research at Kingdee International Software Group Company Limited. He is also the founding Editor-in-Chief of the IEEE Transactions on Services Computing. He was elected a Fellow of the IEEE in 2011 and won the "IEEE Technical Achievement Award" for pioneering contributions to application design techniques in Services Computing.

Huan Chen is a senior research fellow at Kingdee Research, Kingdee International Software Group. His recent research interests are big data architecture, cloud storage and machine learning.


INDEXES

Acquisition......................................................................... 5

Analytic Enablement Service .......................................... 7

Analytics ............................................................................. 6

Anonymization Privacy Preserving ............................ 14

API Gateway ...................................................................... 8

API Marketplace ................................................................ 8

Application of Interaction Devices ............................. 10

Availability ....................................................................... 15

Batch Processing Model .................................................. 6

Big Data Business Understanding/Insights ................. 6

Big Data Enabled Business Process Architecture . 8

Big Data Enabled Business Process Management 8

Big Data Enabled Scenarios ......................................... 8

Big Data Indexing ............................................................. 6

Big Data Integration .................................................... 13

Big Data Interaction Device ...................................... 10

Big Data Interaction Requirement ......................... 10

Big Data Replication and Recovery Mechanism ........ 6

Big Data Service Enablement ...................................... 7

Big Data Service Provisioning ..................................... 7

Big Data Sources .......................................................... 12

Big Data Storage Modeling ............................................ 6

Big Data Transaction Modeling and Versioning........ 6

Big Data Visualization ................................................ 10

Business Alignment...................................................... 16

Business Intelligence and Knowledge Management .. 9

Business Process Life-Cycle Management .................. 8

Capacity ........................................................................... 15

Cleaning ............................................................................. 5

Cloud ................................................................................... 3

Collection ............................................................................ 5

Compliance Management ................................................ 9

Computing Enablement Service ..................................... 7

Content Delivery Network (CDN).............................. 5

Correction ........................................................................... 6

Data Access Control ...................................................... 14

Data Provenance ............................................................ 14

Data Trustworthiness ..................................................... 14

Device-Generated Data ................................................. 12

Digital Watermarking .................................................... 14

Distributed Computing Principles ................................ 6

Distributed File System (DFS) ....................................... 4

Dynamic Resource Allocation ...................................... 15

Elicitation ......................................................................... 15

Exception Documentation ............................................. 15

Execution ......................................................................... 16

Fault Recovery ................................................................ 15

Functional Requirement ................................................ 10

Hybrid Cloud ..................................................................... 3

Integration .......................................................................... 5

Interaction Design Patterns.......................................... 10

Interaction Design Process............................................. 9

Interaction Hardware .................................................... 10

Interactive Big Data Analytics ....................................... 6

Inter-Organization Collaboration ........................... 17

Legacy Data System ....................................................... 5

Machine Learning and Predictive Models .................. 6

Management..................................................................... 15

Metadata Management ............................................... 16

Metrics and Dashboards ................................................. 17

Metrics and Quality Models for Abnormal Behavior ... 15

Modeling ............................................................................ 6


Modeling Algorithms and Evaluation ........................... 6

Negotiation ...................................................................... 15

Non-Functional Requirement ...................................... 10

NoSQL System .................................................................... 3

Parallel Database System ................................................ 3

Policy and Rules ................................................................ 5

Privacy Protection ......................................................... 14

Privacy Protection Technologies ............................. 14

Private Cloud ..................................................................... 3

Process Data Repository Mining ................................... 9

Process Variation Management ..................................... 9

Program-Generated Data ............................................ 12

Prototyping ...................................................................... 10

Public Cloud ....................................................................... 3

QoS Model ...................................................................... 14

QoS Monitoring .............................................................. 15

QoS Objective ................................................................. 15

QoS Prediction ................................................................ 15

QoS Requirement ......................................................... 15

Reliability ......................................................................... 15

Representation ................................................................ 15

Response Time ................................................................ 15

Role Mining ..................................................................... 14

Rules and Policies ......................................................... 17

Scalability ......................................................................... 15

Security Issues ............................................................... 13

Service Contract............................................................ 15

Service Contract Assurance Techniques ............... 15

Service Contract Exception Handling .................... 15

Service Provisioning Platform ....................................... 8

Software Defined Technologies .................................. 4

Software-Defined Data Center (SDDC) ...................... 5

Software-Defined Environment (SDE) ......................... 4

Software-Defined Networking (SDN) ........................... 5

Software-Defined Storage (SDS) ................................... 5

Specification..................................................................... 15

Storage ............................................................................... 3

Storage Enablement Service ........................................... 7

Stream Computing Architecture .................................... 6

Target of Interaction Design .......................................... 9

The Methodology of Big Data Interactions ............ 9

Transmission ...................................................................... 5

Usability Evaluation....................................................... 10

User-Generated Data..................................................... 12

Visualization Enablement Service ................................. 7

Visualization Techniques .............................................. 11

Visualization Tooling ..................................................... 11
