Report on The Infrastructure for Implementing the Mobile
Technologies for Data Collection in Egypt
Date: 10 Sep, 2017 – Draft v 4.0
Implementing the Mobile Technologies for Data Collection in Egypt – UNECA – Nile University – CAPMAS
Project Infrastructure Report – Draft v 4.0 – 10 Sep 2017
Table of Contents
1. Introduction
2. Infrastructure Reference Architecture
3. Current Status of CPI-Related Solutions
4. Targeted Data Management Continuum
5. Current Infrastructure Architecture
6. Targeted Solution Architecture
7. Recommendations for Applications and Data Management
8. Main Recommended Components
9. Estimated Hi-Level Sizing and Specifications
10. Conclusion and Next Actions
11. References
1. Introduction
Realizing the advantages of using mobile technology for data collection and statistical production, the United Nations Economic Commission for Africa (ECA) is implementing a series of pilot projects on strengthening the capacity of African countries to use mobile technologies to collect data for effective policy and decision making. The pilot projects are designed to be executed by the National Statistical Office (NSO) in collaboration with a Training and Research Institute (TRI) designated by the NSO. The main partner in the project is the NSO in Egypt, called the Central Agency for Public Mobilization and Statistics (CAPMAS). CAPMAS has in turn designated Nile University as the TRI. The main objectives of the pilot project are as follows:
Strengthen the capacity of the country to collect data with mobile technology;
Experiment with self-enumeration using mobile devices to collect data and determine the suitability of such data for the production of statistics;
Strengthen the working relationship between the NSO and the TRI in statistical development.
The focus of this report is to support CAPMAS in installing and/or upgrading its technical infrastructure, including computer servers and software, to receive data from the project and integrate it into standard statistical processes in Egypt, as well as in acquiring handheld devices. Based on several meetings and assessment events with the CAPMAS team, this report describes the current infrastructure and the targeted upgrades. It closes with sizing estimates and recommendations for the Big Data components and platform.
The main infrastructure achievement at CAPMAS is the virtualized data center, which is recommended to be upgraded further to a cloud computing platform. The National Institute of Standards and Technology (NIST) cloud reference architecture is recommended to be used to achieve a private cloud computing platform for this purpose.
2. Infrastructure Reference Architecture
To standardize the infrastructure design for the project, a suitable reference architecture needs to be used. Since cloud computing provides several benefits and the existing data center provides a solid foundation for such an approach, the National Institute of Standards and Technology (NIST) cloud reference architecture will be used, as detailed in reference 2. The key points follow.
The architectural components of the NIST reference architecture describe the important aspects of service deployment and service orchestration. The overall service management of the cloud is acknowledged as an important element in the scheme of the architecture. Business support mechanisms are in place to recognize customer management issues such as contracts, accounting and pricing, and are vital to cloud computing.
The following figure presents an overview of the NIST cloud computing reference architecture, which identifies the major actors, their activities and functions in cloud computing. The diagram depicts a generic high-level architecture and is intended to facilitate the understanding of the requirements, uses, characteristics and standards of cloud computing.
The NIST cloud computing definition is widely accepted as a valuable contribution toward providing a clear understanding of cloud computing technologies and cloud services. It provides a simple and unambiguous taxonomy of three service models available to cloud consumers: cloud software as a service (SaaS), cloud platform as a service (PaaS), and cloud infrastructure as a service (IaaS). It also summarizes four deployment models describing how the computing infrastructure that delivers these services can be shared: private cloud, community cloud, public cloud, and hybrid cloud. Finally, the NIST definition provides a unifying view of five essential characteristics that all cloud services exhibit: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.
The NIST cloud computing reference architecture defines five major actors: cloud consumer, cloud provider, cloud carrier, cloud auditor and cloud broker. Each actor is an entity (a person or an organization) that participates in a transaction or process and/or performs tasks in cloud computing. The following table briefly lists the actors defined in the NIST cloud computing reference architecture:
| Actor | Definition |
| --- | --- |
| Cloud Consumer | A person or organization that maintains a business relationship with, and uses service from, Cloud Providers |
| Cloud Provider | A person, organization, or entity responsible for making a service available to interested parties |
| Cloud Auditor | A party that can conduct independent assessment of cloud services, information system operations, performance and security of the cloud implementation |
| Cloud Broker | An entity that manages the use, performance and delivery of cloud services, and negotiates relationships between Cloud Providers and Cloud Consumers |
| Cloud Carrier | An intermediary that provides connectivity and transport of cloud services from Cloud Providers to Cloud Consumers |
Our focus in this solution is the private cloud model, which needs to be in place at CAPMAS as the infrastructure for the mobile data collection applications as well as the back-end processing technologies. NIST defines a private cloud as one that gives a single Cloud Consumer’s organization exclusive access to and usage of the infrastructure and computational resources. It may be managed either by the Cloud Consumer organization or by a third party, and may be hosted on the organization’s premises (i.e. on-site private clouds) or outsourced to a hosting company (i.e. outsourced private clouds).
3. Current Status of CPI-Related Solutions
Currently, there is neither dedicated infrastructure for CPI-related processing at CAPMAS nor back-end processing components such as database engines or big data platforms to handle data processing, transformation and modeling. Most work is done either manually or in spreadsheets, which are used for processing and estimating the CPI, intermediate statistics and KPIs.
The following statistics, provided by CAPMAS, illustrate the workload of the CPI process in terms of the effort needed from the staff involved:
| KPI | Measure | Description |
| --- | --- | --- |
| Number of Researchers | 141 | Field staff assigned to collect data from the different markets |
| Number of Supervisors | 31 | Field staff assigned to manage the field operation of researchers |
| Number of Researchers per Supervisor | About 5 | The average number of researchers supervised by one supervisor |
| Overall Number of Governorates | 27 | Governorates where field operation takes place |
| Overall Number of Regions | 141 | Regions where markets are located for collecting prices |
| Overall Number of Markets | About 15000 | Markets where prices are being collected |
| Number of Markets per Region | Not specified | Number of markets per region where operation takes place |
| Number of Markets per Researcher | One region per researcher | Number of markets assigned during one month to a single researcher |
| Number of Forms per Researcher | 11 | Number of forms to be completed by a researcher in one month |
| Number of Products per Form | 964 products | Number of products the researcher needs to get prices for on each single form |
| Number of Branch Reviewers | 60 | Number of reviewers assigned to review the collected prices for each branch office |
| Number of Head Office Reviewers | 20 | Number of reviewers at the head office responsible for the final review of prices collected from all field operations |
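To give a feel for the data volume these figures imply, the short calculation below multiplies them out. It is a rough upper-bound sketch only: it assumes that every product on every form yields exactly one price record each month, which the report does not state.

```python
# Figures taken from the CAPMAS workload table above.
RESEARCHERS = 141
SUPERVISORS = 31
FORMS_PER_RESEARCHER = 11   # forms completed per researcher per month
PRODUCTS_PER_FORM = 964     # products priced on each form

# Assumption (not from the report): one price record per product per form.
records_per_researcher = FORMS_PER_RESEARCHER * PRODUCTS_PER_FORM
records_per_month = records_per_researcher * RESEARCHERS
researchers_per_supervisor = RESEARCHERS / SUPERVISORS

print(f"Price records per researcher per month: {records_per_researcher}")
print(f"Price records per month, all researchers: {records_per_month}")
print(f"Researchers per supervisor: {researchers_per_supervisor:.1f}")
```

Under that assumption the field operation produces on the order of 1.5 million price records per month, and the computed ratio of about 4.5 researchers per supervisor is consistent with the "About 5" reported in the table.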
4. Targeted Data Management Continuum
The effectiveness of the mobile data collection solution for the CPI process requires the existence of an enterprise data management platform capable of handling collected data in an integrated, secure and accessible way, so that a collaborative model among researchers, supervisors, CAPMAS branches, central departments and CPI departments can be achieved.
The CPI process at CAPMAS currently lacks such an enterprise data management platform. Most of the process is done manually through paper forms, except for the final analysis, which is conducted using Excel sheets or local desktop software; this forfeits the value of collaborative data models. The target platform and infrastructure should fulfill the following main requirements, split by phase of the data management continuum:
Data Collection: automate data sourcing, review, approval and consolidation through the workflow embedded into the mobile application for the field researchers and their supervisors.
Data Aggregation: after review and approval, the data sourced from the mobile applications needs to be aggregated properly into the backend database through a direct connection and predefined rules set by the CPI department.
Data Matching: extract external data and maintain master data, while providing the ability to query data using predefined queries as well as ad-hoc queries. At the same time, enable augmenting CPI data with other data such as spatial and geolocation data.
Data Quality: provide means for checking and validating data quality during the collection process and after collection, while reviewing in the back office and applying standard CPI statistical analysis.
Data Persistence: retain and organize data for as long as possible, while supporting multi-structured data to reduce the cost of storage.
Data Consolidation: assemble data entities integrated into the back-end systems, with flexible metadata management to ensure accessibility by specific roles.
Data Distribution: enable analysis tools to access, retrieve and communicate data in an intuitive way suitable to each level of CPI employees, as well as structured for branch access and top management reporting.
The new model proposed for the pilot project will address the above requirements for each area, targeting an integrated data management platform that enables data integration, collaboration and retention using the most recent big data management technologies. Data will be transferred directly to secured servers managed internally by CAPMAS, with the following features:
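As an illustration of the Data Quality requirement above, the following is a minimal sketch of a validation check that could run on a collected price before it is accepted. The record fields, the 50% threshold and the function name are hypothetical and chosen only for illustration; the actual rules would be defined by the CPI department.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold: flag month-on-month changes above 50% for review.
MAX_RELATIVE_CHANGE = 0.5


@dataclass
class PriceRecord:
    """Hypothetical shape of one collected price; not the project's schema."""
    product_code: str
    market_code: str
    price: float
    previous_price: Optional[float] = None  # last approved price, if any


def validate(record: PriceRecord) -> List[str]:
    """Return the data-quality issues found; an empty list means the record passes."""
    issues = []
    if not record.product_code or not record.market_code:
        issues.append("missing product or market code")
    if record.price <= 0:
        issues.append("price must be positive")
    elif record.previous_price:
        # Compare against the previous price only when one is available.
        change = abs(record.price - record.previous_price) / record.previous_price
        if change > MAX_RELATIVE_CHANGE:
            issues.append(f"price changed by {change:.0%}; needs supervisor review")
    return issues


print(validate(PriceRecord("P001", "M01", 12.5, 12.0)))
print(validate(PriceRecord("P001", "M01", 30.0, 12.0)))
```

The same check could run twice, once on the tablet at entry time and once in the back office, which matches the "during the collection process and after collection" wording of the requirement.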
End-to-end encryption using existing CAPMAS telecommunication infrastructure.
Reliable simultaneous connections to CAPMAS datacentre servers.
Online/offline synchronization.
GIS Integration.
Multilanguage.
Cloud architecture that can be used by all surveys and by all statistical processes.
Cloud architecture that can easily be used to handle the self-enumeration concept.
5. Current Infrastructure Architecture
At CAPMAS, virtualized data center infrastructure is widely used for other applications and can be leveraged for the CPI project with some modifications and upgrades, as described in the next sections. The current infrastructure is based on VMware virtualization technologies, as detailed in reference 3. The main points follow.
VMware Infrastructure includes the following components, as shown in the figure above:
VMware ESX Server – A production-proven virtualization layer that runs on physical servers and abstracts processor, memory, storage and networking resources so they can be provisioned to multiple virtual machines
VMware Virtual Machine File System (VMFS) – A high-performance cluster file system for virtual machines
VMware Virtual Symmetric Multi-Processing (SMP) – Enables a single virtual machine to use multiple physical processors simultaneously
VirtualCenter Management Server – The central point for configuring, provisioning and managing virtualized IT infrastructure
Virtual Infrastructure Client (VI Client) – An interface that allows administrators and users to connect remotely to the VirtualCenter Management Server or individual ESX Server installations from any Windows PC
Virtual Infrastructure Web Access – A Web interface for virtual machine management and remote console access
VMware VMotion™ – Enables the live migration of running virtual machines from one physical server to another with zero downtime, continuous service availability and complete transaction integrity
VMware High Availability (HA) – Provides easy-to-use, cost effective high availability for applications running in virtual machines. In the event of server failure, affected virtual machines are automatically restarted on other production servers that have spare capacity
VMware Distributed Resource Scheduler (DRS) – Intelligently allocates and balances computing capacity dynamically across collections of hardware resources for virtual machines
VMware Consolidated Backup – Provides an easy to use, centralized facility for agent-free backup of virtual machines. It simplifies backup administration and reduces the load on ESX Server installations
VMware Infrastructure SDK – Provides a standard interface for VMware and third-party solutions to access VMware Infrastructure
6. Targeted Solution Architecture
While leveraging the current virtualized infrastructure through a cloud computing model is the designated approach, the target infrastructure has several roles in making the mobile data collection solution work smoothly as planned. As per reference 4, those roles include:
Support the tablet mobile application communications for the field researcher and supervisor applications.
Enable hosting and running the REST APIs and associated data services developed for the mobile application data interfacing.
Provide Big Data capabilities for long-term data retention and high-performance computing.
To support the tablet mobile application communications for the field researcher and supervisor applications, the following figure shows the communications topology:
System Communication Diagram
The tablet devices are connected to a 4G broadband cellular network.
The end-to-end communication between field devices and the back-end server goes through a Virtual Private Network (VPN) tunnel to ensure data security.
Due to communication limitations, tablet devices should alternate between Online and Offline modes.
In Offline mode, the tablet device can still gather data and store it in a local database that resides on the tablet.
In Online mode, the device can synchronize the local and central database, send and receive messages and perform all other functions that require connectivity.
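The Offline and Online modes described above can be sketched as a local store with a "synced" flag. This is an illustrative sketch only: the table layout, the function names and the `upload` callback (standing in for a call to the backend REST API over the VPN) are assumptions, and an in-memory SQLite database stands in for the on-tablet database.

```python
import sqlite3
from typing import Callable, Dict, List


def open_local_store() -> sqlite3.Connection:
    """Create the local store; ':memory:' stands in for a file on the tablet."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE price_record ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " product_code TEXT NOT NULL,"
        " price REAL NOT NULL,"
        " synced INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def save_offline(conn: sqlite3.Connection, product_code: str, price: float) -> None:
    """Offline mode: store the record locally and mark it as not yet synced."""
    conn.execute(
        "INSERT INTO price_record (product_code, price) VALUES (?, ?)",
        (product_code, price),
    )
    conn.commit()


def synchronize(conn: sqlite3.Connection,
                upload: Callable[[List[Dict]], bool]) -> int:
    """Online mode: push pending records; mark them synced only if upload succeeds."""
    rows = conn.execute(
        "SELECT id, product_code, price FROM price_record"
        " WHERE synced = 0 ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    batch = [{"id": r[0], "product_code": r[1], "price": r[2]} for r in rows]
    if not upload(batch):
        return 0  # connection failed; records stay pending for the next attempt
    conn.executemany(
        "UPDATE price_record SET synced = 1 WHERE id = ?",
        [(r[0],) for r in rows],
    )
    conn.commit()
    return len(rows)


store = open_local_store()
save_offline(store, "P001", 12.5)
save_offline(store, "P002", 7.0)
print(synchronize(store, lambda batch: False))  # simulated failed connection
print(synchronize(store, lambda batch: True))   # simulated successful upload
```

Marking records as synced only after the upload reports success means a dropped 4G connection never loses data; the same records are simply retried on the next synchronization.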
For hosting and running the REST APIs and associated data services developed for the mobile application data interfacing, the following figure shows the main tablet mobile application system components and data flow:
Mobile Tablet Applications System Modules Diagram
Providing Big Data capabilities for long-term data retention and high-performance computing is covered in the next section.
7. Recommendations for Applications and Data Management
As described in the previous section on the tablet mobile application system components, the CAPMAS Backend Server is the landing space for data collected by the field researchers and supervisors. To provide Big Data capabilities for long-term data retention and high-performance computing, and to receive additional data such as self-enumeration and external source integration, additional services will be integrated beneath the backend server that receives the tablet data. The following features will be attained through the additional services:
| # | Feature | Description |
| --- | --- | --- |
| 1 | Distributed Data Management | Data will be stored in distributed blocks on several nodes, which enables granular management, scalability and high-performance computing. |
| 2 | Distributed Processing | Aggregation, transformation, statistical analysis and data modeling will be implemented on a distributed application framework to enable high-performance, scalable and resilient computing. |
| 3 | Batch Loading | Enables ingestion of accumulated data in batches for low-frequency loads. |
| 4 | Streaming Loading | Enables ingestion of data in small, frequent streams in the form of a pipeline of messages or transactions. |
| 5 | In Memory Processing | Runs data analysis on a selected set of data in memory for faster processing and manipulation. |
| 6 | Data Science Modeling | Specialized libraries that implement machine learning, deep learning, statistical modeling, data mining and analysis operations atop the data platform. |
| 7 | Graph Analysis | Components that enable big graph implementation and network analysis models. |
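The Distributed Processing feature above can be illustrated with a small map-and-combine sketch in plain Python: each node computes a partial result on its own block of data, and the partial results are then merged. The report does not specify which index formula CAPMAS uses; the geometric mean of price relatives is used here purely as an illustrative assumption, and the function names and sample prices are hypothetical.

```python
import math
from typing import Iterable, List, Tuple


def partial_log_sum(partition: Iterable[Tuple[float, float]]) -> Tuple[float, int]:
    """Per-node step: sum of log price relatives and the number of records."""
    total, count = 0.0, 0
    for current_price, base_price in partition:
        total += math.log(current_price / base_price)
        count += 1
    return total, count


def combine(partials: List[Tuple[float, int]]) -> float:
    """Merge step: combine per-node results into one index (base period = 100)."""
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return 100.0 * math.exp(total / count)


# Hypothetical (current price, base price) pairs held on two different nodes.
node_a = [(11.0, 10.0), (21.0, 20.0)]
node_b = [(5.5, 5.0)]

index = combine([partial_log_sum(node_a), partial_log_sum(node_b)])
print(f"Illustrative index: {index:.1f}")
```

Because only a running sum and a count travel between nodes, the same pattern scales to many worker nodes; a framework such as Apache Spark would manage the partitioning and merging that are written out by hand here.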
8. Main Recommended Components
Based on the previous sections on current status and targeted requirements, several components need to be installed to achieve the needed upgrades of the existing infrastructure. The following list describes the recommended components, subject to review during the implementation of the infrastructure upgrades and setup:
- VMware vCloud Suite: Extends the current virtualized infrastructure into cloud management. vCloud Suite is an integrated offering that brings together VMware’s vSphere hypervisor and the VMware vRealize Suite multi-vendor hybrid cloud management platform. VMware’s portable licensing units allow vCloud Suite to build and manage vSphere-based private clouds. It accelerates application delivery across both traditional and container-based applications by giving developers the freedom to use the tools that make them most productive, while still ensuring that applications can be moved seamlessly from developer laptop to production.
- Apache Hadoop Distributed File System (HDFS): A distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project.
- Apache YARN: The fundamental idea of YARN is to split the functionalities of resource management and job scheduling/monitoring into separate daemons, with a global Resource Manager (RM) and a per-application Application Master (AM). An application is either a single job or a DAG of jobs. The Resource Manager and the Node Manager form the data-computation framework. The Resource Manager is the ultimate authority that arbitrates resources among all the applications in the system. The Node Manager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the Resource Manager/Scheduler. The per-application Application Master is, in effect, a framework-specific library tasked with negotiating resources from the Resource Manager and working with the Node Manager(s) to execute and monitor the tasks.
- Apache Spark: A fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
- Apache Hive: Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and a JDBC driver are provided to connect users to Hive.
- Apache HBase: Provides random, real-time read/write access to Big Data. The project's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
- Apache Oozie: A workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable and extensible system.
- Apache Tez: An application framework that allows for a complex directed acyclic graph of tasks for processing data. It is currently built atop Apache Hadoop YARN. It provides expressive dataflow definition APIs, a flexible Input-Processor-Output runtime model, data type agnosticism, simplified deployment, performance gains over MapReduce, optimal resource management, plan reconfiguration at runtime and dynamic physical data flow decisions.
- Apache Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.
- Apache Sqoop: A tool designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
- MongoDB: A document database that combines scalability and flexibility with querying and indexing. MongoDB stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structure can be changed over time. It will be used as a document store for unstructured data.
- PostgreSQL: A powerful SQL-based database engine that will be used for landing the data collected by the mobile tablet applications, working behind the data services of the REST APIs. It provides extensive high-performance processing as well as special capabilities such as GIS data handling.
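The Apache Oozie entry above describes workflows as DAGs of actions. The sketch below shows only the underlying idea, namely deriving a valid execution order from action dependencies, using Python's standard `graphlib` module (Python 3.9 or later). It is not Oozie's workflow definition format, and the action names are hypothetical stages of a CPI pipeline invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each action maps to the set of actions it depends on.
workflow = {
    "import_from_postgresql": set(),
    "validate_prices": {"import_from_postgresql"},
    "archive_to_hdfs": {"validate_prices"},
    "aggregate_by_region": {"validate_prices"},
    "compute_cpi": {"aggregate_by_region"},
    "publish_report": {"compute_cpi"},
}

# static_order() yields every action after all of its dependencies.
execution_order = list(TopologicalSorter(workflow).static_order())

for step, action in enumerate(execution_order, start=1):
    print(step, action)
```

Actions with no dependency between them, such as the archiving and aggregation steps here, may appear in either order, which is what allows a workflow scheduler to run them in parallel.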
9. Estimated Hi-Level Sizing and Specifications
The following table lists the estimated sizing of the infrastructure required for deploying and running the aforementioned components. The sizing will be revised during implementation, taking advantage of the cloud approach deployed on top of the virtualized infrastructure at the CAPMAS data center:
| # | VM Function | Estimated Node Sizing |
| --- | --- | --- |
| 1 | 2 x Name Nodes | 4 Cores 3.0 GHz; 16 GB RAM; 200 GB Storage; Linux OS |
| 2 | 2 x Resource Scheduling Nodes | 4 Cores 3.0 GHz; 16 GB RAM; 200 GB Storage; Linux OS |
| 3 | 8 x Worker Nodes | 2 Cores 3.0 GHz; 8 GB RAM; 500 GB Storage; Linux OS |
| 4 | 2 x Document Services Nodes | 4 Cores 3.0 GHz; 16 GB RAM; 500 GB Storage; Linux OS |
| 5 | 2 x REST APIs Hosting Nodes | 4 Cores 3.0 GHz; 16 GB RAM; 100 GB Storage; Linux OS |
| 6 | 2 x Central Database Nodes | 4 Cores 3.0 GHz; 16 GB RAM; 500 GB Disk Space; Linux OS |
| 7 | 2 x Back Office Applications | 4 Cores 3.0 GHz; 8 GB RAM; 200 GB Disk Space; Windows Server |
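The per-node figures in the sizing table can be totalled to estimate what the virtualized data center has to provide overall. The sketch below does that arithmetic. The usable-capacity line additionally assumes the HDFS default replication factor of 3, which the report does not state, so it should be read as an indication only.

```python
# (vm_function, node_count, cores, ram_gb, storage_gb), copied from the sizing table.
NODES = [
    ("Name Nodes", 2, 4, 16, 200),
    ("Resource Scheduling Nodes", 2, 4, 16, 200),
    ("Worker Nodes", 8, 2, 8, 500),
    ("Document Services Nodes", 2, 4, 16, 500),
    ("REST APIs Hosting Nodes", 2, 4, 16, 100),
    ("Central Database Nodes", 2, 4, 16, 500),
    ("Back Office Applications", 2, 4, 8, 200),
]

total_vms = sum(count for _, count, _, _, _ in NODES)
total_cores = sum(count * cores for _, count, cores, _, _ in NODES)
total_ram_gb = sum(count * ram for _, count, _, ram, _ in NODES)
total_storage_gb = sum(count * storage for _, count, _, _, storage in NODES)

# Assumption (not from the report): HDFS keeps 3 copies of every block.
HDFS_REPLICATION = 3
worker_raw_gb = sum(count * storage for name, count, _, _, storage in NODES
                    if name == "Worker Nodes")
hdfs_usable_gb = worker_raw_gb // HDFS_REPLICATION

print(f"VMs: {total_vms}, cores: {total_cores}, RAM: {total_ram_gb} GB, "
      f"storage: {total_storage_gb} GB")
print(f"Approximate usable HDFS capacity: {hdfs_usable_gb} GB")
```

In total the estimate comes to 20 virtual machines, 64 cores, 240 GB of RAM and 7,400 GB of storage. With three-way replication, the 4,000 GB of raw worker storage would hold roughly 1,300 GB of data, before any space reserved for the operating system and temporary files.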
10. Conclusion and Next Actions
The virtualized infrastructure achieved at CAPMAS paves the way for building a solid foundation for the mobile data collection solution, as well as for other potential data solutions and integration with external data sources. To leverage this achievement, two main additional layers need to be built:
Extending Virtualization to Cloud Platform
Deploying Big Data Management Platform
The next actions include starting the implementation plan for the two items above, for which an implementation team needs to be engaged, while ensuring complete know-how transfer to the CAPMAS team, especially on the Big Data management solutions, and extending the backend capabilities to support the mobile data collection solution, which is the main focus of this pilot project.
11. References
1- UNECA – CAPMAS – Nile University Letter of Agreement (LoA).
2- Cloud Computing Reference Architecture: Recommendations of the National Institute of Standards and Technology http://ws680.nist.gov/publication/get_pdf.cfm?pub_id=909505
3- VMware Virtualization Documentation https://docs.vmware.com/en/VMware-vSphere/index.html
4- CAPMAS Pricing Tablet Application Requirements and Design Document.
5- VMware vCloud Suite https://www.vmware.com/products/vcloud-suite.html
6- Apache Hadoop Main Page http://hadoop.apache.org/