By Firoz Mohamed Kasim, PMP

© 2016 GAVS Technologies. All rights reserved.

MongoDB and Python – Key Ingredients for a Perfect Big Data Recipe

WHITEPAPER

To discover how GAVS can help you innovate and bring greater value to your business, write to [email protected] or visit www.gavstech.com.

Contents

MongoDB and Python: Key Ingredients for a Perfect Big Data Recipe
Open-source is the “way to go” for developing Big Data solutions
The elements of a Big Data solution
Leveraging MongoDB to enhance performance and scalability
Implementing Analytics Framework using Python to accelerate time and value efficiencies
Developing a Customized Dashboard Solution with Python
Efficient Data Sourcing with Bubbles
Log Management in Python
Heralding a new direction in Big Data with Open-source software

Abstract

In today’s highly connected world, enterprises face exponential growth in the volume of data, in both structured and unstructured formats. Broadly referred to as Big Data, the sheer volume and complexity of this data make it difficult to process with traditional data processing methods. Big Data is useful to companies because it leads to deeper insights through more accurate analyses. As a result, an increasing number of organizations are eager to harness its power.

However, deriving accurate and actionable insights from the data requires best-fit solutions built on cost-effective and agile technologies. Innovative open-source products improve accessibility and productivity, support comprehensive data management, and drive more informed decisions.

This paper highlights how open-source technologies such as MongoDB and Python can help achieve a viable, long-term Big Data solution. MongoDB provides high-performance storage, and Python enables efficient Big Data analytics through its powerful libraries.

Open-source is the “way to go” for developing Big Data solutions

Big data analytics has emerged as a key component of the analytics and information management domain. It enables integrated analysis of both structured and unstructured data, and offers powerful insights for making informed decisions and enhancing productivity. However, deriving real business value from big data requires the right tools for capturing and organizing data for analysis and for acquiring business insights. Several challenges must be addressed before deploying a big data analytics platform: selecting the right set of technologies for the diverse needs of the business, integrating myriad data into the platform by synchronizing various data sources, and ensuring easy data accessibility and syndication.

Cost-effective open-source products offer strong capabilities, such as faster time-to-market and advanced technology features, for developing compelling solutions to big data challenges. With open-source products such as MongoDB and Python, it is easier to perform big data analysis, accelerate strategic decisions and derive business value. An idea can be prototyped with free open-source software within a short span of time and demonstrated to the target business audience.

The next section discusses a generalized recipe for an effective big data solution using open-source software.

Core elements of a Big Data solution

A typical big data solution requires a front-end dashboard, an analytics framework that acts as the backbone infrastructure, a data store, and a data sourcing solution. The front-end dashboard displays the results of data crunching; the analytics framework performs in-depth analysis; and a reliable, agile and scalable data store holds the actual data and the processed information. The final element is a reliable, easily replicated channel for sourcing data through Extract, Transform and Load (ETL) processes from transactional applications, social networking sites, mobile platforms, etc.

Leveraging MongoDB to enhance performance and scalability

Various traditional methods and tools can be used for building dashboards, performing analytics, sourcing data from various platforms, and storing a variety of data. However, when building viable big data solutions, it is important to consider the escalating volume of data, which is expanding beyond terabytes into exabytes and zettabytes.

The unstructured nature of the data, which may include graphical content, adds another layer of complexity to building such solutions.

Data is the main “actor” in any big data solution, and no enterprise can afford to lose it permanently or even to have it temporarily unavailable for processing. This reiterates the need for a reliable, highly available and high-performance storage solution. A natural candidate for a NoSQL store capable of handling high volumes of semi-structured and unstructured data is MongoDB, an increasingly popular cross-platform, document-oriented database that is being adopted across industries. It is free and open-source, which allows prototyping without any expenditure, and it provides easy scalability, high performance and high availability.

Because MongoDB is a NoSQL database, it can store data as-is: no predefined structure is required, which is what makes it non-relational. Data is stored as documents made up of field-value pairs. However, it is advisable to have a primary structure in place, at least for long-term integrated solutions, so that data is stored in an organized way and can be retrieved effectively. Ming, a Python framework for MongoDB that is quite popular in enterprise circles, assists with this organized storage.
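The idea of keeping a “primary structure” in place can be sketched without any particular framework. The example below is a minimal illustration in plain Python and is not Ming's API: the ORDER_SCHEMA mapping, the validate and insert_validated helpers and the FakeCollection stand-in are hypothetical names introduced here. The only assumption about the real driver is that a collection object exposes an insert_one method, as PyMongo collections do.

```python
# Hypothetical schema: field name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "customer": str, "total": float}


def validate(document, schema):
    """Return a list of problems; an empty list means the document conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in document:
            problems.append(f"missing field: {field}")
        elif not isinstance(document[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


class FakeCollection:
    """In-memory stand-in for a collection; it only mimics insert_one."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)


def insert_validated(collection, document, schema=ORDER_SCHEMA):
    """Insert the document only if it matches the schema."""
    problems = validate(document, schema)
    if problems:
        raise ValueError("; ".join(problems))
    collection.insert_one(document)


orders = FakeCollection()
insert_validated(orders, {"order_id": 1, "customer": "Acme", "total": 99.5})
```

Rejecting malformed documents at write time keeps every stored document queryable by the same fields, which is the practical benefit of imposing a structure on an otherwise schema-free store.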

There are some trade-offs to consider when using MongoDB for enterprise solutions. Though MongoDB offers an extremely simple programming interface for handling large volumes of data and scales out horizontally very well, at the time of writing it does not support multi-document transactions or relational integrity constraints such as foreign keys. Hence, full ACID behavior across documents is not available in these releases (multi-document transactions were added later, in MongoDB 4.0). Also, without an appropriate plan for storing data, such as the one the Ming framework helps enforce, queries can take “forever” to retrieve the right results from enterprise-size databases.

MongoDB recommends replica sets of at least three members, which in the standard configuration means the data is held in three copies. This makes the storage highly reliable and available at all times (high availability) for processing. Sharding is another MongoDB feature, in which data is spread across multiple machines to support ever-growing data volumes (performance and scalability). However, sharding requires careful selection of the shard key so that the data is spread evenly across the machines.
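Why the choice of key matters can be illustrated with a small simulation. This is only a sketch of hash-based placement in plain Python, not MongoDB's actual chunk allocation or balancing; the shard_for and distribution helpers, the four-shard cluster and the user_id and country fields are assumptions made for the example.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key_value, num_shards=NUM_SHARDS):
    """Map a key value to a shard index using a stable hash."""
    digest = hashlib.sha256(str(key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribution(documents, key_field):
    """Count how many documents land on each shard for a given key field."""
    return Counter(shard_for(doc[key_field]) for doc in documents)


# Synthetic documents: 'country' has only two distinct values,
# while 'user_id' is unique per document.
docs = [{"user_id": i, "country": "IN" if i % 2 else "US"} for i in range(1000)]

by_country = distribution(docs, "country")
by_user = distribution(docs, "user_id")

print("shards used with country as key:", len(by_country))
print("shards used with user_id as key:", len(by_user))
```

A key with only two distinct values can never occupy more than two shards, however many machines are added, whereas a high-cardinality key lets the data spread across all of them.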

Developing a Customized Dashboard Solution with Python

Python frameworks such as Flask, Django or Pyramid can be used to create front-end dashboards. Django Dash is a customizable, modular dashboard application framework that allows users to create bespoke dashboards. Flask can be used to develop dashboards from scratch, while Flask-based dashboard solutions can power interactive visualization and reporting.
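A dashboard built from scratch with Flask typically starts with an endpoint that serves aggregated data for the front end to render. The sketch below assumes Flask is installed; the SALES records, the build_summary helper and the /api/summary route are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical records, e.g. results produced by the analytics layer.
SALES = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 30.0},
]


def build_summary(records):
    """Aggregate amounts per region into a JSON-friendly dict."""
    totals = defaultdict(float)
    for record in records:
        totals[record["region"]] += record["amount"]
    return dict(totals)


def create_app(records):
    """Build a Flask app that exposes the summary as a JSON endpoint."""
    # Flask is a third-party package; it is imported here so that the
    # aggregation helper above can be used without Flask installed.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/api/summary")
    def summary():
        return jsonify(build_summary(records))

    return app
```

Calling create_app(SALES).run() starts Flask's development server, after which a charting front end can fetch /api/summary and plot the per-region totals.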

Implementing Analytics Framework using Python to accelerate time and value efficiencies

Python, as a programming language, simplifies the development life cycle. Besides being easy to learn, with simple libraries and community support that make code easy to adapt, it can process large amounts of data using simple data structures. R, MATLAB and Octave are other advanced analytical tools in this category with high processing capabilities. Though they are powerful in their statistical libraries, they are not primarily designed for general-purpose programming tasks such as web and server-side development or graphical interfaces. Python, being a general-purpose language, does not disappoint in these respects. Its easy-to-understand syntax emphasizes readability, which minimizes the cost of program maintenance. Python supports both structured and object-oriented programming for application development.

Python libraries such as NumPy and SciPy provide utilities for number crunching and scientific applications, while Django and Flask provide frameworks suited to web development and deployment. Python also has libraries for a wide range of computing functions such as cryptography, game development, Geographic Information Systems (GIS), GUI programming, multimedia processing, image manipulation, indexing and searching, networking, plotting, multi-language processing, etc.

PyMongo, the Python driver for MongoDB, contains tools for connecting to and working with a MongoDB data store. The Ming framework can be used to channel data from the MongoDB data store for analytical processing; it helps enforce schema-based behavior on the documents obtained from MongoDB within Python applications.
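As an illustration of how an analytics program might pull documents through PyMongo, the sketch below builds a query filter and runs it against a collection. It assumes PyMongo is installed and a MongoDB server is reachable; the function names and the "total" field are hypothetical.

```python
def high_value_filter(min_total):
    """Build a MongoDB query filter for orders at or above min_total."""
    return {"total": {"$gte": min_total}}


def fetch_high_value_orders(uri, db_name, collection_name, min_total):
    """Connect with PyMongo and return the matching documents as a list."""
    # PyMongo is a third-party package and needs a reachable MongoDB server,
    # so it is imported only when this function is actually called.
    from pymongo import MongoClient

    client = MongoClient(uri, serverSelectionTimeoutMS=2000)
    try:
        collection = client[db_name][collection_name]
        return list(collection.find(high_value_filter(min_total)))
    finally:
        client.close()
```

A call such as fetch_high_value_orders("mongodb://localhost:27017", "shop", "orders", 100) would return the matching documents as Python dictionaries, ready for further analysis; the URI, database and collection names here are placeholders.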

Efficient Data Sourcing with Bubbles

Now that the front-end dashboard, analytical processing framework and data storage options are in place, let us turn our attention to sourcing data into a MongoDB data store from external applications or databases. Traditional computing offers various tools for performing this Extract, Transform and Load (ETL) process from multiple sources. Going the traditional route, open-source tools such as Pentaho Data Integration and Talend Big Data Studio fit the bill. While these tools have their own advantages and disadvantages, Python also offers ETL frameworks that rely on metadata for data sourcing, such as Bubbles. Bubbles works with abstract data objects, which can represent CSV files, SQL tables, MongoDB collections, Twitter API objects, etc.
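The three ETL stages can be sketched with the standard library alone. The example below does not use the Bubbles API; it is a generic illustration in which the CSV content, the field names and the extract, transform and load helpers are all hypothetical, and a plain list stands in for the MongoDB collection.

```python
import csv
import io

# Hypothetical CSV export from a transactional application.
RAW_CSV = """order_id,customer,total
1,Acme,99.50
2,,15.00
3,Globex,not-a-number
4,Initech,250.00
"""


def extract(text):
    """Extract: read CSV rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and drop rows that cannot be cleaned."""
    documents = []
    for row in rows:
        try:
            total = float(row["total"])
        except ValueError:
            continue  # skip rows with a malformed amount
        if not row["customer"]:
            continue  # skip rows without a customer name
        documents.append(
            {"order_id": int(row["order_id"]), "customer": row["customer"], "total": total}
        )
    return documents


def load(documents, target):
    """Load: append documents to the target and return how many were loaded."""
    target.extend(documents)
    return len(documents)


store = []  # stand-in for a MongoDB collection
loaded = load(transform(extract(RAW_CSV)), store)
```

With a real data store, the load step would pass the cleaned documents to the collection instead, for example through PyMongo's insert_many.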

Log Management in Python

A necessary feature, even for a prototyping project, is effective log management. Logs are essential for tracking events that occur in an application. Error, warning and informational messages enable debugging when failures occur. Runtime exceptions that prevent code from executing can be investigated after the fact only if logs are kept in persistent storage. Python supports logging at levels such as DEBUG, INFO, WARNING and ERROR through its standard logging module.
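A minimal sketch of persistent, level-based logging with the standard logging module follows. The logger name, the message texts and the temporary log location are assumptions for the example; a real application would write to a fixed, managed path.

```python
import logging
import os
import tempfile

# Hypothetical log location; a real application would use a fixed path.
LOG_PATH = os.path.join(tempfile.mkdtemp(), "app.log")

logger = logging.getLogger("bigdata.demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo output out of the root logger

handler = logging.FileHandler(LOG_PATH, encoding="utf-8")
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)
logger.addHandler(handler)

logger.debug("starting import")
logger.info("loaded %d documents", 2)
logger.warning("skipped %d malformed rows", 2)

try:
    1 / 0
except ZeroDivisionError:
    # logger.exception logs at ERROR level and appends the traceback.
    logger.exception("aggregation failed")

handler.flush()
```

Because the handler writes to a file, the traceback of the failed step survives the process and can be inspected later, which is the point made above about persistent storage.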

Numerous open-source monitoring tools, typically referred to as “logging aggregators”, such as Sentry, Graylog2 and Scribe, can also be used for log management. Raven is an open-source Python client for Sentry. Graylog2 has a graphical interface for searching through log events and has libraries for major languages including Python.

Fig. 1 shows the integration of the various components of the Big Data solution discussed above.

Heralding a new direction in Big Data with Open-source software

Open-source software such as MongoDB and Python brings agility, speed and flexibility to the software development process, revolutionizing the way ideas are transformed into marketable solutions. It heralds a new direction in the Big Data arena by accelerating the maturity of the ecosystem. In the near future, we can expect these complex custom solutions to be developed using graphical plug-and-play architectures with “ready-to-use”, “off-the-shelf”, “open-source” components requiring zero or minimal configuration tweaks.

[Fig. 1 labels: Dashboard (Python Flask, Django, Pyramid, Pyxley); Big Data Application: Analytics Framework & General-purpose Application Features (Python Programs), Log Management (Logging); Data Store (MongoDB); ETL (Bubbles); External Data Sources (DBs, Internet)]

About the Author

Firoz Mohamed Kasim works as a Project/Program Manager at GAVS Technologies Pvt. Ltd., Chennai. He is a certified Project Management Professional (PMP) with around 15 years of experience in the software sector. He also has ITIL – Foundation and FLMI LOMA certifications to his credit. His interests include exploring new technologies and software products, showcasing architecture feasibility using new technologies, mobile app development, etc.

USA

GAVS Technologies N.A., Inc, 10901 W 120th Avenue, Suite 110, Broomfield CO 80021, USA. Tel: +1 303 782 0402, Fax: +1 303 782 0403

GAVS Technologies N.A., Inc, 116 Village Blvd, Suite 200, Princeton, New Jersey 08540, USA. Tel: +1 609 951 2256/7, Fax: +1 609 520 1702

Middle East

GAVS Technologies LLC, Office No. 11, Bldg No : 4, Knowledge Oasis Muscat, Rusayl, Sultanate of Oman. Tel: +968 24449301

GAVS Technologies, P.O.Box : 124195, Office no 202, Al Thuraiya Tower 1, Dubai Internet City, Dubai, UAE. Tel: +971-4-4541234

UK

GAVS Technologies (Europe) Ltd., 3000 Hillswood Drive, Hillswood Business Park, Chertsey KT16 ORS, United Kingdom. Tel: + 44 (0) 1932 796564

INDIA

GAVS Technologies Pvt. Ltd., No.11, Old Mahabalipuram Road, Sholinganallur, Chennai, India - 600 119. Tel: +91 44 6669 4287

[email protected] www.gavstech.com

About GAVS

GAVS Technologies (GAVS) is a global IT services & solutions provider for customers across multiple industry verticals. GAVS offers services and solutions aligned with strategic technology trends to enable enterprises to take advantage of futuristic technologies such as Cloud, IoT, Managed Infrastructure, and Security services.

GAVS has been recognized as an emerging player in the Healthcare Provider IT outsourcing sector by Everest Group, and as a prominent India-based Remote Infrastructure Management player by Gartner.

www.gavstech.com