Case Study - Text Analytics on Unstructured Data

Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a large Enterprise




2 © Happiest Minds – Confidential

Our Business

Digital transformation for enterprises and technology providers, leveraging an integrated set of disruptive technologies:

• Big Data & Analytics
• Mobility
• Security
• Cloud
• Social Computing
• Unified Communications
• BPM, Workflow, Business Integration
• IoT

Digital Enterprise


Infrastructure Transformation and Managed Services

Cloud and Next Gen Data Center Services; Unified Communication and Device Management

End to End Infrastructure Advisory Services
• Cloud Adoption Strategy
• Data Center Consolidation
• Software Defined Data Center
• Messaging & Collaboration
• Audio/Video
• Big Data Adoption
• BYOD/Mobility
• Smart Workspace Computing
• Virtualization Adoption
• ITSM Definition

End to End Infrastructure Transformation Services
• Public / Hybrid Cloud Transformation
• Private Cloud Migrations
• Virtualization: Server, Storage, Network
• Software Defined DCS, Network, Storage
• Next Gen DCS Provisioning & Mgmt.
• Converged Systems Migration
• Automation & Administration
• VDI Migration
• Unified Device Migration
• Messaging / Collaboration
• Audio/Video/IPT Enhancements
• Mobility / BYOD
• Big Data: MongoDB, Database and Hadoop migration/acceleration

Managed Infrastructure Services (GNOC), 24x7: Basic, Advanced and Enhanced Managed Infra Services
• Service Desk
• NOC Services: Server, Network, Storage, Database, Backup/Archival, Asset Mgmt., Vendor Mgmt.
• Database Monitoring and Mgmt.
• Cloud Operations Monitoring and Mgmt.
• Mobility / BYOD Monitoring and Mgmt.
• Big Data Provisioning and Mgmt.
• Audio/Video/IPT Monitoring and Mgmt.


Our Expertise

Data at Rest, Data in Motion
Structured, Multi-structured

• Flume, Sqoop, Impala, Kafka, Pentaho
• Hadoop: Apache, Cloudera, Hortonworks
• Columnar: HBase, Cassandra
• Document: MongoDB
• MPP: Greenplum
• Distributed: MapReduce
• Streaming: Storm
• Documentation & Indexation: SolrCloud, Lucene
• Visualization: Tableau, QlikView
• Predictive Analysis, Machine Learning, Text Mining, NLP: R, Revolution R, Python, Mahout

Tableau, QlikView, Cloudera, Cassandra, Revolution Analytics, Nagios, Apache Ambari, Ganglia, 10Gen, DataSta, Platfora


CASE STUDY


About Customer

• A multinational corporation based in India with revenue of US$33 billion.
• Involved in steel, energy and infrastructure services.
• Operations in 29 countries, employing over 60,000 people.
• The pilot is for the Oil and Gas division of the company.


Business Requirement

• What transactions happened, with whom, and when.
• Regulatory requirement: communication logs must be kept and reviewed by auditors.
• Primary use case: text analytics on email, chat and audio data combined, to spot deceitful transactions.


The Business Problem

• No single view of all communications that happened through email, chat and voice.
• The auditors' review process was a daunting task: they had to read through numerous email and chat files and listen to audio files to qualify a transaction as 'clean'.
• Heavy dependency on the people maintaining these file systems.
• No support for any scientific reasoning to back the auditors' findings.
• Brand reputation was at risk.


Data Challenge

• Semi-structured and unstructured data: email, voice and chat.
• There is no unique ID to bind the communication across the channels.
• Need for an algorithm for deep relevancy calculation.
• More documents are added to the collections all the time.
• Extracting features from the documents is a challenge due to high dimensionality and latent variables.


The Approach

• Extract data from email, chat messages and audio files: Java.
• Preprocess data (harvesting, synchronizing and harmonizing rich data from the communication media): Python and Java.
• Storing, accessing and processing: MongoDB.
• Cluster coherent documents based on topic using LDA: Java MapReduce.
• Report generation: BIRT.
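The harmonizing step above can be pictured as mapping each channel's records onto one common document shape before loading them into MongoDB. The following is a minimal Python sketch under stated assumptions: the output field names (`channel`, `sender`, `recipients`, `sent_at`, `text`) and the input record layouts are hypothetical and are not taken from the pilot.

```python
from datetime import datetime, timezone


def _utc(ts):
    """Parse an ISO-8601 timestamp and normalize it to UTC."""
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def _common(channel, sender, recipients, sent_at, text):
    """Build the shared document shape used for every channel."""
    return {
        "channel": channel,
        "sender": sender.strip().lower(),
        "recipients": sorted(r.strip().lower() for r in recipients),
        "sent_at": _utc(sent_at),
        "text": text.strip(),
    }


def harmonize(channel, raw):
    """Map one raw record (hypothetical layouts) onto the common shape."""
    if channel == "email":
        return _common("email", raw["from"], raw["to"], raw["date"], raw["body"])
    if channel == "chat":
        return _common("chat", raw["user"], [raw["peer"]], raw["timestamp"], raw["message"])
    if channel == "audio":
        # Assumes a transcript was produced upstream; speech-to-text is out of scope.
        return _common("audio", raw["caller"], [raw["callee"]], raw["started_at"], raw["transcript"])
    raise ValueError(f"unknown channel: {channel}")


email_doc = harmonize("email", {
    "from": " Alice@Example.com ",
    "to": ["bob@example.com"],
    "date": "2015-06-10T09:30:00+05:30",
    "body": " Please confirm the price. ",
})
```

Because every channel ends up with the same keys, one query on `sender` or `sent_at` can cover email, chat and audio together.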


Why MongoDB

Storing email, chat and voice files is in itself a problem. A SQL database is not the right fit, given its enforced constraints and strictly relational model.

• With all the files as JSON objects, it made sense to use MongoDB to take in these objects and query them efficiently.

• MongoDB's support for an extremely simple and flexible data model allowed storing similar but different objects (chat and email from a user) as embedded documents, rather than storing data in different relational tables and relying on complex joins to retrieve it.
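To illustrate the embedded-document idea, here is a small sketch of one user document that holds differently shaped email and chat records side by side. It is written as a plain Python dictionary so it runs without a database; all field names and values are hypothetical.

```python
from datetime import datetime

# Hypothetical document: one trader with embedded communications.
# The email and chat entries carry different fields, which a fixed
# relational schema would typically split across separate tables.
trader_doc = {
    "_id": "trader-001",
    "name": "Example Trader",
    "communications": [
        {
            "channel": "chat",
            "sent_at": datetime(2015, 6, 10, 9, 45),
            "room": "deal-desk",
            "message": "Price confirmed.",
        },
        {
            "channel": "email",
            "sent_at": datetime(2015, 6, 10, 9, 30),
            "subject": "Quote request",
            "to": ["supplier@example.com"],
            "body": "Please confirm the price.",
        },
    ],
}


def timeline(doc, channel=None):
    """Return embedded communications in time order, optionally for one channel."""
    items = doc["communications"]
    if channel is not None:
        items = [c for c in items if c["channel"] == channel]
    return sorted(items, key=lambda c: c["sent_at"])
```

Reading the whole conversation history is then a single document fetch followed by `timeline(trader_doc)`, with no join.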


Why MongoDB

• GridFS, MongoDB's storage system for large objects, helped store voice files in raw format. Using MongoDB's efficient sort and filter options, we were able to integrate and retrieve email, chat and voice data as one group.

• MongoDB's Aggregation framework provided an easy way to work on large documents and transform them into aggregated results.

• Java MapReduce was used to construct a term-document matrix to identify shared commonalities in a corpus of documents. LDA (Latent Dirichlet Allocation), an advanced statistical method, was used to determine the topics, and the terms associated with each, from document contents.
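The term-document matrix mentioned above can be sketched in map/reduce style. This is plain Python rather than the Java MapReduce job used in the pilot, the tokenizer is deliberately simplistic, and the LDA step itself is not shown; function names are illustrative only.

```python
import re
from collections import Counter, defaultdict


def map_terms(doc_id, text):
    """Map step: emit ((term, doc_id), 1) for every token in one document."""
    for term in re.findall(r"[a-z0-9]+", text.lower()):
        yield (term, doc_id), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per (term, doc_id) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def term_document_matrix(corpus):
    """Build {term: {doc_id: count}} from a {doc_id: text} corpus."""
    emitted = (
        pair
        for doc_id, text in corpus.items()
        for pair in map_terms(doc_id, text)
    )
    matrix = defaultdict(dict)
    for (term, doc_id), count in reduce_counts(emitted).items():
        matrix[term][doc_id] = count
    return dict(matrix)


corpus = {"d1": "Steel price, steel supply", "d2": "Price of oil"}
matrix = term_document_matrix(corpus)
```

Terms that appear in several documents (here `price`) are the shared commonalities a topic model such as LDA would then group into topics.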


Why MongoDB

• MongoDB's dynamic schema was a key feature: it let our developers track a new metric on the fly from arbitrary data pulled from SAP ERP.

• Various indexing techniques, including a text index and additional secondary indexes, allowed us to quickly filter traders by the user's preferred criteria.

• High availability: responsiveness of our service is key to our SLA. MongoDB's replication feature gave us resilience to failures.
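As a rough illustration of the indexing point above, the sketch below lists one text index and two secondary indexes in the `(field, direction)` form that PyMongo's `Collection.create_index` accepts. The field names are hypothetical, and a small stand-in class replaces the real collection so the example runs without a MongoDB server.

```python
# Index key specifications: "text" marks a text index,
# 1 and -1 mean ascending and descending order.
INDEX_SPECS = [
    [("body", "text")],                 # full-text search over message bodies
    [("sender", 1), ("sent_at", -1)],   # a sender's messages, newest first
    [("channel", 1)],                   # filter by email, chat or audio
]


def ensure_indexes(collection, specs=INDEX_SPECS):
    """Create each index on a PyMongo-style collection; return the index names."""
    return [collection.create_index(keys) for keys in specs]


class RecordingCollection:
    """Stand-in for a real collection; it only records what was requested."""

    def __init__(self):
        self.created = []

    def create_index(self, keys):
        self.created.append(keys)
        # Mimics the default index naming scheme, e.g. "sender_1_sent_at_-1".
        return "_".join(f"{field}_{direction}" for field, direction in keys)


names = ensure_indexes(RecordingCollection())
```

Against a live deployment the same function would be called with a real collection object, for example one obtained from `pymongo.MongoClient`; the database and collection names would be whatever the application uses.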


Data Consolidation and Analysis Architecture

Presentation Tier
The topmost level of the application is the user interface. Its main function is to translate tasks and results into something the user can understand.

Logic Tier
This layer coordinates the application, processes commands, makes logical decisions and evaluations, and performs aggregations. It also moves and processes data between the two surrounding layers.

Data Tier
Information is stored in and retrieved from a database or file system. The information is then passed to the logic tier for processing and eventually back to the user.

Components: Email, Chat and Audio sources; Data Parser/Loader; Database; Query; Reports

Example queries:
• Get the list of messages sent to a Yahoo ID within a date range.
• Get the list of all communications sent to a Yahoo ID on June 10.
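The two example queries for this architecture can be expressed as MongoDB filter documents. The sketch below builds them as plain Python dictionaries using the standard `$gte` and `$lt` operators; the field names (`recipients`, `sent_at`, `channel`), the address and the year are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def sent_to_between(recipient, start, end, channel=None):
    """Filter for items sent to `recipient` with start <= sent_at < end."""
    query = {"recipients": recipient, "sent_at": {"$gte": start, "$lt": end}}
    if channel is not None:
        query["channel"] = channel  # e.g. restrict to "email" or "chat"
    return query


def sent_to_on(recipient, day):
    """Filter for all communications sent to `recipient` on one calendar day."""
    start = datetime(day.year, day.month, day.day)
    return sent_to_between(recipient, start, start + timedelta(days=1))


# Hypothetical address; the year is assumed, the slide only says "June 10".
june_10 = sent_to_on("someone@yahoo.com", datetime(2015, 6, 10, 15, 0))
```

With PyMongo, such a dictionary would be passed directly to `collection.find(...)`. Matching a scalar against the `recipients` array works because MongoDB treats equality on an array field as "contains this value".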


Text Analytics Application Model

• Rich Queries: Find everybody who did a $25 million transaction last week, or did a transaction with a supplier in China last week worth $1 million to $25 million.

• Aggregation: What is the number of traded deals for a particular product in a time range?

• Text Search: Find all documents that mention a supplier.

• Map Reduce: List all documents based on a topic or document terms (discovered using the LDA algorithm).
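The first three capabilities above map onto standard MongoDB query constructs. The following sketch builds a rich query with `$or`, an aggregation pipeline with `$match` and `$group`, and a `$text` search filter as plain Python data. Field names such as `amount_usd`, `traded_at`, `supplier.country` and `product`, and the sample dates, are hypothetical.

```python
from datetime import datetime


def rich_query(week_start, week_end):
    """$25M deals in the week, or $1M to $25M deals with a supplier in China."""
    return {
        "traded_at": {"$gte": week_start, "$lt": week_end},
        "$or": [
            {"amount_usd": 25_000_000},
            {
                "supplier.country": "China",
                "amount_usd": {"$gte": 1_000_000, "$lte": 25_000_000},
            },
        ],
    }


def deals_per_product(product, start, end):
    """Aggregation pipeline counting traded deals for one product in a time range."""
    return [
        {"$match": {"product": product, "traded_at": {"$gte": start, "$lt": end}}},
        {"$group": {"_id": "$product", "deals": {"$sum": 1}}},
    ]


def mentions(supplier_name):
    """Text search filter for an exact phrase; needs a text index on the collection."""
    return {"$text": {"$search": f'"{supplier_name}"'}}


# Illustrative values only.
week = (datetime(2015, 6, 1), datetime(2015, 6, 8))
query = rich_query(*week)
pipeline = deals_per_product("crude oil", *week)
text_filter = mentions("Acme Supplies")
```

With PyMongo, `query` and `text_filter` would go to `collection.find(...)` and `pipeline` to `collection.aggregate(...)`.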


What was achieved

• The results were staggering!
• RAAD: Completed pilot development in three weeks, which would otherwise have taken a couple of months.
• Performance: The application responded to user queries within a 50-millisecond window. MongoDB enabled low-latency queries across thousands of documents.


Business Benefits

• Provided a single view of all communications per transaction.
• Auditors' evaluation time was brought down from 2 weeks to 1 day.
• Saved around 300 man-hours that were previously consumed by manual consolidation of data from email, chat and audio servers.
• The text analytics application offered new insights, such as a deeper understanding of the supplier market, which was not possible before.


Thank you
www.happiestminds.com