Enabling Big Data with IBM InfoSphere Optim

Enabling Big Data with InfoSphere Optim Session # ILM-1742A

Vineet Goel, IBM Guenter Sauter, IBM [Product Management]

Please note IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Acknowledgements and Disclaimers

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved.

• U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, InfoSphere, and Optim are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.

Agenda

1  Need for Governing Big Data
2  Data Privacy for Big Data
3  Lifecycle Management for Big Data
4  Test Data Management for Big Data
5  Review

Make Informed Decisions

Uncover competitive advantages

Identify new opportunities

Rapid, easy access to big data, wherever it resides

Easy categorization, indexing, discovery of big data to optimize its usage

Definition and execution of governance appropriate to data value and intended use

Acting on Insight Requires Confidence in Data

Automated Integration Agile Governance Visual Context

Take Bigger, Calculated Risks

Information Integration & Governance for Big Data

IBM Information Integration and Governance portfolio for Big Data


Information Integration & Governance

Data Warehouse

Stream Computing

Hadoop System

Discovery Application Development

Systems Management

BIG DATA PLATFORMS

InfoSphere Information Server: Understand, integrate, deliver and govern data across information systems

InfoSphere Optim: Manage information through its lifecycle while meeting data privacy & retention compliance

InfoSphere Master Data Management: Act on trusted views of your master data to improve your critical business processes

InfoSphere Guardium: Monitor, protect and audit enterprise data to ensure security and compliance

Data Quality, MDM, Privacy & Security, Data Lifecycle, Information Integration

Open Architecture/ Multiple Product Entry Points

IBM Big Data and Analytics Reference Architecture

[Diagram components: Information Ingestion and Integration; Data Exploration; Archive; Real-time Analytics; Enterprise Warehouse; Data Marts; Information Governance, Security and Business Continuity]

Big Data life cycle – from raw to production

search & survey
Users: business users (with an idea), power users, data analysts
• text search
• simple investigations
• peek / poke

exploratory analysis
Users: data scientist / data miner, advanced business user, application developer
• from mountain of data into a structured world with apps to provide business value
• iterative in nature, many false starts, needs many skill sets/people

operational
Users: traditional IT / application developer
• creating/standing up applications, processes, systems with enterprise characteristics
• more formal environment, SLAs, etc.

Requirements change over the course of the life cycle

Initial / exploratory use cases → Used for business decisions
Little security concerns → Protect, Secure, Encrypt
Sporadic change management → Audit trail tracking access & changes
No data retention requirements → Preserve data for N years
Little to no regulation → Legislated requirements
No / isolated data quality concerns → Data quality imperatives
Sources of information are “interesting” → Sources must be trusted

Big Data Best practice processes

People

Process

Technology

It’s not all about technology…

Information Integration & Governance Technology

IBM Information Governance Unified Process

IBM Governance Process

Overview: IBM InfoSphere Optim

Discovery (Discover, Understand, Classify)

§  Accelerate data management projects by discovering complex data relationships & sensitive data elements in your data assets

Data Archiving (Production to Archive: Archive, Retire)

§  Archive cold data to improve application performance & streamline backups
§  Reduce hardware, software, storage & maintenance costs for enterprise applications
§  Support data retention regulations & safely retire legacy/redundant applications

Test Data Management (Production to Dev/Test: Mask, Subset, Compare, Refresh)

§  Reduce cost, reduce risk and speed application delivery by provisioning right-sized test environments
§  Ensure compliance & privacy with test data masking

Data Masking

§  Ensure data privacy compliance by masking sensitive data



Data Privacy Challenges & Considerations

Ø  Customers take “data privacy” seriously!

Ø  Organizations need to de-identify, mask and transform sensitive data in data environments to avoid issues of data breach
    Ø  Privileged access misuse, data theft, data movement across data centers or hosted environments, outside contractors or offshore project teams

Ø  Apply transformation techniques to substitute sensitive data with contextually-accurate but fictionalized data to produce accurate test results

Ø  Support compliance with local, state, national, international and industry-based privacy regulations

Keeping up with Global & Industry Regulations

Canada: Personal Information Protection & Electronic Documents Act
USA: Federal, Financial & Healthcare Industry Regulations & State Laws
Mexico: E-Commerce Law
Colombia: Political Constitution – Article 15
Brazil: Constitution, Habeas Data & Code of Consumer Protection & Defense
Chile: Protection of Personal Data Act
Argentina: Habeas Data Act
South Africa: Promotion of Access to Information Act
United Kingdom: Data Protection Act
EU: Protection Directive
Switzerland: Federal Law on Data Protection
Germany: Federal Data Protection Act & State Laws
Poland: Polish Constitution
Israel: Protection of Privacy Law
Pakistan: Banking Companies Ordinance
Russia: Computerization & Protection of Information / Participation in Int’l Info Exchange
China: Commercial Banking Law
Korea: 3 Acts for Financial Data Privacy
Hong Kong: Privacy Ordinance
Taiwan: Computer-Processed Personal Data Protection Law
Japan: Guidelines for the Protection of Computer Processed Personal Data
India: SEC Board of India Act
Vietnam: Banking Law
Philippines: Secrecy of Bank Deposit Act
Australia: Federal Privacy Amendment Bill
Singapore: Monetary Authority of Singapore Act
Indonesia: Bank Secrecy Regulation 8
New Zealand: Privacy Act

Industry Regulations like:
• PCI-DSS • HIPAA • GLB

PII such as:
• Names • Account # • CCN • SSN • DOB • Addresses • Driving Lic • IP Address • Medical • Telephone #

New IBM Offering: InfoSphere Data Privacy for Hadoop

Business Info Exchange: Discover, Define & Collaborate
Optim & Redaction: Mask & Protect
Guardium: Monitor, Audit & Secure

• Share business glossary, privacy policies, project blueprints
• Protect structured and unstructured data
• De-identify sensitive data at source or within Hadoop
• Centralized reporting of audit data
• Enforce security policies
• Explore data lineage
• Discover relationships & sensitive data
• Monitor & audit activities in Hadoop

Business Information Exchange

Collaborate on big data reference architecture and define a common business language

Requirements

§  Facilitate business & IT communications via a common business vocabulary
§  Specify information governance policies and rules
§  Understand where data comes from and where it goes

Benefits

§  Facilitates collaboration on reference architectures, leveraging the same vocabulary
§  Aligns the efforts of IT with goals of the business


What is data masking?

• Definition: A method for creating a structurally similar but inauthentic version of an organization's data. The purpose is to protect the actual data while having a functional substitute for occasions when the real data is not required.

• Requirement: Effective data masking requires data to be altered in such a way that the actual values cannot be determined or reverse-engineered, while functional appearance is maintained.

• Other terms used: Obfuscation, scrambling, data de-identification

• Commonly masked data types: Name, address, telephone, SSN/national identity number, credit card number

• Methods
    o  Static Masking: Obfuscating data values that ultimately get persisted in the updated database. Often rows are moved and masked as a single operation, though data may be updated in place.
    o  Dynamic Masking: Masks specific data elements on the fly without modifying the applications or physical production data store.


Data Masking

InfoSphere Optim

Mask

Before Masking After Masking

§  Protect sensitive information (PII) from misuse and fraud and data breaches

§  Protect confidential data while preserving analytics

§  Achieve better information governance & regulations compliance

§  Mask data in dbms, delimited text files, or in ETL

§  Mask sensitive data in Hadoop using MapReduce

§  Proven masking algorithms

§  Callable masking APIs

Requirements

Benefits

CSV, More…

Hadoop

Anonymize sensitive information used in Hadoop with realistic but fictional data

• Mask at the source • Mask in-flight • Mask in-Hadoop (MapReduce)


Example 2 Example 1

Before masking

Personal Info Table
PersNbr  FirstName  LastName
08054    Alice      Bennett
19101    Carl       Davis
27645    Elliot     Flynn

Event Table
PersNbr  FstNEvtOwn  LstNEvtOwn
27645    Elliot      Flynn
27645    Elliot      Flynn

After masking

Personal Info Table
PersNbr  FirstName  LastName
10000    Jeanne     Renoir
10001    Claude     Monet
10002    Pablo      Picasso

Event Table
PersNbr  FstNEvtOwn  LstNEvtOwn
10002    Pablo       Picasso
10002    Pablo       Picasso

InfoSphere Optim Data Masking Techniques

A comprehensive set of data masking techniques to transform or de-identify data, including:

• String literal values
• Random or Sequential numbers
• Lookup / Hashing
• Credit Cards
• Arithmetic expressions
• Concatenate or Substring
• Format-Preserving
• National ID / SSN
• Shuffling
• Date Variance
• User Defined
• Email

Referential integrity is maintained with key propagation
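The idea of key propagation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Optim's implementation: the lookup list `REPLACEMENT_NAMES`, the `SECRET` value and the helpers `mask_key` and `mask_row` are hypothetical. Because every masked value is derived deterministically from the original key, the same person receives the same masked key and name in every table, so joins keep working. A real tool would also have to guarantee that two different keys never map to the same masked key, which this hash-based sketch does not.

```python
import hashlib

# Hypothetical replacement names; a real lookup table would be far larger.
REPLACEMENT_NAMES = [
    ("Jeanne", "Renoir"), ("Claude", "Monet"), ("Pablo", "Picasso"),
    ("Berthe", "Morisot"), ("Edgar", "Degas"), ("Mary", "Cassatt"),
]
SECRET = "demo-secret"  # assumption: a private value that keeps the mapping hard to guess


def _digest(value: str) -> int:
    """Stable integer derived from the secret and the original value."""
    return int(hashlib.sha256((SECRET + value).encode("utf-8")).hexdigest(), 16)


def mask_key(pers_nbr: str) -> str:
    """Map an original key to a 5-digit masked key, always the same way."""
    return f"{_digest(pers_nbr) % 100000:05d}"


def mask_row(row):
    """Mask (PersNbr, first name, last name), deriving every value from the key."""
    pers_nbr, _first, _last = row
    first, last = REPLACEMENT_NAMES[_digest("name:" + pers_nbr) % len(REPLACEMENT_NAMES)]
    return (mask_key(pers_nbr), first, last)


personal_info = [("08054", "Alice", "Bennett"), ("19101", "Carl", "Davis"), ("27645", "Elliot", "Flynn")]
event = [("27645", "Elliot", "Flynn"), ("27645", "Elliot", "Flynn")]

masked_personal_info = [mask_row(r) for r in personal_info]
masked_event = [mask_row(r) for r in event]

# Each masked event row still finds its owner in the masked personal info table.
masked_keys = {row[0] for row in masked_personal_info}
assert all(row[0] in masked_keys for row in masked_event)
```

Deriving the replacement from the key rather than from a random draw is what keeps parent and child tables consistent when they are masked in separate runs.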

Customer Information

Patient No.  SSN          Name            Address            City    State  Zip
112233       123-45-6789  Amanda Winters  40 Bayberry Drive  Elgin   IL     60123
123456       333-22-4444  Erica Schafer   12 Murray Court    Austin  TX     78704

Data is masked with contextually correct data to preserve integrity of test data


Data Masking in-Hadoop leveraging MapReduce

•  Data Masking application can run natively in Hadoop clusters using the standard MapReduce technology for highly “scalable processing”

•  Support for masking delimited files in HDFS

•  Data masking libraries are exposed via Java API and invoked in the Reducer
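The flow described in the bullets above can be simulated in plain Python without a cluster. This is a rough sketch only: the slide says the masking libraries are exposed through a Java API, whereas `mask_digits` here is a hypothetical stand-in, and `run_job` merely imitates the shuffle step between the map and reduce phases. The column layout and the `|` delimiter are also assumptions.

```python
import hashlib
from collections import defaultdict

DELIMITER = "|"
SENSITIVE_COLUMNS = {1}  # assumption: column 1 holds the SSN


def mask_digits(value: str, secret: str = "demo-secret") -> str:
    """Hypothetical masking call: swap each digit, keep the layout (up to 64 digits)."""
    replacement = iter(hashlib.sha256((secret + value).encode("utf-8")).hexdigest())
    return "".join(str(int(next(replacement), 16) % 10) if ch.isdigit() else ch for ch in value)


def mapper(line: str):
    """Parse one delimited record and key it by its first column."""
    fields = line.rstrip("\n").split(DELIMITER)
    yield fields[0], fields


def reducer(key: str, records):
    """Mask the sensitive columns of every record that shares this key."""
    for fields in records:
        masked = [mask_digits(f) if i in SENSITIVE_COLUMNS else f for i, f in enumerate(fields)]
        yield DELIMITER.join(masked)


def run_job(lines):
    """Tiny in-process stand-in for the shuffle between the map and reduce phases."""
    grouped = defaultdict(list)
    for line in lines:
        for key, fields in mapper(line):
            grouped[key].append(fields)
    return [out for key in sorted(grouped) for out in reducer(key, grouped[key])]


records = ["112233|123-45-6789|Elgin|IL", "123456|333-22-4444|Austin|TX"]
masked_records = run_job(records)
```

Keeping the mapper free of masking logic mirrors the slide's statement that the masking call happens in the Reducer; only the parsing and keying are done on the map side.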

Optim Masking Application in Hadoop

[Diagram: data files are read by the MapReduce-based masking application (calling the masking APIs) inside the Hadoop cluster, which writes masked data files]

PureData for Hadoop (BigInsights): Masking in Hadoop MapReduce Application (18 node cluster)

Ø  2.95 million elements masked per second
Ø  2.56 billion elements masked in ~15 minutes

[Chart: elements masked per second (in millions) by number of records submitted]

# of records submitted               80 M   160 M   320 M   640 M   960 M   1.92 B   2.56 B
Elements masked per sec (millions)   1.14   1.45    1.72    1.98    2.05    2.95     2.88
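The two headline figures can be cross-checked with simple arithmetic. The sketch below assumes that the 2.88 value in the chart is the rate measured for the 2.56 billion run and that 2.95 is the peak across runs; the slide does not state this explicitly.

```python
elements_masked = 2.56e9      # "2.56 Billion Elements masked"
elapsed_seconds = 15 * 60     # "~15 minutes"

# Average rate implied by the headline figures, in millions of elements per second.
rate_millions_per_sec = elements_masked / elapsed_seconds / 1e6

# Assumption: 2.88 is the charted rate for the 2.56 B run.
charted_rate = 2.88
implied_minutes = elements_masked / (charted_rate * 1e6) / 60

print(f"{rate_millions_per_sec:.2f} M elements/s")  # 2.84 M elements/s
print(f"{implied_minutes:.1f} minutes")             # 14.8 minutes
```

Under that assumption the numbers are consistent: a rate of 2.88 million per second gives just under 15 minutes for 2.56 billion elements.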

For Web Logs, Clickstream Analysis

User IDs, Birth Date


For XML Data references

Before XML Document

<?xml version="1.0" encoding="utf-8"?>
<customers>
  <customer>
    <first_name>Bobby</first_name>
    <middle_initial>J</middle_initial>
    <last_name>Fudge</last_name>
    <address>
      <street>100 Fifth Avenue</street>
      <city>New York</city>
      <state>NY</state>
      <zip>10014</zip>
    </address>
    <ccn>5411116857029116</ccn>
    <telephone>1-609-156-5648</telephone>
    <email_address>[email protected]</email_address>
  </customer>
</customers>

After XML Document

<?xml version="1.0" encoding="utf-8"?>
<customers>
  <customer>
    <first_name>Bobby</first_name>
    <middle_initial>J</middle_initial>
    <last_name>Fudge</last_name>
    <address>
      <street>100 Fifth Avenue</street>
      <city>New York</city>
      <state>NY</state>
      <zip>10014</zip>
    </address>
    <ccn>5411110000000017</ccn>
    <telephone>1-609-321-7654</telephone>
    <email_address>[email protected]</email_address>
  </customer>
</customers>
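The masked card number in the sample keeps the first six digits of the original and still passes the Luhn checksum. The following Python sketch reproduces that behaviour under assumptions of its own: `mask_ccn`, `luhn_check_digit` and the sequence-number scheme are hypothetical and are not taken from Optim, and the sample document is shortened to the fields that matter here.

```python
import xml.etree.ElementTree as ET


def luhn_check_digit(partial: str) -> str:
    """Return the digit that makes partial + digit pass the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(partial)):
        digit = int(char)
        if position % 2 == 0:  # these digits are doubled once the check digit is appended
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def mask_ccn(ccn: str, sequence: int) -> str:
    """Keep the 6-digit prefix, replace the rest with a sequence number plus check digit."""
    body_length = len(ccn) - 7  # digits between the prefix and the check digit
    partial = ccn[:6] + f"{sequence:0{body_length}d}"
    return partial + luhn_check_digit(partial)


def mask_document(xml_text: str) -> str:
    """Mask every <ccn> element and leave all other elements unchanged."""
    root = ET.fromstring(xml_text)
    for sequence, node in enumerate(root.iter("ccn"), start=1):
        node.text = mask_ccn(node.text.strip(), sequence)
    return ET.tostring(root, encoding="unicode")


SAMPLE = (
    "<customers><customer><last_name>Fudge</last_name>"
    "<ccn>5411116857029116</ccn></customer></customers>"
)
masked_xml = mask_document(SAMPLE)
```

Preserving the prefix and the checksum means downstream validation code that checks card format continues to accept the masked value.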


For Data in NoSQL, Internet Commerce

Before

{
  name : "Matt Kalan",
  title : ["Account Manager", "Solutions Architect"],
  phone : "+1 347 688-5694",
  location : "New York, NY",
  email : "[email protected]",
  web : ["mongodb.com", "Mongodb.org"],
  linkedin : ["mkalan", "Mongodb"],
  twitter : ["@MatthewKalan", "@MongoDB", "@MongoDBInc"],
  facebook : ["MongoDB", "MongoDB, Inc."]
}

After

{
  name : "Matt Kalan",
  title : ["Account Manager", "Solutions Architect"],
  phone : "+1 347 654-1234",
  location : "New York, NY",
  email : "[email protected]",
  web : ["mongodb.com", "Mongodb.org"],
  linkedin : ["mkalan", "Mongodb"],
  twitter : ["@MatthewKalan", "@MongoDB", "@MongoDBInc"],
  facebook : ["MongoDB", "MongoDB, Inc."]
}
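Field-level masking of a document record like the one above might look as follows in Python. This is a sketch under assumptions: the `MASKERS` mapping, the `SECRET` value and both helper functions are hypothetical, the record is written as strict JSON rather than in the shell-style notation of the slide, and the email address is a placeholder because the original address is not legible in the source.

```python
import hashlib
import json

SECRET = "demo-secret"  # assumption: a private value that keeps the mapping hard to guess


def mask_phone(phone: str) -> str:
    """Keep the country and area code; replace digits in the last space-separated group."""
    head, separator, tail = phone.rpartition(" ")
    stream = iter(hashlib.sha256((SECRET + phone).encode("utf-8")).hexdigest())
    masked_tail = "".join(str(int(next(stream), 16) % 10) if ch.isdigit() else ch for ch in tail)
    return head + separator + masked_tail


def mask_email(email: str) -> str:
    """Replace the whole address with a stable pseudonym at a reserved example domain."""
    token = hashlib.sha256((SECRET + email).encode("utf-8")).hexdigest()[:8]
    return f"user_{token}@example.com"


MASKERS = {"phone": mask_phone, "email": mask_email}  # field name -> masking function


def mask_record(record: dict) -> dict:
    """Return a copy with the configured fields masked and all other fields untouched."""
    return {k: MASKERS[k](v) if k in MASKERS else v for k, v in record.items()}


record = json.loads(
    '{"name": "Matt Kalan", "phone": "+1 347 688-5694", '
    '"email": "someone@example.org", "location": "New York, NY"}'
)
masked_record = mask_record(record)
print(json.dumps(masked_record, indent=2))
```

As in the slide's example, the phone number keeps its country and area code so that location-based analytics on the masked data still work.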

For Call Data Records, Mobile Apps

Phone numbers, Call history, IMEI

Data Redaction

Protect sensitive unstructured data in documents, forms & text

Requirements

§  Protect unstructured data in textual, graphical and form based documents
§  Control data views with user role policies
§  Automate batch workflow process with optional human review

Benefits

§  Prevent unintentional data disclosure
§  Comply with regulatory and corporate compliance standards
§  Increase efficiency and reduce risk via automation

Date: April 12, 2007
Patient Name: John Smith
Date of Birth: June 05, 1962
Social Security Number: 035-01-1271
Ref No. MR 2335/324
Insurance Provider: Aetna

Background: Mr. John Smith was admitted to Sioux General Hospital at 05:15 AM on 12 April 2001, transferred from Brookdale Psychiatric Hospital after a fall as a result of a left-side weakness. …

Redact/Mask

For Text Logs, Mobile Apps or Customer Service Experience

Ability to parse unstructured, structured and semi-structured content:
- Voice to Text Logs
- Agent Notes
- Text Chats
- Social media feeds

Agent: “Mr Smith, let me verify the phone number associated with your account?” Customer: “408-555-1212” Agent: “Thank you. Let’s discuss the problem you are having with your iPhone 5 and the battery issue”…

Agent: “[NAME], let me verify the phone number associated with your account?” Customer: “[PHONE]” Agent: “Thank you. Let’s discuss the problem you are having with your iPhone 5 and the battery issue”…
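The transcript example above can be approximated with two regular expressions. This is a deliberately naive sketch: the patterns `PHONE` and `HONORIFIC_NAME` are assumptions made for this illustration, they only cover US-style phone numbers and names introduced by an honorific, and a production redaction engine would need far richer entity detection.

```python
import re

# Assumption: phone numbers appear as NNN-NNN-NNNN.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
# Assumption: names are introduced by an honorific such as "Mr" or "Dr.".
HONORIFIC_NAME = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.? [A-Z][a-z]+")


def redact(text: str) -> str:
    """Replace phone numbers and honorific-prefixed names with placeholder tokens."""
    text = PHONE.sub("[PHONE]", text)
    return HONORIFIC_NAME.sub("[NAME]", text)


transcript = (
    'Agent: "Mr Smith, let me verify the phone number associated with your account?" '
    'Customer: "408-555-1212"'
)
redacted = redact(transcript)
print(redacted)
```

Product names such as "iPhone 5" are left alone by these patterns, which matches the slide's example where only the name and the phone number are replaced.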

Hadoop Activity Monitoring

§  Protect sensitive information from misuse and fraud

§  Prevent data breaches and associated fines

§  Achieve better information governance & security

Monitor & Audit Key Hadoop events:

§  Session and User Information
§  HDFS Operations – Commands, Files, Permissions
§  MapReduce Jobs
§  Exceptions like authorization failures
§  Hive/HBase queries

Requirements

Benefits

Monitor and audit Hadoop activity in real-time to support compliance requirements and protect data

InfoSphere Guardium Collector Appliance

S-TAPs

•  Who is submitting specific requests?
•  What MapReduce jobs are they running?
•  Are jobs part of an authorized program?
•  Too many file permission exceptions?
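The last question, whether a user is generating too many file permission exceptions, amounts to a threshold rule over audit events. The sketch below is purely illustrative: the event dictionaries, the `permission_denied` status value and the threshold are invented for this example and do not reflect Guardium's event format or policy language.

```python
from collections import Counter

# Hypothetical audit events; field names and values are assumptions.
events = [
    {"user": "alice", "operation": "hdfs_open", "status": "ok"},
    {"user": "bob", "operation": "hdfs_open", "status": "permission_denied"},
    {"user": "bob", "operation": "hdfs_delete", "status": "permission_denied"},
    {"user": "bob", "operation": "hdfs_open", "status": "permission_denied"},
    {"user": "carol", "operation": "mapreduce_submit", "status": "ok"},
    {"user": "carol", "operation": "hdfs_open", "status": "permission_denied"},
]


def users_over_threshold(audit_events, threshold: int):
    """Return users with more than `threshold` permission failures, sorted by name."""
    failures = Counter(e["user"] for e in audit_events if e["status"] == "permission_denied")
    return sorted(user for user, count in failures.items() if count > threshold)


flagged = users_over_threshold(events, threshold=2)
print(flagged)  # ['bob']
```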

Hadoop



Organizations have been increasingly challenged with successfully managing data growth

Increasing Costs: The volume of growth impacts the warehouse capacity, where traditional strategies may not be enough

Poor Data Analysis Performance: Business users wait for analytic query responses; slow-performing business intelligence (BI) solutions impact business agility

Manage Risk & Compliance: Supporting the data retention and legal hold requirements for large volumes of data

Integrate big data and data warehouse capabilities to increase operational efficiency

Extend warehouse infrastructure
•  Optimize storage, maintenance and licensing costs by migrating rarely used data to Hadoop
•  Query-able access to data
•  Governance and policy-driven archiving

Challenges

ü Are you drowning in very large data sets (TBs to PBs)?

ü Do you use your warehouse environment as a repository for ALL data?

ü Do you have a lot of cold, or inactive data in your database?

ü Are you having to throw data away because you’re unable to store or process it?

ü Are you interested in using your data for traditional and new types of analytics?

Data Warehouse Augmentation – Queryable Archive

Data Archiving

InfoSphere Optim

Hadoop

Archive data into storage of choice. Manage data growth, lower cost & meet retention compliance.

Archive files: Archive/Purge, Heterogeneous, Compressed, Immutable

Query-able & analytical store

• Capture complete business object
• Preserve Data Integrity
• Preserve Schema Metadata
• Apply Retention / Hold Policies
• Load data into Hadoop for analytics

§  Reduce hardware, storage and maintenance costs of traditional dbms’s

§  Improve performance of traditional systems by offloading inactive data

§  Data access from Hadoop’s query-able/analytical store

§  Discover, archive, query, retain and purge data per business policies

§  Native connectivity, complete business objects, referential integrity

§  Augment data warehouses & offload cold data to lower cost platform

Requirements

Benefits

IMS, VSAM, More…

Archive/Offload data into Hadoop

Manage data growth, Lower TCO & Meet data retention compliance

ü Apply Retention / Hold Policies
ü Capture complete business object
ü Preserve Data Integrity
ü Preserve Schema Metadata
ü Load data into Hadoop as needed
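How retention and hold policies interact when deciding what to archive or purge can be sketched as a small decision function. All thresholds and names here are assumptions chosen for illustration (one year of inactivity makes data cold, seven years is the retention period); they are not Optim defaults.

```python
from datetime import date

ARCHIVE_AFTER_DAYS = 365   # assumption: data untouched for a year counts as cold
RETENTION_DAYS = 7 * 365   # assumption: "preserve data for N years" with N = 7


def plan_action(last_activity: date, legal_hold: bool, today: date) -> str:
    """Return 'keep', 'archive' or 'purge' for one business object."""
    age_days = (today - last_activity).days
    if age_days > RETENTION_DAYS:
        # A legal hold overrides the retention period: held data stays archived.
        return "archive" if legal_hold else "purge"
    if age_days > ARCHIVE_AFTER_DAYS:
        return "archive"
    return "keep"


today = date(2013, 11, 4)
print(plan_action(date(2013, 6, 1), False, today))  # keep
print(plan_action(date(2011, 1, 1), False, today))  # archive
print(plan_action(date(2005, 1, 1), False, today))  # purge
print(plan_action(date(2005, 1, 1), True, today))   # archive
```

Evaluating the hold flag before purging is the important design point: data past its retention date is removed only when no hold applies.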

Archive Cold Data

Query-able analytical data store, using Hadoop

Archive & Purge Data

InfoSphere Optim

Compressed, immutable, auditable & restorable archives

Database

IMS VSAM More…

Archive files Hadoop

SQL Access

Data Warehouse

Data Warehouse Augmentation Architecture Overview

[Architecture diagram. Components shown: Sources; Optim Data Growth (Archive, Retrieve); BigInsights Cluster; BigSheets; BigSQL; Social Data Analytics; Machine Data Analytics; Streams; DataStage Optimization / JAQL; Data Explorer; Discovery; Marts; Decision Support; Operational Business Intelligence; Reporting & Performance Management; Modeling, Analytics & Simulation; Information Governance, Metadata, Data Lineage]


Maximize the business value of data

Archive

Production Data Warehouse [Hot Data]

Archive Data Warehouse [Warm Data]

Data Archives [Cold Data]

Reduce Costs: Reduce total cost of ownership of data warehouse by intelligently archiving and compressing historical data

Improve Performance: Increase data warehouse performance by archiving dormant data, leveraging a tiered storage strategy

Minimize Risk: Support data retention needs, as well as legal hold requirements within the data warehouse

Aging Data Archive Data

IBM InfoSphere Optim


Hadoop


Data Warehouse Augmentation: Queryable Archive

Use Cases

§  Immediate storage alternative of cold data

§  Cost savings for cold data

§  Compliance requirements

§  Simple analytics / exploration

§  When you find new correlations, go back and re-mine the archive data to gain additional insight

Enables an immediate storage alternative. A queryable archive often serves as an initial step toward more advanced integration with the EDW and advanced Hadoop analytics.

PureData System for Analytics

PureData System for Hadoop


§  Included application allows migration of data from PureData System for Analytics to PureData System for Hadoop at over 2TB/hr, out-of-the-box

§  Provides simple, built-in user interface to allow users to migrate data between systems easily

§  Enables quick configuration and scheduling of data migration
§  Employs parallel processing between BigInsights and PDA/Netezza
§  Leverages IBM-developed MapReduce programming for parallel processing
§  Utilizes Hive to allow for immediate access to migrated data
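For planning purposes, the quoted rate of over 2 TB/hr translates into a rough upper bound on migration time. The sketch below treats 2 TB/hr as a floor, which is an assumption: actual throughput depends on the cluster, the network and the data, and `estimate_hours` is a hypothetical helper.

```python
RATE_TB_PER_HOUR = 2.0  # "over 2TB/hr" on the slide, used here as a conservative floor


def estimate_hours(size_tb: float, rate_tb_per_hour: float = RATE_TB_PER_HOUR) -> float:
    """Estimated migration time in hours for a data set of the given size."""
    return size_tb / rate_tb_per_hour


for size_tb in (1, 20, 100):
    print(f"{size_tb:>4} TB: about {estimate_hours(size_tb):.1f} h or less")
```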

Optim EasyArchive for PureData System for Hadoop For Easy Data Provisioning from PureData System for Analytics



IBM InfoSphere Optim Test Data Management

Requirements

Benefits

•  Deploy new functionality more quickly and with improved quality

•  Easily create & maintain test environments

•  Protect sensitive information from misuse & fraud with data masking

•  Accelerate test data provisioning through refresh & automation

•  Create referentially intact, “right-sized” test databases

•  Compare data across dev/test iterations to identify hidden errors

•  Protect confidential data used in test, training & development

•  Shorten iterative testing cycles and accelerate time to market

Create “right-size” environments with realistic data for application testing & development

Test Data Management

[Diagram: Production or Production Clone (relational data sets) feeds Development, Unit Test, Integration Test and UAT environments through Subset, Mask, Refresh and Compare. Data sizes shown: 20 TB, 1 TB, 200 GB, 100 GB, 20 GB]

Test Data Management & Masking in warehouse environments

ü  Create or refresh targeted, “right-sized” subset test database more efficiently
ü  Mask sensitive/confidential fields in-flight or in-place
ü  Deploy multiple BI/Analytics/ETL test databases quickly when required
ü  Maintain data referential integrity
ü  Compare data across dev iterations & ETL transformations to test & validate faster
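A referentially intact subset means that every child row kept in the test database still has its parent. The Python sketch below shows the principle on two in-memory tables; the table layout, the column names and the helpers `subset` and `orphans` are assumptions made for this illustration rather than Optim functionality.

```python
def subset(customers, orders, keep_customer):
    """Select parent rows by predicate, then only the child rows that reference them."""
    kept_customers = [c for c in customers if keep_customer(c)]
    kept_ids = {c["id"] for c in kept_customers}
    kept_orders = [o for o in orders if o["customer_id"] in kept_ids]
    return kept_customers, kept_orders


def orphans(customers, orders):
    """Child rows whose parent is missing; an intact subset has none."""
    ids = {c["id"] for c in customers}
    return [o for o in orders if o["customer_id"] not in ids]


customers = [
    {"id": 1, "state": "IL"},
    {"id": 2, "state": "TX"},
    {"id": 3, "state": "IL"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},
    {"order_id": 13, "customer_id": 1},
]

test_customers, test_orders = subset(customers, orders, lambda c: c["state"] == "IL")
assert orphans(test_customers, test_orders) == []
```

Selecting parents first and following the foreign key to the children is what keeps the smaller test database usable by the application without constraint violations.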

Production Environment Non-Production

TEST

DEV

QA

ü Extract ü Subset ü Mask ü Load ü Refresh ü Compare

InfoSphere Optim

Data Extract files

Improve PDA/Netezza DW test data delivery

Test Environment

Development Environment

Production Environment

“Masked” Gold Master

Subset & Mask

Subset/ Compare/ Refresh

Subset/ Compare/ Refresh



Archive Archive

Archive

Reduce Costs: Automate creation of realistic “right sized” test data to reduce the number of defects caught late in the test cycle

Reduce Risk: Mask sensitive information for protection and for compliance with global and industry regulations

Speed Delivery: Refresh test data easily to speed the testing and delivery of the data warehouse



IBM InfoSphere Optim solves key data challenges

Identify Relevant & Sensitive Data: Find what data must be retained, protected or removed

Optimize Test Data: Automate and optimize the application test processes that rely on data

Dispose of Unnecessary Data: Remove unnecessary data from critical transactional or analytics applications

Retain Essential Data: Historical inactive data is safely retained while easily accessible for reports and compliance

Protect Sensitive Data: Private data (customer IDs, credit cards and financial data) is masked or redacted

Outcomes: ↓ Costs, ↓ Data Security Risk, ↑ Availability, ↑ Application Performance, ↑ Speed to make changes

Thank You Your feedback is important!

• Access the Conference Agenda Builder to complete your session surveys

o  Any web or mobile browser at http://iod13surveys.com/surveys.html

o  Any Agenda Builder kiosk onsite
