Enabling Big Data with InfoSphere Optim
Session # ILM-1742A
Vineet Goel, IBM; Guenter Sauter, IBM [Product Management]
Please note IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2013. All rights reserved.
• U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, InfoSphere, and Optim are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
Agenda
1. Need for Governing Big Data
2. Data Privacy for Big Data
3. Lifecycle Management for Big Data
4. Test Data Management for Big Data
5. Review
Acting on Insight Requires Confidence in Data
Make Informed Decisions
Uncover competitive advantages
Identify new opportunities
Take Bigger, Calculated Risks
Automated Integration: Rapid, easy access to big data, wherever it resides
Visual Context: Easy categorization, indexing, discovery of big data to optimize its usage
Agile Governance: Definition and execution of governance appropriate to data value and intended use
Information Integration & Governance for Big Data
IBM Information Integration and Governance portfolio for Big Data
Information Integration & Governance
Data Warehouse
Stream Computing
Hadoop System
Discovery Application Development
Systems Management
BIG DATA PLATFORMS
InfoSphere Information Server: Understand, integrate, deliver and govern data across information systems
InfoSphere Master Data Management: Act on trusted views of your master data to improve your critical business processes
InfoSphere Optim: Manage information through its lifecycle while meeting data privacy & retention compliance
InfoSphere Guardium: Monitor, protect and audit enterprise data to ensure security and compliance
Data Quality, MDM, Privacy & Security, Data Lifecycle, Information Integration
Open Architecture/ Multiple Product Entry Points
IBM Big Data and Analytics Reference Architecture
Diagram labels: Information Ingestion and Integration; Data Exploration; Archive; Real-time Analytics; Enterprise Warehouse; Data Marts; Information Governance, Security and Business Continuity
Big Data life cycle – from raw to production
search & survey (business users with an idea, power users, data analysts)
• text search
• simple investigations
• peek / poke
exploratory analysis (data scientist / data miner, advanced business user, application developer)
• from mountain of data into a structured world with apps to provide business value
• iterative in nature, many false starts, needs many skill sets/people
operational (traditional IT / application developer)
• creating/standing up applications, processes, systems with enterprise characteristics
• more formal environment, SLAs, etc.
Requirements change over the course of the life cycle
Initial / exploratory use cases            Used for business decisions
Little security concerns                   Protect, Secure, Encrypt
Sporadic change management                 Audit trail tracking access & changes
No data retention requirements             Preserve data for N years
Little to no regulation                    Legislated requirements
No / isolated data quality concerns        Data quality imperatives
Sources of information are “interesting”   Sources must be trusted
Big Data Best practice processes
It’s not all about technology…
People
Process
Technology
Information Integration & Governance Technology
IBM Information Governance Unified Process
IBM Governance Process
Overview: IBM InfoSphere Optim
Discovery (Discover, Understand, Classify)
• Accelerate data management projects by discovering complex data relationships & sensitive data elements in your data assets
Data Archiving (Archive, Retire)
• Archive cold data to improve application performance & streamline backups
• Reduce hardware, software, storage & maintenance costs for enterprise applications
• Support data retention regulations & safely retire legacy/redundant applications
Test Data Management (Mask, Subset, Compare, Refresh)
• Reduce cost, reduce risk and speed application delivery by provisioning right-sized test environments
• Ensure compliance & privacy with test data masking
Data Masking
• Ensure data privacy compliance by masking sensitive data
Environments: Production, Dev/Test, Archive
Data Privacy Challenges & Considerations
• Customers take “data privacy” seriously!
• Organizations need to de-identify, mask and transform sensitive data in data environments to avoid data breaches
• Risks include privileged access misuse, data theft, and data movement across data centers or hosted environments, outside contractors or offshore project teams
• Apply transformation techniques to substitute sensitive data with contextually accurate but fictionalized data to produce accurate test results
• Support compliance with local, state, national, international and industry-based privacy regulations
Keeping up with Global & Industry Regulations
Canada: Personal Information Protection & Electronic Documents Act
USA: Federal, Financial & Healthcare Industry Regulations & State Laws
Mexico: E-Commerce Law
Colombia: Political Constitution – Article 15
Brazil: Constitution, Habeas Data & Code of Consumer Protection & Defense
Chile: Protection of Personal Data Act
Argentina: Habeas Data Act
South Africa: Promotion of Access to Information Act
United Kingdom: Data Protection Act
EU: Data Protection Directive
Switzerland: Federal Law on Data Protection
Germany: Federal Data Protection Act & State Laws
Poland: Polish Constitution
Israel: Protection of Privacy Law
Pakistan: Banking Companies Ordinance
Russia: Computerization & Protection of Information / Participation in Int’l Info Exchange
China: Commercial Banking Law
Korea: 3 Acts for Financial Data Privacy
Hong Kong: Privacy Ordinance
Taiwan: Computer-Processed Personal Data Protection Law
Japan: Guidelines for the Protection of Computer Processed Personal Data
India: SEC Board of India Act
Vietnam: Banking Law
Philippines: Secrecy of Bank Deposit Act
Australia: Federal Privacy Amendment Bill
Singapore: Monetary Authority of Singapore Act
Indonesia: Bank Secrecy Regulation 8
New Zealand: Privacy Act
Industry regulations like: PCI-DSS, HIPAA, GLB
PII such as: Names, Account #, CCN, SSN, DOB, Addresses, Driving Lic, IP Address, Medical, Telephone #
New IBM Offering: InfoSphere Data Privacy for Hadoop
Business Info Exchange: Discover, Define & Collaborate
• Share business glossary, privacy policies, project blueprints
• Explore data lineage
• Discover relationships & sensitive data
Optim & Redaction: Mask & Protect
• Protect structured and unstructured data
• De-identify sensitive data at source or within Hadoop
Guardium: Monitor, Audit & Secure
• Monitor & audit activities in Hadoop
• Enforce security policies
• Centralized reporting of audit data
Business Information Exchange
Collaborate on big data reference architecture and define a common business language
Requirements
• Facilitate business & IT communications via a common business vocabulary
• Specify information governance policies and rules
• Understand where data comes from and where it goes
Benefits
• Facilitates collaboration on reference architectures, leveraging the same vocabulary
• Aligns the efforts of IT with goals of the business
What is data masking?
• Definition: A method for creating a structurally similar but inauthentic version of an organization's data. The purpose is to protect the actual data while having a functional substitute for occasions when the real data is not required.
• Requirement: Effective data masking requires data to be altered in such a way that the actual values cannot be determined or reverse-engineered, while functional appearance is maintained.
• Other terms used: Obfuscation, scrambling, data de-identification
• Commonly masked data types: Name, address, telephone, SSN/national identity number, credit card number
• Methods
  o Static Masking: Obfuscating data values that ultimately get persisted in the updated database. Often rows are moved and masked as a single operation, though data may be updated in place.
  o Dynamic Masking: Masks specific data elements on the fly without modifying the applications or physical production data store.
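The difference between the two methods can be sketched in a few lines of Python. This is an illustration only, not Optim code: `mask_ssn`, `static_mask` and `DynamicView` are hypothetical names, and the mask simply replaces every digit with `X` while keeping the format.

```python
import copy


def mask_ssn(ssn: str) -> str:
    """Illustrative mask: replace every digit, keep separators and length."""
    return "".join("X" if ch.isdigit() else ch for ch in ssn)


def static_mask(rows):
    """Static masking: build a masked copy that is then persisted elsewhere."""
    masked = copy.deepcopy(rows)
    for row in masked:
        row["ssn"] = mask_ssn(row["ssn"])
    return masked


class DynamicView:
    """Dynamic masking: stored rows stay untouched; values are masked on read."""

    def __init__(self, rows, privileged=False):
        self._rows = rows
        self._privileged = privileged

    def read(self, index):
        row = dict(self._rows[index])  # copy, so the store is never modified
        if not self._privileged:
            row["ssn"] = mask_ssn(row["ssn"])
        return row


production = [{"name": "Amanda Winters", "ssn": "123-45-6789"}]
test_copy = static_mask(production)   # masked values are what gets stored
view = DynamicView(production)        # masked values exist only in the result
```

In the static case the masked copy is the data set that downstream users work with; in the dynamic case the production store keeps the real value and only unprivileged reads see the mask.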
Data Masking
InfoSphere Optim
Anonymize sensitive information used in Hadoop with realistic but fictional data
Mask at the source, mask in-flight, or mask in Hadoop (MapReduce)
Benefits
• Protect sensitive information (PII) from misuse, fraud and data breaches
• Protect confidential data while preserving analytics
• Achieve better information governance & regulatory compliance
Requirements
• Mask data in DBMS, delimited text files, or in ETL
• Mask sensitive data in Hadoop using MapReduce
• Proven masking algorithms
• Callable masking APIs
Diagram labels: Before Masking, After Masking; sources include CSV, Hadoop, More…
Example 1: Personal Info Table
Before masking
PersNbr  FirstName  LastName
08054    Alice      Bennett
19101    Carl       Davis
27645    Elliot     Flynn
After masking
PersNbr  FirstName  LastName
10000    Jeanne     Renoir
10001    Claude     Monet
10002    Pablo      Picasso

Example 2: Event Table
Before masking
PersNbr  FstNEvtOwn  LstNEvtOwn
27645    Elliot      Flynn
27645    Elliot      Flynn
After masking
PersNbr  FstNEvtOwn  LstNEvtOwn
10002    Pablo       Picasso
10002    Pablo       Picasso
InfoSphere Optim Data Masking Techniques
A comprehensive set of data masking techniques to transform or de-identify data, including:
• String literal values
• Random or sequential numbers
• Lookup / Hashing
• Credit cards
• Arithmetic expressions
• Concatenate or substring
• Format-preserving
• National ID / SSN
• Shuffling
• Date variance
• User defined
• Email
Referential integrity is maintained with key propagation
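A minimal Python sketch of key propagation, assuming a deterministic mapping is acceptable: each original key gets one replacement key, and replacement names are picked by hashing the original key, so every table that carries the same key is masked identically. The helper names, the lookup lists and the `demo-secret` value are assumptions for illustration and do not describe Optim's own algorithms.

```python
import hashlib

# Hypothetical lookup tables of replacement values.
FIRST_NAMES = ["Jeanne", "Claude", "Pablo", "Berthe", "Edgar"]
LAST_NAMES = ["Renoir", "Monet", "Picasso", "Morisot", "Degas"]


def _bucket(value: str, secret: str, size: int) -> int:
    """Map a value to a stable index in [0, size) using a keyed hash."""
    digest = hashlib.sha256((secret + value).encode("utf-8")).hexdigest()
    return int(digest, 16) % size


def build_key_map(keys, start=10000):
    """Assign each distinct original key one new sequential key."""
    return {key: str(start + i) for i, key in enumerate(sorted(set(keys)))}


def mask_person(pers_nbr, key_map, secret="demo-secret"):
    """Return (masked key, masked first name, masked last name)."""
    first = FIRST_NAMES[_bucket(pers_nbr + "first", secret, len(FIRST_NAMES))]
    last = LAST_NAMES[_bucket(pers_nbr + "last", secret, len(LAST_NAMES))]
    return key_map[pers_nbr], first, last


personal_info = [("08054", "Alice", "Bennett"),
                 ("19101", "Carl", "Davis"),
                 ("27645", "Elliot", "Flynn")]
events = [("27645", "Elliot", "Flynn"), ("27645", "Elliot", "Flynn")]

key_map = build_key_map(row[0] for row in personal_info)
masked_personal = [mask_person(row[0], key_map) for row in personal_info]
masked_events = [mask_person(row[0], key_map) for row in events]
```

Because both tables are masked from the same original key, the masked Event rows still join to exactly one masked Personal Info row.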
Customer Information
Patient No.  SSN          Name            Address            City    State  Zip
112233       123-45-6789  Amanda Winters  40 Bayberry Drive  Elgin   IL     60123
123456       333-22-4444  Erica Schafer   12 Murray Court    Austin  TX     78704
Data is masked with contextually correct data to preserve integrity of test data
Data Masking in-Hadoop leveraging MapReduce
• The data masking application can run natively in Hadoop clusters, using standard MapReduce technology for highly scalable processing
• Support for masking delimited files in HDFS
• Data masking libraries are exposed via Java API and invoked in the Reducer
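The flow described in these bullets can be imitated in plain Python to show where the masking call sits. This is a single-process sketch, not Hadoop code: the delimiter, the choice of masked column, and the functions `map_phase`, `reduce_phase`, `run_job` and `mask_value` are all assumptions, and `mask_value` stands in for whatever masking library the reducer would call.

```python
import hashlib
from collections import defaultdict

DELIMITER = "|"          # assumed field delimiter
MASKED_COLUMNS = {1}     # assumed: column 1 holds the sensitive value


def mask_value(value: str) -> str:
    """Stand-in for a masking library call: stable 8-character token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


def map_phase(lines, num_partitions):
    """Mapper: emit (partition, record) so records spread across reducers."""
    for line_no, line in enumerate(lines):
        yield line_no % num_partitions, (line_no, line)


def reduce_phase(records):
    """Reducer: mask the selected columns of every record it receives."""
    for line_no, line in records:
        fields = line.split(DELIMITER)
        for col in MASKED_COLUMNS:
            fields[col] = mask_value(fields[col])
        yield line_no, DELIMITER.join(fields)


def run_job(lines, num_partitions=2):
    """Simulate shuffle and reduce, then restore the input order."""
    partitions = defaultdict(list)
    for key, record in map_phase(lines, num_partitions):
        partitions[key].append(record)
    output = []
    for key in sorted(partitions):
        output.extend(reduce_phase(partitions[key]))
    return [line for _, line in sorted(output)]


lines = ["1|123-45-6789|IL", "2|333-22-4444|TX", "3|123-45-6789|IL"]
result = run_job(lines)
```

Since the token depends only on the input value, the same sensitive value is masked to the same token regardless of which reducer handles it.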
Diagram labels: Hadoop Cluster; Data files; MapReduce-based Masking Application; APIs; Masked data files
Optim Masking Application in Hadoop
• Up to 2.95 million elements masked per second
• 2.56 billion elements masked in ~15 minutes
Chart: PureData for Hadoop (BigInsights): Masking in Hadoop MapReduce Application
Platform: PureData System for Hadoop
(values paired in the order extracted, left to right)
# of records submitted   Elements masked per sec (in millions)
80 M                     1.14
160 M                    1.45
320 M                    1.72
640 M                    1.98
960 M                    2.05
1.92 B                   2.95
2.56 B                   2.88
18 node cluster
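The headline figures can be cross-checked with simple arithmetic. The sketch below makes two assumptions: the 2.56 B run corresponds to the 2.88 million elements per second reading (the last value in the chart), and the count of records submitted can be treated as the count of elements masked. The per-node figure is just the peak rate divided evenly across the 18 nodes.

```python
def minutes_to_mask(elements: float, elements_per_second: float) -> float:
    """Elapsed minutes to mask `elements` at a constant rate."""
    return elements / elements_per_second / 60


# Largest run at the rate assumed to belong to that run.
largest_run_minutes = minutes_to_mask(2.56e9, 2.88e6)

# Same volume at the peak rate quoted on the slide.
peak_rate_minutes = minutes_to_mask(2.56e9, 2.95e6)

# Peak rate spread evenly over the 18-node cluster.
per_node_rate = 2.95e6 / 18

print(f"{largest_run_minutes:.1f} min, {peak_rate_minutes:.1f} min, "
      f"{per_node_rate:,.0f} elements/s per node")
```

Both estimates land a little under 15 minutes, which is consistent with the "~15 minutes" claim.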
For Web Logs, Clickstream Analysis
User IDs, Birth Date
For XML Data references

Before XML Document
<?xml version="1.0" encoding="utf-8"?>
<customers>
  <customer>
    <!-- All Valid and Present -->
    <first_name>Bobby</first_name>
    <middle_initial>J</middle_initial>
    <last_name>Fudge</last_name>
    <address>
      <street>100 Fifth Avenue</street>
      <city>New York</city>
      <state>NY</state>
      <zip>10014</zip>
    </address>
    <ccn>5411116857029116</ccn>
    <telephone>1-609-156-5648</telephone>
    <email_address>[email protected]</email_address>
  </customer>
</customers>

After XML Document
<?xml version="1.0" encoding="utf-8"?>
<customers>
  <customer>
    <!-- All Valid and Present -->
    <first_name>Bobby</first_name>
    <middle_initial>J</middle_initial>
    <last_name>Fudge</last_name>
    <address>
      <street>100 Fifth Avenue</street>
      <city>New York</city>
      <state>NY</state>
      <zip>10014</zip>
    </address>
    <ccn>5411110000000017</ccn>
    <telephone>1-609-321-7654</telephone>
    <email_address>[email protected]</email_address>
  </customer>
</customers>
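The masked card number in the "After" document keeps the first six digits of the original and still passes the Luhn check. One way to produce a value with those properties is sketched below in Python: keep the six-digit prefix, fill the middle with a sequence number, and recompute the check digit. This is an assumption about how such a value could be generated, not a description of Optim's credit card algorithm; `luhn_check_digit` and `mask_ccn` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET


def luhn_check_digit(body: str) -> str:
    """Compute the Luhn check digit for a number that lacks its last digit."""
    total = 0
    for i, ch in enumerate(reversed(body)):
        digit = int(ch)
        if i % 2 == 0:          # rightmost body digit is doubled first
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def mask_ccn(ccn: str, sequence: int) -> str:
    """Keep the 6-digit prefix, replace the rest, keep length and validity."""
    prefix = ccn[:6]
    filler = str(sequence).zfill(len(ccn) - 7)
    body = prefix + filler
    return body + luhn_check_digit(body)


doc = ET.fromstring(
    "<customers><customer><ccn>5411116857029116</ccn></customer></customers>"
)
for seq, node in enumerate(doc.iter("ccn"), start=1):
    node.text = mask_ccn(node.text.strip(), seq)
masked_xml = ET.tostring(doc, encoding="unicode")
```

With sequence number 1, this sketch yields 5411110000000017 for the sample card, the same value shown in the "After" document.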
For Data in NoSQL, Internet Commerce

Before masking
{
  name : "Matt Kalan",
  title : ["Account Manager", "Solutions Architect"],
  phone : "+1 347 688-5694",
  location : "New York, NY",
  email : "[email protected]",
  web : ["mongodb.com", "Mongodb.org"],
  linkedin : ["mkalan", "Mongodb"],
  twitter : ["@MatthewKalan", "@MongoDB", "@MongoDBInc"],
  facebook : ["MongoDB", "MongoDB, Inc."]
}

After masking
{
  name : "Matt Kalan",
  title : ["Account Manager", "Solutions Architect"],
  phone : "+1 347 654-1234",
  location : "New York, NY",
  email : "[email protected]",
  web : ["mongodb.com", "Mongodb.org"],
  linkedin : ["mkalan", "Mongodb"],
  twitter : ["@MatthewKalan", "@MongoDB", "@MongoDBInc"],
  facebook : ["MongoDB", "MongoDB, Inc."]
}
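In the sample document only the trailing digits of the phone number change, while the country and area code, spacing and hyphen stay in place. A hedged Python sketch of that kind of format-preserving change on a nested document follows; `mask_phone` and `mask_document` are hypothetical helpers, and replacing exactly the last seven digits is an assumption drawn from the sample.

```python
import hashlib


def mask_phone(phone: str) -> str:
    """Replace the last seven digits, leaving every other character as is."""
    digest = hashlib.sha256(phone.encode("utf-8")).hexdigest()
    replacement = iter(str(int(digest, 16))[-7:])  # seven decimal digits
    chars = list(phone)
    remaining = 7
    for i in range(len(chars) - 1, -1, -1):
        if remaining and chars[i].isdigit():
            chars[i] = next(replacement)
            remaining -= 1
    return "".join(chars)


def mask_document(node):
    """Walk dicts and lists; mask string values stored under the key 'phone'."""
    if isinstance(node, dict):
        return {
            key: mask_phone(value)
            if key == "phone" and isinstance(value, str)
            else mask_document(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [mask_document(item) for item in node]
    return node


profile = {
    "name": "Matt Kalan",
    "phone": "+1 347 688-5694",
    "web": ["mongodb.com", "Mongodb.org"],
}
masked_profile = mask_document(profile)
```

The walk returns a new document, so the original stays available for comparison.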
For Call Data Records, Mobile Apps
Phone numbers, Call history, IMEI
Data Redaction
Protect sensitive unstructured data in documents, forms & text
Requirements
• Protect unstructured data in textual, graphical and form-based documents
• Control data views with user role policies
• Automate batch workflow process with optional human review
Benefits
• Prevent unintentional data disclosure
• Comply with regulatory and corporate compliance standards
• Increase efficiency and reduce risk via automation
Sample document to redact/mask:
Date: April 12, 2007
Patient Name: John Smith
Date of Birth: June 05, 1962
Social Security Number: 035-01-1271
Ref No. MR 2335/324
Insurance Provider Aetna
Background: Mr. John Smith was admitted to Sioux General Hospital at 05:15 AM on 12 April 2001, transferred from Brookdale Psychiatric Hospital after a fall as a result of a left-side weakness. …
For Text Logs, Mobile Apps or Customer Service Experience
Ability to parse unstructured, structured and semi-structured content:
- Voice to Text Logs
- Agent Notes
- Text Chats
- Social media feeds
Agent: “Mr Smith, let me verify the phone number associated with your account?” Customer: “408-555-1212” Agent: “Thank you. Let’s discuss the problem you are having with your iPhone 5 and the battery issue”…
Agent: “[NAME], let me verify the phone number associated with your account?” Customer: “[PHONE]” Agent: “Thank you. Let’s discuss the problem you are having with your iPhone 5 and the battery issue”…
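A rough Python sketch of the substitution shown in the two transcripts, using regular expressions. The patterns are assumptions chosen to fit this sample (a US-style phone number and a courtesy title followed by a surname); real redaction of free text generally needs much broader entity detection than two patterns.

```python
import re

# Assumed patterns, tuned only to the sample transcript.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
NAME_PATTERN = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.? [A-Z][a-z]+")


def redact(text: str) -> str:
    """Replace phone numbers and titled surnames with placeholder tags."""
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return NAME_PATTERN.sub("[NAME]", text)


transcript = 'Agent: "Mr Smith, let me verify" Customer: "408-555-1212"'
redacted = redact(transcript)
```

Placeholder tags keep the conversation readable for analytics while removing the identifying values.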
Hadoop Activity Monitoring
Monitor and audit Hadoop activity in real-time to support compliance requirements and protect data
Benefits
• Protect sensitive information from misuse and fraud
• Prevent data breaches and associated fines
• Achieve better information governance & security
Requirements
Monitor & audit key Hadoop events:
• Session and user information
• HDFS operations: commands, files, permissions
• MapReduce jobs
• Exceptions like authorization failures
• Hive/HBase queries
• Who is submitting specific requests?
• What MapReduce jobs are they running?
• Are jobs part of an authorized program?
• Too many file permission exceptions?
Diagram labels: Hadoop; S-TAPs; InfoSphere Guardium Collector Appliance
Organizations have been increasingly challenged with successfully managing data growth
Increasing Costs: The volume of growth impacts the warehouse capacity, where traditional strategies may not be enough
Poor Data Analysis Performance: Business users wait for analytic query responses; slow-performing business intelligence (BI) solutions impact business agility
Manage Risk & Compliance: Supporting the data retention and legal hold requirements for large volumes of data
Data Warehouse Augmentation – Queryable Archive
Integrate big data and data warehouse capabilities to increase operational efficiency
Extend warehouse infrastructure
• Optimize storage, maintenance and licensing costs by migrating rarely used data to Hadoop
• Query-able access to data
• Governance and policy-driven archiving
Challenges
✓ Are you drowning in very large data sets (TBs to PBs)?
✓ Do you use your warehouse environment as a repository for ALL data?
✓ Do you have a lot of cold, or inactive data in your database?
✓ Are you having to throw data away because you’re unable to store or process it?
✓ Are you interested in using your data for traditional and new types of analytics?
Data Archiving
InfoSphere Optim
Archive data into storage of choice. Manage data growth, lower cost & meet retention compliance.
Archive files: Archive/Purge, Heterogeneous, Compressed, Immutable
Hadoop: Query-able & analytical store
• Capture complete business object
• Preserve Data Integrity
• Preserve Schema Metadata
• Apply Retention / Hold Policies
• Load data into Hadoop for analytics
Benefits
• Reduce hardware, storage and maintenance costs of traditional DBMSs
• Improve performance of traditional systems by offloading inactive data
• Data access from Hadoop’s query-able/analytical store
Requirements
• Discover, archive, query, retain and purge data per business policies
• Native connectivity, complete business objects, referential integrity
• Augment data warehouses & offload cold data to lower cost platform
Sources: IMS, VSAM, More…
Archive/Offload data into Hadoop
Manage data growth, lower TCO & meet data retention compliance
✓ Apply Retention / Hold Policies
✓ Capture complete business object
✓ Preserve Data Integrity
✓ Preserve Schema Metadata
✓ Load data into Hadoop as needed
Diagram labels: Database, IMS, VSAM, More…; Data Warehouse; InfoSphere Optim (Archive Cold Data, Archive & Purge Data); Archive files (compressed, immutable, auditable & restorable archives); Hadoop (query-able analytical data store, SQL access)
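The interaction between a retention period and a legal hold can be shown with a small Python sketch. The rule used here is an assumption made for illustration: an archived object becomes eligible for purge once its retention period has elapsed, unless a hold is active. `ArchivedObject` and `purge_eligible` are hypothetical names and do not come from Optim.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ArchivedObject:
    object_id: str
    archived_on: date
    retention_years: int
    legal_hold: bool = False


def purge_eligible(obj: ArchivedObject, today: date) -> bool:
    """True once retention has elapsed and no legal hold is active."""
    if obj.legal_hold:
        return False
    target_year = obj.archived_on.year + obj.retention_years
    try:
        expires = obj.archived_on.replace(year=target_year)
    except ValueError:
        # Archived on Feb 29 and the target year is not a leap year.
        expires = obj.archived_on.replace(year=target_year, day=28)
    return today >= expires


today = date(2013, 11, 1)
expired = ArchivedObject("orders-2005", date(2005, 1, 1), retention_years=7)
held = ArchivedObject("claims-2005", date(2005, 1, 1), 7, legal_hold=True)
active = ArchivedObject("orders-2010", date(2010, 1, 1), retention_years=7)
```

Checking the hold first means a held object is never purged, no matter how long ago its retention period ended.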
Data Warehouse Augmentation Architecture Overview
Diagram labels: Sources; Optim Data Growth (Archive, Retrieve); BigInsights; Cluster; BigSheets; BigSQL; Social Data Analytics; Machine Data Analytics; DataStage Optimization / JAQL; Streams; Discovery; Data Explorer; Marts; Decision Support; Operational Business Intelligence; Reporting & Performance Management; Modeling, Analytics & Simulation; Information Governance, Metadata, Data Lineage
Maximize the business value of data
Reduce Costs: Reduce total cost of ownership of data warehouse by intelligently archiving and compressing historical data
Improve Performance: Increase data warehouse performance by archiving dormant data, leveraging a tiered storage strategy
Minimize Risk: Support data retention needs, as well as legal hold requirements within the data warehouse
Diagram labels: Production Data Warehouse [Hot Data]; Archive Data Warehouse [Warm Data]; Data Archives [Cold Data]; Aging Data; Archive Data; IBM InfoSphere Optim; Hadoop; Archive
Data Warehouse Augmentation: Queryable Archive
Use Cases
• Immediate storage alternative for cold data
• Cost savings for cold data
• Compliance requirements
• Simple analytics / exploration
• When you find new correlations, go back and re-mine the archive data to gain additional insight
Enables an immediate storage alternative. Queryable Archive often serves as an initial step toward more advanced integration with the EDW and advanced Hadoop analytics.
PureData System for Analytics
PureData System for Hadoop
Optim EasyArchive for PureData System for Hadoop
For Easy Data Provisioning from PureData System for Analytics
• Included application allows migration of data from PureData System for Analytics to PureData System for Hadoop at over 2 TB/hr, out of the box
• Provides simple, built-in user interface to allow users to migrate data between systems easily
• Enables quick configuration and scheduling of data migration
• Employs parallel processing between BigInsights and PDA/Netezza
• Leverages IBM-developed MapReduce programming for parallel processing
• Utilizes Hive to allow for immediate access to migrated data
IBM InfoSphere Optim Test Data Management
Create “right-size” environments with realistic data for application testing & development
Requirements / Benefits
• Deploy new functionality more quickly and with improved quality
• Easily create & maintain test environments
• Protect sensitive information from misuse & fraud with data masking
• Accelerate test data provisioning through refresh & automation
• Create referentially intact, “right-sized” test databases
• Compare data across dev/test iterations to identify hidden errors
• Protect confidential data used in test, training & development
• Shorten iterative testing cycles and accelerate time to market
Diagram labels: Production or Production Clone; Development; Unit Test; UAT; Integration Test; Relational data sets
Operations: Subset, Mask, Refresh, Compare
Database sizes shown: 100 GB, 200 GB, 1 TB, 20 GB, 20 TB
Test Data Management & Masking in warehouse environments
✓ Create or refresh targeted, “right-sized” subset test database more efficiently
✓ Mask sensitive/confidential fields in-flight or in-place
✓ Deploy multiple BI/Analytics/ETL test databases quickly when required
✓ Maintain data referential integrity
✓ Compare data across dev iterations & ETL transformations to test & validate faster
InfoSphere Optim operations: Extract, Subset, Mask, Load, Refresh, Compare
Diagram labels: Production Environment; Data Extract files; Non-Production (TEST, DEV, QA)
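What "referentially intact" means for a subset can be shown with a short Python sketch: choose the parent rows first, then keep only the child rows whose foreign key points at a kept parent. The table and column names (`customers`, `orders`, `customer_id`) and the `subset` helper are hypothetical and used only for illustration.

```python
def subset(customers, orders, predicate):
    """Return the chosen parent rows plus only the child rows that reference them."""
    kept_customers = [c for c in customers if predicate(c)]
    kept_ids = {c["customer_id"] for c in kept_customers}
    kept_orders = [o for o in orders if o["customer_id"] in kept_ids]
    return kept_customers, kept_orders


customers = [
    {"customer_id": 1, "state": "IL"},
    {"customer_id": 2, "state": "TX"},
    {"customer_id": 3, "state": "IL"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},
    {"order_id": 13, "customer_id": 1},
]

# Keep only Illinois customers and the orders that belong to them.
test_customers, test_orders = subset(customers, orders, lambda c: c["state"] == "IL")
```

Every order in the result still has a matching customer, so joins and foreign-key constraints in the smaller test database keep working.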
Improve PDA/Netezza DW test data delivery
Reduce Costs: Automate creation of realistic “right sized” test data to reduce the number of defects caught late in the test cycle
Reduce Risk: Mask sensitive information for protection and for compliance with global and industry regulations
Speed Delivery: Refresh test data easily to speed the testing and delivery of the data warehouse
Diagram labels: Production Environment; “Masked” Gold Master (Subset & Mask); Test Environment (Subset / Compare / Refresh); Development Environment (Subset / Compare / Refresh); IBM InfoSphere Optim; Archive
IBM InfoSphere Optim solves key data challenges
Identify Relevant & Sensitive Data: Find what data must be retained, protected or removed
Optimize Test Data: Automate and optimize the application test processes that rely on data
Dispose of Unnecessary Data: Remove unnecessary data from critical transactional or analytics applications
Retain Essential Data: Historical inactive data is safely retained while easily accessible for reports and compliance
Protect Sensitive Data: Private data (customer IDs, credit cards and financial data) are masked or redacted
Outcomes: ↓ Costs, ↓ Data Security Risk, ↑ Availability, ↑ Application Performance, ↑ Speed to make changes
Thank You
Your feedback is important!
• Access the Conference Agenda Builder to complete your session surveys
  o Any web or mobile browser at http://iod13surveys.com/surveys.html
  o Any Agenda Builder kiosk onsite