Harness the power of your data: Build a next generation data platform on AWS
Raghu Prabhu
Global Manager, Business Development for Data Lakes
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
• Why do you need a next generation data platform?
• Why should you build on AWS?
• Business Outcomes & Sample Reference Architectures
  • Vanguard
  • Epic Games
  • Asurion
  • Salesforce
Why do you need a next generation data platform?
Companies want more value from their data
Data is:
• Growing exponentially
• From new sources
• Increasingly diverse
• Used by many people
• Analyzed by many applications
Complications:
• Siloed approaches don’t work anymore
• It’s too expensive and limiting to store data on-premises
Implication:
• A new approach is needed to extract insights and value
Traditionally, analytics looked like this
Relational data
GBs-TBs scale [not designed for PB/EBs]
Expensive: Large initial capex + $10K-$50K/TB/year
90% of data was thrown away because of cost
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Cloud data lakes are the future
Customers want:
Data Lake
Data demands are driving next generation architectures for analytics and innovation
Structured data: Data that are highly normalized with a common schema and stored in relational databases, powering transactional line-of-business applications
ERP, CRM, LOB applications
Semistructured data: Data that contain identifiers without conforming to a predefined schema
Mobile, Social, Sensors, POS terminals
Unstructured data: Data that do not conform to a data model and are typically stored as individual files
Phone calls, Images, Videos, Email
Batch load: Extracts data from various data sources at periodic intervals and moves them to the data lake
AWS Glue
Streaming: Ingests data that are generated from multiple sources such as log files, telemetry, mobile applications, and social networks
Amazon Kinesis
Amazon S3 data lake: Cloud-scale, centralized, and scalable architecture that enables enterprise data science
Amazon S3
Amazon Redshift
And data stored in the data lake can also be made directly searchable and queryable
Amazon Athena
Analytics: Data warehouses are repositories of normalized data and provide the foundational technology for BI
Amazon QuickSight
Amazon EMR
Amazon MSK
Machine Learning: Storing data in an Amazon S3 data lake enables customers to leverage predictive or prescriptive analytics, perform ad hoc analyses, and use AI/ML for automation and efficiency
Amazon SageMaker
AWS Deep Learning AMIs
Amazon EMR
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
AWS Glue Data Catalog
AWS Glue Crawler
Amazon S3
Amazon Redshift Spectrum
AWS Glue ETL
Amazon Athena
Amazon EMR
Amazon QuickSight
Amazon SageMaker
AWS Lake Formation
Goal #1: Security and governance layer
Goal #2: Manage S3 permissions for Analytics
Goal #3: Help build easy data ingestion pipelines
Your data Analytics and ML
Data Lakes on AWS
Why should you build on AWS?
© 2020, Amazon Web Services, Inc. or its Affiliates.
Migration & Streaming Services
Infrastructure Data Catalog & ETL
Security & Management
Data Warehousing
Big Data Processing
Interactive Query
Operational Analytics
Real-time Analytics
Serverless
Data processing
Data movement
Analytics
Data lake infrastructure & management
Dashboards Predictive Analytics
Data, visualization, engagement, & machine learning
Digital User Engagement
Data
Data movement
Analytics
Data lake infrastructure & management
Data, visualization, engagement, & machine learning
+ many more
Redshift | EMR (Spark & Hadoop) | Athena | Elasticsearch Service | Kinesis Data Analytics | AWS Glue (Spark & Python)
S3/Glacier | AWS Glue | Lake Formation
QuickSight SageMaker Comprehend Lex Polly Rekognition Translate
Database Migration Service | Snowball | Snowmobile | Kinesis Data Streams | Kinesis Data Firehose | Managed Streaming for Apache Kafka
Pinpoint | Data Exchange
NEW
Most secure: Services for security and governance
Compliance
AWS Artifact
Amazon Inspector
AWS CloudHSM
Amazon Cognito
AWS CloudTrail
Security
Amazon GuardDuty
AWS Shield
AWS WAF
Amazon Macie
VPC
Encryption
AWS Certificate Manager
AWS Key Management Service
Encryption at rest
Encryption in transit
Bring your own keys, HSM support
Identity
AWS IAM
AWS SSO
Amazon Cloud Directory
AWS Directory Service
AWS Organizations
Customers need to have multiple levels of security, identity and access management, encryption, and compliance to secure their data lake
Most secure — Certifications
Global
CSA: Cloud Security Alliance Controls
ISO 9001: Global Quality Standard
ISO 27001: Security Management Controls
ISO 27017: Cloud Specific Controls
ISO 27018: Personal Data Protection
PCI DSS Level 1: Payment Card Standards
SOC 1: Audit Controls Report
SOC 2: Security, Availability, & Confidentiality Report
SOC 3: General Controls Report
United States
CJIS: Criminal Justice Information Services
DoD SRG: DoD Data Processing
FedRAMP: Government Data Standards
FERPA: Educational Privacy Act
FIPS: Government Security Standards
FISMA: Federal Information Security Management
GxP: Quality Guidelines and Regulations
FFIEC: Financial Institutions Regulation
HIPAA: Protected Health Information
ITAR: International Arms Regulations
MPAA: Protected Media Content
NIST: National Institute of Standards and Technology
SEC Rule 17a-4(f): Financial Data Standards
VPAT/Section 508: Accountability Standards
Asia Pacific
FISC [Japan]: Financial Industry Information Systems
IRAP [Australia]: Australian Security Standards
K-ISMS [Korea]: Korean Information Security
MTCS Tier 3 [Singapore]: Multi-Tier Cloud Security Standard
My Number Act [Japan]: Personal Information Protection
Europe
C5 [Germany]: Operational Security Attestation
Cyber Essentials Plus [UK]: Cyber Threat Protection
G-Cloud [UK]: UK Government Standards
IT-Grundschutz [Germany]: Baseline Protection Methodology
Most cost effective: Decouple compute and storage, choice of pay-as-you-go (PAYG) analytics services
Storage
S3 tiers & intelligent tiering
From $0.023 per GB/mo to as low as $0.004 per GB/mo
Compute
Spot & reserved instances
Save up to 90% off on-demand prices
EMR
Autoscaling
57% less than on-premises per IDC report
Redshift
Less than a tenth of the cost of traditional solutions.
Athena & QuickSight
Serverless: pay only for what is used
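As a rough worked example of the storage figures on this slide, the sketch below converts the two quoted S3 prices into an annual cost per terabyte and sets them against the $10K-$50K/TB/year quoted earlier for traditional analytics. It assumes 1 TB = 1,000 GB and a flat price, and it ignores request, retrieval, and transfer charges; the function and variable names are illustrative only.

```python
from decimal import Decimal

GB_PER_TB = 1000        # assumption: decimal terabytes
MONTHS_PER_YEAR = 12

def annual_cost_per_tb(price_per_gb_month: Decimal) -> Decimal:
    """Annual storage cost for 1 TB at a flat per-GB monthly price."""
    return price_per_gb_month * GB_PER_TB * MONTHS_PER_YEAR

# Prices quoted on the slide (USD per GB per month).
s3_high = annual_cost_per_tb(Decimal("0.023"))
s3_low = annual_cost_per_tb(Decimal("0.004"))

# Low end of the traditional range quoted earlier (USD per TB per year).
traditional_low = Decimal(10_000)

print(f"S3, highest quoted price: ${s3_high}/TB/year")
print(f"S3, lowest quoted price:  ${s3_low}/TB/year")
print(f"Traditional low end vs highest S3 price: {traditional_low / s3_high:.0f}x")
```

Storage is only one part of the bill, so this comparison is indicative rather than a full cost model.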
More data lakes and analytics than anywhere else: Tens of thousands of data lakes run on AWS across all industries
Business Outcomes & Sample Reference Architectures
Vanguard is the largest provider of mutual funds and the second-largest provider of exchange-traded funds, with $5.1 trillion in assets.
CHALLENGE: LoBs are increasingly interested in analytics. IT needs to make ~1 PB of data actionable for fraud and investment fund analytics.
SOLUTION: Architect a cloud-based data lake solution with S3 and EMR. BI tools connect to Presto on EMR to democratize data.
RESULT: Empowered 150+ users with a ‘Ready for Analytics’ environment of curated data sets to realize operational efficiency and, with a phased approach, reduced EC2 & EMR costs by $600k.
[Diagram: Vanguard data lake]
On-premises sources: Databases, Files, Vendor data
AWS Cloud (Region), data lake stages:
Load raw (Process R1, Process R2) → Raw data (Bucket R1, Bucket R2, Bucket R3)
Cleansing (Process C1, Process C2) → Cleansed data (Bucket C1, Bucket C2, Bucket C3)
Transform (Process T1, Process T2) → Ready for analytics (Bucket T1, Bucket T2, Bucket T3)
Vanguard Reference Architecture
• Ingest 100+ data sources to S3 for ‘specific data domains.’
• Sqoop is used to bring data from databases, which is then stored in S3.
• Transient EMR clusters are used to clean raw data, transform data, and perform analytics.
• Data scientists run models via Jupyter & Spark on EMR using curated data sets or go directly to the raw data in S3, as needed.
• A Hive metastore holds all the metadata about the data and tables in the cluster.
• SQL analysts use Presto, Hue, and Hive to perform ad hoc queries.
• LoB users (institutional, retail, international) access the curated data sets in EMR via Presto, connected to their Tableau Server.
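The raw, cleansed, and ready-for-analytics stages described for this architecture can be sketched in a few lines. This is a toy illustration only: in-memory lists stand in for the S3 buckets, plain functions stand in for the transient EMR jobs, and the record fields (`account`, `amount`) are invented for the example rather than taken from the deck.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the three data lake zones.
zones = {"raw": [], "cleansed": [], "ready": []}

def load_raw(records):
    """Land source records unchanged in the raw zone."""
    zones["raw"].extend(records)

def cleanse():
    """Drop records with missing fields and normalize the rest."""
    zones["cleansed"] = [
        {"account": r["account"].strip().upper(), "amount": float(r["amount"])}
        for r in zones["raw"]
        if r.get("account") and r.get("amount") is not None
    ]

def transform():
    """Aggregate cleansed records into a curated per-account total."""
    totals = defaultdict(float)
    for r in zones["cleansed"]:
        totals[r["account"]] += r["amount"]
    zones["ready"] = [
        {"account": account, "total": total}
        for account, total in sorted(totals.items())
    ]

load_raw([
    {"account": " a1 ", "amount": "10.5"},
    {"account": "A1", "amount": 4.5},
    {"account": None, "amount": 3},      # incomplete record, dropped in cleansing
    {"account": "b2", "amount": 7},
])
cleanse()
transform()
print(zones["ready"])
```

Keeping the raw zone untouched is what lets data scientists go back to the original data when the curated sets are not enough, as the bullets above describe.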
Salesforce Marketing Cloud’s DMP captures, unifies, & activates data to strengthen relationships across every touchpoint
CHALLENGE: The DMP has 40 PB of data, growing 4% week over week (WoW). Look-alike models need to run at scale.
SOLUTION: EMR for data science, batch workloads, and on-demand segmentation using Spark and MapReduce, running 3,000 clusters (mostly transient) daily. Push results to S3, Redshift, and RDS.
RESULT: Optimized pipelines publish to 100s of APIs.
Every 60 seconds, the DMP processes nearly:
• 4.3M user match requests
• 1.6M page views
• 8.75M data capture events
• 700k ad impressions
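To put these figures in daily and yearly terms, the sketch below scales the per-minute counts to a full day and compounds the 4% weekly growth over 52 weeks. Both steps are assumptions made for illustration: the deck does not say that the per-minute rates hold around the clock or that the growth rate stays constant for a year.

```python
MINUTES_PER_DAY = 24 * 60

# Per-60-second figures quoted on the slide.
per_minute = {
    "user match requests": 4_300_000,
    "page views": 1_600_000,
    "data capture events": 8_750_000,
    "ad impressions": 700_000,
}

# Assumption: the quoted rate is sustained for the whole day.
per_day = {name: count * MINUTES_PER_DAY for name, count in per_minute.items()}
for name, count in per_day.items():
    print(f"{name}: {count:,} per day")

def compound(start: float, weekly_rate: float, weeks: int) -> float:
    """Size after compounding a constant weekly growth rate."""
    return start * (1 + weekly_rate) ** weeks

# Assumption: 4% week-over-week growth holds for 52 weeks.
print(f"40 PB after 52 weeks: about {compound(40, 0.04, 52):.0f} PB")
```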
[Diagram: Salesforce DMP pipeline]
Data Collection: Page, App, Log Files, CRM; Data Ingestion (Batch); Data Ingestion (Real Time Feed)
Data Processing: Event (time-based), On-Demand Segmentation, Data Science & Batch Workloads, NRT
Outputs: Ad hoc Queries/Insights, Redshift Spectrum, External APIs, Page, App
Consumers: Users, Client Partners
Salesforce DMP Reference Architecture
With 3,000 transient clusters running at scale, Salesforce controlled costs using Spot Instances and Instance Fleets
• Data Collection. Routinely process 40 PB of data aggregated from webpages, browsers, and mobile apps. This data is ingested into S3 via a real-time pipeline using Kafka.
• Data Processing. Data is ingested using Spark Streaming. Segmentation rules are kept in RDS. Using 3,000 transient EMR clusters with heavy dependency on Spot Instances and Instance Fleets, attributes are assigned and segmentation is completed. In real time, they can send this data to their partners to leverage in their targeted ads using look-alike models.
• Data Analytics & Activation. EMR pushes to Redshift, where insights are available to customers on the pages or apps for targeted ads to 100s of partners.
Fortnite is Epic Games’ sensation, attracting more than 140M players and growing by 2 PB/month.
CHALLENGE: Fortnite is free-to-play, with its revenue coming entirely from in-game microtransactions, so revenue depends on continuously capturing the attention of gamers through new content and continuous innovation. To operate this way, it needs up-to-the-minute insight into the gamer experience.
SOLUTION: Built an AWS data lake using Kinesis to feed telemetry data to S3, EMR for analytics, and DynamoDB for fast querying.
RESULT: Gain up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, resulting in the most popular game played in the world.
[Diagram: Epic Games analytics pipelines]
Sources: Game clients, Game servers, Launcher, Game services, APIs, Databases, Other sources
Near real-time pipeline: Kinesis, Spark on EMR, DynamoDB; outputs: Grafana, Scoreboards API, limited raw data (real-time ad-hoc SQL)
Batch pipelines: S3, ETL using EMR, S3 (data lake); outputs: Tableau/BI, ad-hoc SQL
Inputs to processing: User ETL, Metric definition
Epic Games Reference Architecture
Epic Games uses AWS Analytics platform to power the Fortnite game for 140M+ players
• Entire analytics platform running on AWS.
• Amazon S3 leveraged as a data lake.
• All telemetry data is collected with Amazon Kinesis.
• Large EMR cluster for the bulk of batch data processing & EMR Spark for real-time analytics.
• DynamoDB to create scoreboards and real-time queries.
• Game designers use data to inform their decisions, including what to patch, which weapons to introduce or discontinue, and more.
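The near real-time path (telemetry stream, windowed aggregation, key-value scoreboard) can be illustrated with a small pure-Python sketch. Nothing here calls Kinesis, Spark, or DynamoDB: a list stands in for the stream, a function stands in for the streaming job, and a dictionary keyed by window start stands in for the scoreboard table. The one-minute window and the event fields are assumptions for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumption: one-minute tumbling windows

def window_start(ts: int) -> int:
    """Start of the tumbling window that contains timestamp ts (seconds)."""
    return ts - ts % WINDOW_SECONDS

def build_scoreboards(events):
    """Sum points per player within each window, ranked highest first."""
    boards = defaultdict(lambda: defaultdict(int))
    for event in events:
        boards[window_start(event["ts"])][event["player"]] += event["points"]
    # Keyed by window start, the way a key-value table might be partitioned.
    return {
        start: sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        for start, scores in boards.items()
    }

# Hypothetical telemetry events (timestamps in seconds).
events = [
    {"ts": 5, "player": "p1", "points": 10},
    {"ts": 42, "player": "p2", "points": 25},
    {"ts": 59, "player": "p1", "points": 20},
    {"ts": 61, "player": "p2", "points": 5},
]
scoreboards = build_scoreboards(events)
print(scoreboards[0])
print(scoreboards[60])
```

Pre-aggregating per window keeps scoreboard reads to a single key lookup, which is the kind of access pattern a key-value store serves well.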
Asurion is a leader in device protection and customer support solutions for smartphones, tablets, consumer electronics, and appliances.
CHALLENGE: Need to analyze semi-structured and unstructured data sets from voice-to-text, claims data, and social media from >290M customers for customer behavior insights.
SOLUTION: Use S3 as a data lake to store all data in a single location, with EMR processing raw data that is then pushed to Redshift for fast BI.
RESULT: Achieved improved technical efficiencies and cost savings of 55% when compared to On-Demand and 40% when compared to Reserved Instances.
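The two savings figures can be related to each other with a little arithmetic. Assuming both percentages describe the same workload and the same final cost (the deck does not state this), the sketch below derives what Reserved Instance pricing would have cost relative to On-Demand.

```python
from fractions import Fraction

# Savings quoted on the slide.
saving_vs_on_demand = Fraction(55, 100)
saving_vs_reserved = Fraction(40, 100)

# Assumption: both savings refer to the same actual cost, so
#   actual = (1 - 0.55) * on_demand  and  actual = (1 - 0.40) * reserved.
reserved_over_on_demand = (1 - saving_vs_on_demand) / (1 - saving_vs_reserved)
implied_reserved_discount = 1 - reserved_over_on_demand

print(f"Reserved cost as a share of On-Demand: {float(reserved_over_on_demand):.0%}")
print(f"Implied Reserved discount vs On-Demand: {float(implied_reserved_discount):.0%}")
```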
• Asurion has a PB-scale platform collecting data from 20+ sources, including telephony, voice-to-text, claims data, and social media sites.
• Real-time data is streamed with Kinesis and Lambda.
• Raw data lands in S3, which serves as the central repository and data lake.
• EMR is used to process and curate data using Apache Hive, Apache Spark, and Presto.
• The data lake supports:
  • 1,000 ad hoc queries/day
  • 25+ Spark jobs
  • 2 PB of data
• Redshift provides fast analytics for BI tools.
[Diagram: Asurion data lake]
Sources: Data from other applications, Data from OLTP, Application Events & Logging, Domain applications, Real-time data
Data Collection Services (ODBC/JDBC, CDC, Event Streaming, APIs)
Services: Kinesis, S3, EMR, Redshift, DynamoDB
Data preparation: Process raw data into meaningful content
Access Virtualization (AD and User Profiles)
3rd party BI Tools
Orchestration: Jobs, Plans, Workflows; Enterprise Scheduler; Information Lifecycle Management
Asurion Reference Architecture
Asurion completes customer behavior insights on 290M members using AWS Data Lake & Analytics services
Thank You!