MySQL at Mastercard
James Sturrock, Operations Manager
February 15, 2018
©2018 Mastercard. Proprietary and Confidential. JANUARY 19, 2018
About us
• We employ over 13,000 people worldwide
• One of the most recognizable brands in the world
• Our vision is “a World Beyond Cash™”
• Our mission is: Every day, everywhere, we use our technology and expertise to make payments safe, simple and smart
Who am I?
• James Sturrock
• Operations Manager
• With Mastercard for over 7 years
• Part of the Payment Gateway Services division
What we do
• Our payment gateways process financial transactions for merchants globally, across a variety of sectors such as:
– Ecommerce (major online brands)
– Airlines
– Cardholder Present (pub and restaurant chains, high-street stores etc.)
• We bridge the gap between your bank authorizing a payment and the merchant receiving the funds
• Due to the nature of our business, operationally we must focus on maintaining three key objectives:
1. Security – we handle people’s personal data as well as cardholder data
2. Stability – there is a huge financial and reputational cost to merchants if people can’t buy things
3. Scalability – we need to ensure we can always cope with unexpected surges in traffic (Black Friday, sporting events etc.)
Why MySQL
• MySQL was a good fit for our Linux environment and open source approach
• Flexibility: you can use it in whatever way you need to
• Stability: MySQL is almost never the problem!
• Simplicity: MySQL can be used in as simple or as complex a way as you want
Why MySQL Enterprise
• … ensure that 3rd party vendor releases patches for security vulnerabilities in a timely manner. This can’t be guaranteed from the open source community.
• Traditionally we have failed to take advantage of the full suite of Enterprise tools such as:
– Enterprise Monitor
– Enterprise Authentication
– Enterprise Scalability
• These are all products which we are now using or evaluating!
General Overview
• Around 40 servers running MySQL
• Anywhere between 1 and 12 running instances of MySQL on a single machine
• Vast majority are running MySQL Enterprise Edition
• All running on Red Hat Enterprise Linux (64-bit)
[Diagram labels: Hardware, Presentation, Operating System, Database]
Monitoring
• We use MySQL Enterprise Monitor along with some legacy in-house log monitoring tools
• Nagios is used for system-level monitoring as well as basic MySQL checks (for example: are instances running, is replication stalled, how far behind is replication)
• Grafana is used for monitoring the “user experience” of the platform, which is often the best indicator of whether there is an actual problem
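The basic replication checks mentioned above could look roughly like the sketch below. It is not the checks used in production: the function name and both thresholds are hypothetical. It takes one row of `SHOW SLAVE STATUS` output as a dictionary and maps it to the standard Nagios plugin exit codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_replication(status, warn_seconds=30, crit_seconds=300):
    """Map one SHOW SLAVE STATUS row (as a dict) to a Nagios code and message.

    warn_seconds and crit_seconds are illustrative thresholds, not recommendations.
    """
    if not status:
        return UNKNOWN, "UNKNOWN: no replication status returned"
    # Both replication threads must be running, otherwise replication is stalled.
    if status.get("Slave_IO_Running") != "Yes" or status.get("Slave_SQL_Running") != "Yes":
        return CRITICAL, "CRITICAL: replication threads are not both running"
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        return CRITICAL, "CRITICAL: replication lag is unknown (NULL)"
    lag = int(lag)
    if lag >= crit_seconds:
        return CRITICAL, f"CRITICAL: replication is {lag}s behind"
    if lag >= warn_seconds:
        return WARNING, f"WARNING: replication is {lag}s behind"
    return OK, f"OK: replication is {lag}s behind"
```

A real plugin would fetch the row with a MySQL client library, print the message and exit with the returned code.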
Replication
• Classic “upside down tree” replication chain: a single read/write master replicates down the chain, one host at a time
• Having too many slaves replicating off one master can slow down the master!
• When carrying out a failover, there is much less remastering to be done
• Allows us to carry out major schema upgrades on all slaves, then fail over with no downtime
[Diagram: replication hosts labelled Host 1, Host 2, Host 3, Host 4, Host X]
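The remastering point can be illustrated with a small sketch. The host names and the helper function are hypothetical. A topology is modelled as a mapping from each slave to the host it replicates from, and the function lists the hosts that would need re-pointing when the master fails and a chosen slave is promoted.

```python
def hosts_to_repoint(replicates_from, failed_master, new_master):
    """Return the slaves that must be re-pointed after promoting new_master.

    replicates_from maps each slave to the host it currently replicates from.
    Only slaves that were attached directly to the failed master need a new source.
    """
    return sorted(
        host
        for host, source in replicates_from.items()
        if source == failed_master and host != new_master
    )


# Every slave hangs off the master: all remaining slaves need re-pointing.
star = {"host2": "host1", "host3": "host1", "host4": "host1"}

# Chain: each host replicates from the one above it.
chain = {"host2": "host1", "host3": "host2", "host4": "host3"}

print(hosts_to_repoint(star, "host1", "host2"))   # ['host3', 'host4']
print(hosts_to_repoint(chain, "host1", "host2"))  # []
```

In the chain, promoting the next host down leaves every other replication link untouched, which is the sense in which there is less remastering to do.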
High Availability
• Our replication structure and database design don’t give us high availability out of the box: there is still a single read/write master
• Red Hat Cluster Suite is layered on top of MySQL to provide automated failure detection and failover
• Built-in clustering and quorum functionality
• Essentially manages a VIP and ensures it is running on the correct host
• Custom health checks are run by the cluster software to determine if a MySQL instance or the entire host has crashed
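The decision logic behind this kind of setup can be sketched as follows. This is not Red Hat Cluster Suite code or its API: all function names are hypothetical, and the probes are reduced to two booleans. It shows how a health check can tell an instance failure from a host failure, and why the VIP should only move when the surviving side holds a majority of votes.

```python
def has_quorum(votes_present, total_votes):
    """A partition may act only if it holds a strict majority of the votes."""
    return votes_present > total_votes // 2


def classify_failure(host_reachable, mysql_responds):
    """Distinguish a crashed host from a crashed MySQL instance."""
    if not host_reachable:
        return "host_failed"
    if not mysql_responds:
        return "instance_failed"
    return "healthy"


def should_move_vip(host_reachable, mysql_responds, votes_present, total_votes):
    """Move the VIP only if the active node is unhealthy and we have quorum."""
    unhealthy = classify_failure(host_reachable, mysql_responds) != "healthy"
    return unhealthy and has_quorum(votes_present, total_votes)
```

Requiring quorum before moving the VIP is what stops two isolated halves of a cluster from both claiming the read/write master.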
Current Challenges…
• Replication lag during peak processing periods
– Could potentially be fixed by parallel replication
• Cumbersome process to isolate databases for kernel patching
– Could potentially be fixed by using GTID replication
– Could potentially be fixed by using tools like Salt/Fabric to automate
• Length of time for a cold-started database to become “hot” and fast enough to use
– Could potentially be fixed by migrating to InnoDB
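One reason GTID replication could simplify isolating and re-attaching databases is that each host records the set of transactions it has executed, so what a slave is missing is a set difference rather than a hand-calculated log file and position. The sketch below illustrates the idea in plain Python. The helper names are hypothetical and the server UUIDs are shortened for readability. MySQL provides the same operation as the `GTID_SUBTRACT()` function.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:8,uuid2:1-3' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid, set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A single number such as '8' is an interval of length one.
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def missing_transactions(source_executed, replica_executed):
    """Return transactions the source has executed that the replica has not."""
    source = parse_gtid_set(source_executed)
    replica = parse_gtid_set(replica_executed)
    missing = {}
    for uuid, numbers in source.items():
        diff = numbers - replica.get(uuid, set())
        if diff:
            missing[uuid] = sorted(diff)
    return missing


print(missing_transactions("aaa:1-10", "aaa:1-7"))  # {'aaa': [8, 9, 10]}
```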
The Future…
• Compliance considerations:
– MySQL 8?
– RHEL 7?
• Performance/usability improvements:
– Implement GTID replication
– Test parallel replication
• Tighter integration of MySQL into our “DevOps” toolkit
– Puppet
– Fabric/Salt
Lessons Learned…
• Use SSDs where possible!
• Always test schema changes on a dataset equivalent to production (and test the rollback as well as the rollout)
• You can never have too many monitoring metrics across your platform
• Having a production-like stress test environment is invaluable
• Historically, MySQL has not been the problem; hardware and software bottlenecks are more common
• Close database connections before reaching out to 3rd party services (avoids rapidly reaching the max_connections limit)
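The last lesson can be sketched as follows. Everything here is a hypothetical stand-in: `CountingPool` only counts connections instead of talking to MySQL, and `call_third_party` represents a slow external request. The point is the difference between holding a connection across the external call and releasing it first.

```python
from contextlib import contextmanager


class CountingPool:
    """Hypothetical stand-in for a connection pool; it only counts usage."""

    def __init__(self):
        self.in_use = 0
        self.held_during_external_call = 0

    @contextmanager
    def connection(self):
        self.in_use += 1
        try:
            yield self
        finally:
            self.in_use -= 1


def call_third_party(pool):
    """Placeholder for a slow external request; records connections held meanwhile."""
    pool.held_during_external_call = max(pool.held_during_external_call, pool.in_use)
    return "approved"


def process_holding_connection(pool):
    """Anti-pattern: the connection sits idle but occupied during the external call."""
    with pool.connection():
        order = "order-1"  # pretend this was read from the database
        result = call_third_party(pool)
    return order, result


def process_releasing_connection(pool):
    """Release the connection before the external call, reconnect afterwards."""
    with pool.connection():
        order = "order-1"  # pretend this was read from the database
    result = call_third_party(pool)
    with pool.connection():
        pass  # pretend the result is written back here
    return order, result
```

When the third party is slow, every request in the first version pins one connection for the whole wait, which is how the max_connections limit gets reached quickly.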