Real life experiences of RAC scalability with a 6 node cluster

Eric Grancher, eric.grancher@cern.ch, CERN IT
Anton Topurov, anton.topurov@cern.ch, openlab, CERN IT

CERN IT Department, CH-1211 Genève 23, Switzerland (www.cern.ch/it)
Outline
• CERN computing challenge
• Oracle RDBMS and Oracle RAC @ CERN
• RAC scalability: what, why and how?
• Real life scalability examples
• Conclusions
• References
LHC gets ready …
Jürgen Knobloch / CERN
The LHC Computing Challenge
• Signal/Noise: 10⁻⁹
• Data volume: high rate × large number of channels × 4 experiments → 15 PetaBytes of new data each year
• Compute power: event complexity × number of events × thousands of users → 100k of (today's) fastest CPUs
• Worldwide analysis & funding: computing funded locally in major regions & countries, efficient analysis everywhere → GRID technology

Jürgen Knobloch / CERN
WLCG Collaboration
• The Collaboration
  – 4 LHC experiments
  – ~250 computing centres: 12 large centres (Tier-0, Tier-1) and 38 federations of smaller "Tier-2" centres
  – Growing to ~40 countries
  – Grids: EGEE, OSG, NorduGrid
• Technical Design Reports: WLCG, 4 experiments, June 2005
• Memorandum of Understanding: agreed in October 2005
• Resources: 5-year forward look
Jürgen Knobloch / CERN
Centers around the world form a Supercomputer
• The EGEE and OSG projects are the basis of the Worldwide LHC Computing Grid Project WLCG
Inter-operation between Grids is working!

Jürgen Knobloch / CERN
CPU & Disk Requirements 2006
[Charts: projected CPU (MSI2000) and disk (PB) requirements per year, 2007-2010, broken down by experiment (ALICE, ATLAS, CMS, LHCb) and by tier (CERN, Tier-1, Tier-2). CERN's share: ~10%.]
Jürgen Knobloch / CERN
Oracle databases at CERN
• 1982: CERN starts using Oracle
• 1996: Oracle Parallel Server (OPS) on Sun SPARC Solaris
• 2000: use of Linux x86 for Oracle RDBMS
• Today: Oracle RAC for the most demanding services:
  – CASTOR mass storage system (15 PB / year)
  – Administrative applications (AIS)
  – Accelerators and controls, etc.
  – LHC Computing Grid (LCG)
(our view of) RAC basics
• Shared disk infrastructure, all disk devices have to be accessible from all servers
• Shared buffer cache (with coherency!)
[Diagram: clients → DB servers → shared storage]
Linux RAC deployment example
• RAC for the Administrative Services move (2006)
  – 4 nodes × 4 CPUs, 16 GB RAM per node
  – Red Hat Enterprise Linux 4 / x86_64
  – Oracle RAC 10.2.0.3
• Consolidation of administrative applications, starting with:
  – CERN expenditure tracking
  – HR management, planning and follow-up, as well as a self-service application
  – Computer resource allocation, central repository of information ("foundation") …
• Goals: reduce the number of databases, profit from high availability, and profit from information sharing between the applications (same database)
Linux RAC deployment for administrative applications
• Use of Oracle 10g services
  – One service per application OLTP workload
  – One service per application batch workload
  – One service per application external access
• Use of virtual IPs
• Raw devices only for the OCR and voting (quorum) devices
• ASM disk management (2 disk groups)
• RMAN backup to TSM
• Things became much easier and more stable over time (GC_FILES_TO_LOCKS, shared raw device volumes …); good experience with consolidation on RAC
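For illustration, a minimal sketch of creating such per-workload services with the 10g DBMS_SERVICE package (the service names and the application "ais" are hypothetical; on a RAC cluster this is typically done with srvctl so the services are managed cluster-wide):

    -- Hypothetical per-workload services for an application "ais".
    BEGIN
      DBMS_SERVICE.CREATE_SERVICE(service_name => 'ais_oltp',  network_name => 'ais_oltp');
      DBMS_SERVICE.CREATE_SERVICE(service_name => 'ais_batch', network_name => 'ais_batch');
      DBMS_SERVICE.CREATE_SERVICE(service_name => 'ais_ext',   network_name => 'ais_ext');
      -- Start the services on the current instance.
      DBMS_SERVICE.START_SERVICE('ais_oltp');
      DBMS_SERVICE.START_SERVICE('ais_batch');
      DBMS_SERVICE.START_SERVICE('ais_ext');
    END;
    /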
RAC Scalability
Frits Ahlefeldt-Laurvig / http://downloadillustration.com/
RAC Scalability (1)
Two ways of upgrading database performance:

• Upgrading the hardware (scale up): expensive and inflexible
• Adding more hardware (scale out): less expensive hardware and more flexible
RAC Scalability (2)
Adding more hardware (scale out)

Pros:
• Cheap and flexible
• Getting more popular
• Oracle RAC is there to help you

Cons:
• More is not always better
• Achieving the desired scalability is not easy

Why?
Reason 1
• Shared disk infrastructure, all disk devices have to be accessible from all servers
• Shared buffer cache (with coherency!)
[Diagram: clients → DB servers → shared storage]
Reason 2
Reason 3
Examples
2 real life RAC scalability examples
• CASTOR Name Server
• PVSS
CASTOR Name Server
• CASTOR
  – CERN Advanced STORage manager
  – Stores physics production files and user files
• CASTOR Name Server
  – Implements the hierarchical view of the CASTOR name space
  – Multithreaded software
  – Uses an Oracle database to store file metadata
Stress Test Application
• Multithreaded
• Used with up to 40 threads
• Each thread loops 5000 times over:
  – Creating a file
  – Checking its parameters
  – Changing the size of the file
• Tests made:
  – Single instance vs. 2-node RAC
  – No changes in schema or application code
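A minimal PL/SQL sketch of the test loop, against a simplified, hypothetical cns_file_metadata table (the real stress client is multithreaded application code; the table shape and column names are assumptions):

    -- Hypothetical, simplified stand-in for the name server test loop.
    CREATE TABLE cns_file_metadata (
      fileid    NUMBER PRIMARY KEY,
      parent_id NUMBER,
      name      VARCHAR2(255),
      filesize  NUMBER
    );

    DECLARE
      v_size cns_file_metadata.filesize%TYPE;
    BEGIN
      FOR i IN 1 .. 5000 LOOP
        -- 1. create a file entry
        INSERT INTO cns_file_metadata (fileid, parent_id, name, filesize)
        VALUES (i, 0, 'file_' || i, 0);
        -- 2. check its parameters
        SELECT filesize INTO v_size FROM cns_file_metadata WHERE fileid = i;
        -- 3. change the size of the file
        UPDATE cns_file_metadata SET filesize = 1024 WHERE fileid = i;
        COMMIT;
      END LOOP;
    END;
    /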
Result

[Chart: "CNS: Single instance vs RAC". X axis: number of threads (1, 2, 5, 7, 10, 12, 14, 16, 20, 25, 30, 35, 40); Y axis: ops/s (0-600). Two series: single instance and RAC with 2 instances.]
Analysis (1/2)

Problem:
• Contention on the CNS_FILE_METADATA table

Change:
• Hash partitioning with a local primary key index

Result:
• 10% gain, but still worse than the single instance
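A sketch of what such a change can look like on the simplified table from the earlier sketch (illustrative DDL only; the real CASTOR schema differs):

    -- Hash-partition the hot table to spread insert/update activity over
    -- many segments, with a local (per-partition) primary key index.
    CREATE TABLE cns_file_metadata (
      fileid    NUMBER,
      parent_id NUMBER,
      name      VARCHAR2(255),
      filesize  NUMBER,
      CONSTRAINT pk_cns_file PRIMARY KEY (fileid) USING INDEX LOCAL
    )
    PARTITION BY HASH (fileid) PARTITIONS 16;

Hash partitioning spreads rows and index blocks across partitions, which reduces block-level contention between instances; it cannot remove row-level lock contention, which is what the next slide shows.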
Analysis (2/2)
• Top event:
  – enq: TX - row lock contention
  – Again on CNS_FILE_METADATA
• Findings:
  – Application logic causes row lock contention
  – Table structure reorganization cannot help
• Follow-up:
  – No simple solution
  – Work in progress
PVSS II
• Commercial SCADA application – critical for LHC and experiments
• Archiving in Oracle Database
• Out-of-the-box performance: 100 "changes" per second
• CERN needs 150 000 changes per second = 1500 times faster!
The Tuning Process
1. Run the workload; gather ASH/AWR information, 10046 traces …
2. Find the top event that slows down the processing
3. Understand why time is spent on this event
4. Modify the client code, database schema, database code, or hardware configuration; then repeat
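For step 1, a hedged sketch of capturing that evidence on 10g (the sid/serial values are placeholders taken from V$SESSION):

    -- AWR snapshot before the run.
    EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;

    -- 10046-style trace with wait events for one client session.
    EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 123, serial_num => 45, waits => TRUE, binds => FALSE);

    -- ... run the workload ...

    EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 123, serial_num => 45);
    -- AWR snapshot after the run; the report covering the two snapshots
    -- is then produced with ?/rdbms/admin/awrrpt.sql.
    EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;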
PVSS Tuning (1/6)
• Shared resource: EVENTS_HISTORY (ELEMENT_ID, VALUE, …)
• Each client "measures" an input and registers its history via a "merge" operation into the EVENTS_HISTORY table

Performance:
• 100 "changes" per second

[Diagram: 150 clients → DB servers → storage. Each client issues "Update eventlastval set …" against the eventlastval table; a trigger on update of eventlastval performs a merge (…) into the events_history table.]
PVSS Tuning (2/6)
Initial state observation:
• The database is waiting on the clients: "SQL*Net message from client"
• Use of a generic C++/DB library
• Individual inserts (one statement per entry)
• Update of a table which keeps the "latest state", through a trigger
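A minimal sketch of that initial pattern, assuming simplified shapes for the two tables (column names beyond the slide's ELEMENT_ID/VALUE are assumptions):

    CREATE TABLE eventlastval (
      element_id NUMBER PRIMARY KEY,
      value      NUMBER,
      ts         TIMESTAMP
    );

    CREATE TABLE events_history (
      element_id NUMBER,
      value      NUMBER,
      ts         TIMESTAMP
    );

    -- Trigger on the "latest state" table: every single-row client update
    -- is merged into the history table, one row at a time.
    CREATE OR REPLACE TRIGGER trg_eventlastval_hist
    AFTER UPDATE ON eventlastval
    FOR EACH ROW
    BEGIN
      MERGE INTO events_history h
      USING (SELECT :NEW.element_id AS element_id,
                    :NEW.value      AS value,
                    :NEW.ts         AS ts
             FROM dual) s
      ON (h.element_id = s.element_id AND h.ts = s.ts)
      WHEN MATCHED THEN UPDATE SET h.value = s.value
      WHEN NOT MATCHED THEN INSERT (element_id, value, ts)
                            VALUES (s.element_id, s.value, s.ts);
    END;
    /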
PVSS Tuning (3/6)
Changes:
• Bulk insert into a temporary table with OCCI, then call PL/SQL to load the data into the history table (see the sketch after the table below)

Performance:
• 2 000 changes per second

Now top event: "db file sequential read"
Top timed events (awrrpt_1_5489_5490.html):

Event                   | Waits  | Time(s) | % Total DB Time | Wait Class
------------------------|--------|---------|-----------------|------------
db file sequential read | 29,242 | 137     | 42.56           | User I/O
enq: TX - contention    | 41     | 120     | 37.22           | Other
CPU time                |        | 61      | 18.88           |
log file parallel write | 1,133  | 19      | 5.81            | System I/O
db file parallel write  | 3,951  | 12      | 3.73            | System I/O
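A hedged sketch of the staged load (the staging table name and load procedure are assumptions; the OCCI bulk insert itself happens on the client side and is not shown):

    -- Session-private staging table, bulk-filled by each client via OCCI.
    CREATE GLOBAL TEMPORARY TABLE events_temp (
      element_id NUMBER,
      value      NUMBER,
      ts         TIMESTAMP
    ) ON COMMIT DELETE ROWS;

    -- One set-based statement replaces thousands of single-row inserts.
    CREATE OR REPLACE PROCEDURE load_events_history AS
    BEGIN
      INSERT INTO events_history (element_id, value, ts)
      SELECT element_id, value, ts FROM events_temp;
      COMMIT;
    END;
    /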
PVSS Tuning (4/6)
Changes:
• Index usage analysis and reduction
• Table structure changes: index-organized table (IOT)
• Replacement of the merge by an insert
• Use of "direct path load" with ETL

Performance:
• 16 000 "changes" per second

Now top event: cluster-related wait events
Top timed events (test5_rac_node1_8709_8710.html):

Event                  | Waits   | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class
-----------------------|---------|---------|--------------|-------------------|-----------
gc buffer busy         | 27,883  | 728     | 26           | 31.6              | Cluster
CPU time               |         | 369     |              | 16.0              |
gc current block busy  | 6,818   | 255     | 37           | 11.1              | Cluster
gc current grant busy  | 24,370  | 228     | 9            | 9.9               | Cluster
gc current block 2-way | 118,454 | 198     | 2            | 8.6               | Cluster
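A sketch of what the stage-4 schema changes can look like, reusing the assumed names from the earlier sketches:

    -- Index-organized history table: rows are stored in primary key order,
    -- so lookups by (element_id, ts) need no separate index segment.
    CREATE TABLE events_history (
      element_id NUMBER,
      ts         TIMESTAMP,
      value      NUMBER,
      CONSTRAINT pk_events_history PRIMARY KEY (element_id, ts)
    ) ORGANIZATION INDEX;

    -- The per-row MERGE is replaced by a plain set-based INSERT:
    -- history rows are append-only, so no update branch is needed.
    -- (events_temp as created in the earlier sketch.)
    INSERT INTO events_history (element_id, ts, value)
    SELECT element_id, ts, value FROM events_temp;
    COMMIT;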
PVSS Tuning (5/6)
Changes:
• Each "client" receives a unique number
• Partitioned table
• Use of "direct path load" into the client's partition with ETL

Performance:
• 150 000 changes per second

Now top event: "freezes" once in a while
Top timed events (rate75000_awrrpt_2_872_873.html):

Event                          | Waits   | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class
-------------------------------|---------|---------|--------------|-------------------|--------------
row cache lock                 | 813     | 665     | 818          | 27.6              | Concurrency
gc current multi block request | 7,218   | 155     | 22           | 6.4               | Cluster
CPU time                       |         | 123     |              | 5.1               |
log file parallel write        | 1,542   | 109     | 71           | 4.5               | System I/O
undo segment extension         | 785,439 | 88      | 0            | 3.6               | Configuration
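A hedged sketch of the stage-5 layout (DDL is illustrative; the eventh name comes from the schema slide below, while the list-partitioning choice and column names are assumptions):

    -- One partition per client number, so concurrent direct-path loads
    -- from different clients never touch the same partition.
    CREATE TABLE eventh (
      client_id  NUMBER,
      element_id NUMBER,
      ts         TIMESTAMP,
      value      NUMBER
    )
    PARTITION BY LIST (client_id) (
      PARTITION p1 VALUES (1),
      PARTITION p2 VALUES (2)
      -- ... one partition per client ...
    );

    -- Direct-path load into "this client's" partition.
    INSERT /*+ APPEND */ INTO eventh PARTITION (p1)
      (client_id, element_id, ts, value)
    SELECT 1, element_id, ts, value FROM events_temp;
    COMMIT;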
PVSS Tuning (6/6)
Problem investigation:
• Link between the foreground process and the ASM processes
• ASH report and 10046 trace difficult to interpret

Problem identification:
• ASM space allocation is blocking some operations

Changes:
• Space pre-allocation, as a background task (see the sketch below)

Result:
• Stable 150 000 "changes" per second
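One hedged way to express the pre-allocation as a background task (extent size, partition name and schedule are assumptions):

    -- Scheduler job that regularly pre-allocates an extent for the active
    -- history partition, so loads do not block on ASM space allocation.
    BEGIN
      DBMS_SCHEDULER.CREATE_JOB(
        job_name        => 'prealloc_eventh',
        job_type        => 'PLSQL_BLOCK',
        job_action      => q'[BEGIN
          EXECUTE IMMEDIATE
            'ALTER TABLE eventh MODIFY PARTITION p1 ALLOCATE EXTENT (SIZE 100M)';
        END;]',
        repeat_interval => 'FREQ=MINUTELY; INTERVAL=5',
        enabled         => TRUE);
    END;
    /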
PVSS Tuning Schema
[Diagram: before and after, each as 150 clients → DB servers → storage. Before: each client issues "Update eventlastval set …" against the eventlastval table, and a trigger on update of eventlastval merges (…) each change into the events_history table. After: the clients bulk insert into a temp table, and PL/SQL loads the history with "insert /*+ APPEND */ into eventh (…) PARTITION (1) select … from temp".]
PVSS Tuning Summary
Conclusion:
• From 100 changes per second to 150 000 "changes" per second
• 6-node RAC (dual CPU, 4 GB RAM per node), 32 SATA disks with an FCP link to the host
• 4 months of effort:
  – Re-writing part of the application, with interface changes (C++ code)
  – Changes to the database code (PL/SQL)
  – Schema change
  – Numerous work sessions, in joint work with other CERN IT groups
Scalability Conclusions
• ASM / cluster filesystem / NAS allow
  – a much easier deployment
  – far less complexity and risk of human error
• 10g RAC is much easier to tune
• RAC can boost your application's performance, but it will also expose weak design points and magnify their impact
• Proper application design is the key to almost linear scalability for a non-read-only application
Recommendations
• Understand the extra features
• Take advantage of RAC connection management
  – Use "services" with resource management
  – Load balancing: client side, server side, client + server side …
• Use connection failover mechanisms (a sketch follows this list):
  – "Transparent Application Failover": re-connection; modifications not yet committed are lost
  – "Fast Connection Failover" (10g): notification based; the client side has to be implemented
  – Minimum: plain re-connection
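For instance, a hedged sketch of server-side TAF attributes on a 10gR2 service (the service name is reused from the hypothetical example earlier; a FAILOVER_MODE entry in tnsnames.ora is the equivalent client-side route):

    -- Hypothetical: let sessions of this service fail over to a surviving
    -- instance and replay in-flight SELECTs (uncommitted changes are lost).
    BEGIN
      DBMS_SERVICE.MODIFY_SERVICE(
        service_name        => 'ais_oltp',
        aq_ha_notifications => TRUE,   -- FAN events for Fast Connection Failover
        failover_method     => DBMS_SERVICE.FAILOVER_METHOD_BASIC,
        failover_type       => DBMS_SERVICE.FAILOVER_TYPE_SELECT,
        failover_retries    => 180,
        failover_delay      => 5);
    END;
    /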
The Message to You
• RAC can scale, even for write-intensive applications
• As you have seen, RAC scalability tuning is challenging
• Create RAC-aware applications from the beginning
  – Application design is key
• The effort is paid back:
  – Better application performance
  – A highly available system
  – Flexibility and a lower hardware price
References
• "Pro Oracle Database 10g RAC on Linux", Julian Dyke and Steve Shaw
• "Connection management in RAC", James Morle
• "RAC awareness SQL", pythian.com
• "ETL 10gR2 loading", Tom Kyte
• CERN IT-DES group web site
Q&A