Real life experiences of RAC scalability with a 6 node cluster

Eric Grancher, eric.grancher@cern.ch, CERN IT
Anton Topurov, anton.topurov@cern.ch, openlab, CERN IT

CERN IT Department, CH-1211 Genève 23, Switzerland (www.cern.ch/it)
Outline
• CERN computing challenge
• Oracle RDBMS and Oracle RAC @ CERN
• RAC scalability: what, why and how?
• Real life scalability examples
• Conclusions
• References
LHC gets ready …
Jürgen Knobloch / CERN
The LHC Computing Challenge
• Signal/Noise: 10⁻⁹
• Data volume: high rate × large number of channels × 4 experiments → 15 PetaBytes of new data each year
• Compute power: event complexity × number of events × thousands of users → 100k of (today's) fastest CPUs
• Worldwide analysis & funding: computing funded locally in major regions & countries, efficient analysis everywhere → GRID technology

Jürgen Knobloch / CERN
WLCG Collaboration
• The Collaboration
  – 4 LHC experiments
  – ~250 computing centres: 12 large centres (Tier-0, Tier-1) and 38 federations of smaller "Tier-2" centres
  – Growing to ~40 countries
  – Grids: EGEE, OSG, NorduGrid
• Technical Design Reports: WLCG, 4 experiments, June 2005
• Memorandum of Understanding: agreed in October 2005
• Resources: 5-year forward look
Jürgen Knobloch / CERN
Centers around the world form a Supercomputer
• The EGEE and OSG projects are the basis of the Worldwide LHC Computing Grid Project WLCG
Inter-operation between Grids is working!

Jürgen Knobloch / CERN
CPU & Disk Requirements 2006
[Charts: projected CPU (MSI2000) and disk (PB) requirements per year, 2007-2010, broken down by experiment (ALICE, ATLAS, CMS, LHCb) and by tier (CERN, Tier-1, Tier-2). CERN's share: ~10%.]
Jürgen Knobloch / CERN
Oracle databases at CERN
• 1982: CERN starts using Oracle
• 1996: Oracle Parallel Server (OPS) on Sun SPARC Solaris
• 2000: use of Linux x86 for Oracle RDBMS
• Today: Oracle RAC for the most demanding services:
  – CASTOR mass storage system (15 PB / year)
  – Administrative applications (AIS)
  – Accelerators and controls, etc.
  – LHC Computing Grid (LCG)
(our view of) RAC basics
• Shared disk infrastructure, all disk devices have to be accessible from all servers
• Shared buffer cache (with coherency!)
[Diagram: clients → DB servers → shared storage]
Linux RAC deployment example
• RAC for the Administrative Services move (2006)
  – 4 nodes × 4 CPUs, 16 GB RAM per node
  – Red Hat Enterprise Linux 4 / x86_64
  – Oracle RAC 10.2.0.3
• Consolidation of administrative applications, starting with:
  – CERN expenditure tracking
  – HR management, planning and follow-up, as well as a self-service application
  – Computer resource allocation, central repository of information ("foundation") …
• Goals: reduce the number of databases, profit from high availability, and profit from information sharing between the applications (same database)
Linux RAC deployment for administrative applications
• Use of Oracle 10g services
  – One service per application OLTP workload
  – One service per application batch workload
  – One service per application external access
• Use of virtual IPs
• Raw devices only for the OCR and voting (quorum) devices
• ASM disk management (2 disk groups)
• RMAN backup to TSM
• Things became much easier and more stable over time (GC_FILES_TO_LOCKS, shared raw device volumes …); good experience with consolidation on RAC
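For illustration, a minimal sketch of creating such per-workload services with the 10g DBMS_SERVICE package (the service names and the application "ais" are hypothetical; on a RAC cluster this is typically done with srvctl so the services are managed cluster-wide):

    -- Hypothetical per-workload services for an application "ais".
    BEGIN
      DBMS_SERVICE.CREATE_SERVICE(service_name => 'ais_oltp',  network_name => 'ais_oltp');
      DBMS_SERVICE.CREATE_SERVICE(service_name => 'ais_batch', network_name => 'ais_batch');
      DBMS_SERVICE.CREATE_SERVICE(service_name => 'ais_ext',   network_name => 'ais_ext');
      -- Start the services on the current instance.
      DBMS_SERVICE.START_SERVICE('ais_oltp');
      DBMS_SERVICE.START_SERVICE('ais_batch');
      DBMS_SERVICE.START_SERVICE('ais_ext');
    END;
    /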
RAC Scalability
Frits Ahlefeldt-Laurvig / http://downloadillustration.com/
RAC Scalability (1)
Two ways of upgrading database performance:

• Upgrading the hardware (scale up): expensive and inflexible
• Adding more hardware (scale out): less expensive hardware and more flexible
RAC Scalability (2)
Adding more hardware (scale out)

Pros:
• Cheap and flexible
• Getting more popular
• Oracle RAC is there to help you

Cons:
• More is not always better
• Achieving the desired scalability is not easy

Why?
Reason 1
• Shared disk infrastructure, all disk devices have to be accessible from all servers
• Shared buffer cache (with coherency!)
[Diagram: clients → DB servers → shared storage]
Reason 2
Reason 3
Examples
2 real life RAC scalability examples
• CASTOR Name Server
• PVSS
CASTOR Name Server
• CASTOR
  – CERN Advanced STORage manager
  – Stores physics production files and user files
• CASTOR Name Server
  – Implements the hierarchical view of the CASTOR name space
  – Multithreaded software
  – Uses an Oracle database to store file metadata
Stress Test Application
• Multithreaded
• Used with up to 40 threads
• Each thread loops 5000 times over:
  – Creating a file
  – Checking its parameters
  – Changing the size of the file
• Tests made:
  – Single instance vs. 2-node RAC
  – No changes in schema or application code
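A minimal PL/SQL sketch of the test loop, against a simplified, hypothetical cns_file_metadata table (the real stress client is multithreaded application code; the table shape and column names are assumptions):

    -- Hypothetical, simplified stand-in for the name server test loop.
    CREATE TABLE cns_file_metadata (
      fileid    NUMBER PRIMARY KEY,
      parent_id NUMBER,
      name      VARCHAR2(255),
      filesize  NUMBER
    );

    DECLARE
      v_size cns_file_metadata.filesize%TYPE;
    BEGIN
      FOR i IN 1 .. 5000 LOOP
        -- 1. create a file entry
        INSERT INTO cns_file_metadata (fileid, parent_id, name, filesize)
        VALUES (i, 0, 'file_' || i, 0);
        -- 2. check its parameters
        SELECT filesize INTO v_size FROM cns_file_metadata WHERE fileid = i;
        -- 3. change the size of the file
        UPDATE cns_file_metadata SET filesize = 1024 WHERE fileid = i;
        COMMIT;
      END LOOP;
    END;
    /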
Result

[Chart: "CNS: Single instance vs RAC". X axis: number of threads (1, 2, 5, 7, 10, 12, 14, 16, 20, 25, 30, 35, 40); Y axis: ops/s (0-600). Two series: single instance and RAC with 2 instances.]
Analysis (1/2)

Problem:
• Contention on the CNS_FILE_METADATA table

Change:
• Hash partitioning with a local primary key index

Result:
• 10% gain, but still worse than the single instance
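A sketch of what such a change can look like on the simplified table from the earlier sketch (illustrative DDL only; the real CASTOR schema differs):

    -- Hash-partition the hot table to spread insert/update activity over
    -- many segments, with a local (per-partition) primary key index.
    CREATE TABLE cns_file_metadata (
      fileid    NUMBER,
      parent_id NUMBER,
      name      VARCHAR2(255),
      filesize  NUMBER,
      CONSTRAINT pk_cns_file PRIMARY KEY (fileid) USING INDEX LOCAL
    )
    PARTITION BY HASH (fileid) PARTITIONS 16;

Hash partitioning spreads rows and index blocks across partitions, which reduces block-level contention between instances; it cannot remove row-level lock contention, which is what the next slide shows.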
Analysis (2/2)
• Top event:
  – enq: TX - row lock contention
  – Again on CNS_FILE_METADATA
• Findings:
  – Application logic causes row lock contention
  – Table structure reorganization cannot help
• Follow-up:
  – No simple solution
  – Work in progress
PVSS II
• Commercial SCADA application – critical for LHC and experiments
• Archiving in Oracle Database
• Out-of-the-box performance: 100 "changes" per second
• CERN needs 150 000 changes per second = 1500 times faster!
The Tuning Process
1. Run the workload; gather ASH/AWR information, 10046 traces …
2. Find the top event that slows down the processing
3. Understand why time is spent on this event
4. Modify the client code, database schema, database code, or hardware configuration; then repeat
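For step 1, a hedged sketch of capturing that evidence on 10g (the sid/serial values are placeholders taken from V$SESSION):

    -- AWR snapshot before the run.
    EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;

    -- 10046-style trace with wait events for one client session.
    EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 123, serial_num => 45, waits => TRUE, binds => FALSE);

    -- ... run the workload ...

    EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 123, serial_num => 45);
    -- AWR snapshot after the run; the report covering the two snapshots
    -- is then produced with ?/rdbms/admin/awrrpt.sql.
    EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;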
PVSS Tuning (1/6)
• Shared resource: EVENTS_HISTORY (ELEMENT_ID, VALUE, …)
• Each client "measures" an input and registers its history via a "merge" operation into the EVENTS_HISTORY table

Performance:
• 100 "changes" per second

[Diagram: 150 clients → DB servers → storage. Each client issues "Update eventlastval set …" against the eventlastval table; a trigger on update of eventlastval performs a merge (…) into the events_history table.]
PVSS Tuning (2/6)
Initial state observation:
• The database is waiting on the clients: "SQL*Net message from client"
• Use of a generic C++/DB library
• Individual inserts (one statement per entry)
• Update of a table which keeps the "latest state", through a trigger
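A minimal sketch of that initial pattern, assuming simplified shapes for the two tables (column names beyond the slide's ELEMENT_ID/VALUE are assumptions):

    CREATE TABLE eventlastval (
      element_id NUMBER PRIMARY KEY,
      value      NUMBER,
      ts         TIMESTAMP
    );

    CREATE TABLE events_history (
      element_id NUMBER,
      value      NUMBER,
      ts         TIMESTAMP
    );

    -- Trigger on the "latest state" table: every single-row client update
    -- is merged into the history table, one row at a time.
    CREATE OR REPLACE TRIGGER trg_eventlastval_hist
    AFTER UPDATE ON eventlastval
    FOR EACH ROW
    BEGIN
      MERGE INTO events_history h
      USING (SELECT :NEW.element_id AS element_id,
                    :NEW.value      AS value,
                    :NEW.ts         AS ts
             FROM dual) s
      ON (h.element_id = s.element_id AND h.ts = s.ts)
      WHEN MATCHED THEN UPDATE SET h.value = s.value
      WHEN NOT MATCHED THEN INSERT (element_id, value, ts)
                            VALUES (s.element_id, s.value, s.ts);
    END;
    /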
PVSS Tuning (3/6)
Changes:
• Bulk insert into a temporary table with OCCI, then call PL/SQL to load the data into the history table (see the sketch after the table below)

Performance:
• 2 000 changes per second

Now top event: "db file sequential read"
Top timed events (awrrpt_1_5489_5490.html):

Event                   | Waits  | Time(s) | % Total DB Time | Wait Class
------------------------|--------|---------|-----------------|------------
db file sequential read | 29,242 | 137     | 42.56           | User I/O
enq: TX - contention    | 41     | 120     | 37.22           | Other
CPU time                |        | 61      | 18.88           |
log file parallel write | 1,133  | 19      | 5.81            | System I/O
db file parallel write  | 3,951  | 12      | 3.73            | System I/O
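A hedged sketch of the staged load (the staging table name and load procedure are assumptions; the OCCI bulk insert itself happens on the client side and is not shown):

    -- Session-private staging table, bulk-filled by each client via OCCI.
    CREATE GLOBAL TEMPORARY TABLE events_temp (
      element_id NUMBER,
      value      NUMBER,
      ts         TIMESTAMP
    ) ON COMMIT DELETE ROWS;

    -- One set-based statement replaces thousands of single-row inserts.
    CREATE OR REPLACE PROCEDURE load_events_history AS
    BEGIN
      INSERT INTO events_history (element_id, value, ts)
      SELECT element_id, value, ts FROM events_temp;
      COMMIT;
    END;
    /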
PVSS Tuning (4/6)
Changes:
• Index usage analysis and reduction
• Table structure changes: index-organized table (IOT)
• Replacement of the merge by an insert
• Use of "direct path load" with ETL

Performance:
• 16 000 "changes" per second

Now top event: cluster-related wait events
Top timed events (test5_rac_node1_8709_8710.html):

Event                  | Waits   | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class
-----------------------|---------|---------|--------------|-------------------|-----------
gc buffer busy         | 27,883  | 728     | 26           | 31.6              | Cluster
CPU time               |         | 369     |              | 16.0              |
gc current block busy  | 6,818   | 255     | 37           | 11.1              | Cluster
gc current grant busy  | 24,370  | 228     | 9            | 9.9               | Cluster
gc current block 2-way | 118,454 | 198     | 2            | 8.6               | Cluster
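A sketch of what the stage-4 schema changes can look like, reusing the assumed names from the earlier sketches:

    -- Index-organized history table: rows are stored in primary key order,
    -- so lookups by (element_id, ts) need no separate index segment.
    CREATE TABLE events_history (
      element_id NUMBER,
      ts         TIMESTAMP,
      value      NUMBER,
      CONSTRAINT pk_events_history PRIMARY KEY (element_id, ts)
    ) ORGANIZATION INDEX;

    -- The per-row MERGE is replaced by a plain set-based INSERT:
    -- history rows are append-only, so no update branch is needed.
    -- (events_temp as created in the earlier sketch.)
    INSERT INTO events_history (element_id, ts, value)
    SELECT element_id, ts, value FROM events_temp;
    COMMIT;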
PVSS Tuning (5/6)
Changes:
• Each "client" receives a unique number
• Partitioned table
• Use of "direct path load" into the client's partition with ETL

Performance:
• 150 000 changes per second

Now top event: "freezes" once in a while
Top timed events (rate75000_awrrpt_2_872_873.html):

Event                          | Waits   | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class
-------------------------------|---------|---------|--------------|-------------------|--------------
row cache lock                 | 813     | 665     | 818          | 27.6              | Concurrency
gc current multi block request | 7,218   | 155     | 22           | 6.4               | Cluster
CPU time                       |         | 123     |              | 5.1               |
log file parallel write        | 1,542   | 109     | 71           | 4.5               | System I/O
undo segment extension         | 785,439 | 88      | 0            | 3.6               | Configuration
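A hedged sketch of the stage-5 layout (DDL is illustrative; the eventh name comes from the schema slide below, while the list-partitioning choice and column names are assumptions):

    -- One partition per client number, so concurrent direct-path loads
    -- from different clients never touch the same partition.
    CREATE TABLE eventh (
      client_id  NUMBER,
      element_id NUMBER,
      ts         TIMESTAMP,
      value      NUMBER
    )
    PARTITION BY LIST (client_id) (
      PARTITION p1 VALUES (1),
      PARTITION p2 VALUES (2)
      -- ... one partition per client ...
    );

    -- Direct-path load into "this client's" partition.
    INSERT /*+ APPEND */ INTO eventh PARTITION (p1)
      (client_id, element_id, ts, value)
    SELECT 1, element_id, ts, value FROM events_temp;
    COMMIT;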
PVSS Tuning (6/6)
Problem investigation:
• Link between the foreground process and the ASM processes
• ASH report and 10046 trace difficult to interpret

Problem identification:
• ASM space allocation is blocking some operations

Changes:
• Space pre-allocation, as a background task (see the sketch below)

Result:
• Stable 150 000 "changes" per second
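One hedged way to express the pre-allocation as a background task (extent size, partition name and schedule are assumptions):

    -- Scheduler job that regularly pre-allocates an extent for the active
    -- history partition, so loads do not block on ASM space allocation.
    BEGIN
      DBMS_SCHEDULER.CREATE_JOB(
        job_name        => 'prealloc_eventh',
        job_type        => 'PLSQL_BLOCK',
        job_action      => q'[BEGIN
          EXECUTE IMMEDIATE
            'ALTER TABLE eventh MODIFY PARTITION p1 ALLOCATE EXTENT (SIZE 100M)';
        END;]',
        repeat_interval => 'FREQ=MINUTELY; INTERVAL=5',
        enabled         => TRUE);
    END;
    /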
PVSS Tuning Schema
[Diagram: before and after, each as 150 clients → DB servers → storage. Before: each client issues "Update eventlastval set …" against the eventlastval table, and a trigger on update of eventlastval merges (…) each change into the events_history table. After: the clients bulk insert into a temp table, and PL/SQL loads the history with "insert /*+ APPEND */ into eventh (…) PARTITION (1) select … from temp".]
PVSS Tuning Summary
Conclusion:
• From 100 changes per second to 150 000 "changes" per second
• 6-node RAC (dual CPU, 4 GB RAM per node), 32 SATA disks with an FCP link to the host
• 4 months of effort:
  – Re-writing part of the application, with interface changes (C++ code)
  – Changes to the database code (PL/SQL)
  – Schema change
  – Numerous work sessions, in joint work with other CERN IT groups
Scalability Conclusions
• ASM / cluster filesystem / NAS allow
  – a much easier deployment
  – far less complexity and risk of human error
• 10g RAC is much easier to tune
• RAC can boost your application's performance, but it will also expose weak design points and magnify their impact
• Proper application design is the key to almost linear scalability for a non-read-only application
Recommendations
• Understand the extra features
• Take advantage of RAC connection management
  – Use "services" with resource management
  – Load balancing: client side, server side, client + server side …
• Use connection failover mechanisms (a sketch follows this list):
  – "Transparent Application Failover": re-connection; modifications not yet committed are lost
  – "Fast Connection Failover" (10g): notification based; the client side has to be implemented
  – Minimum: plain re-connection
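For instance, a hedged sketch of server-side TAF attributes on a 10gR2 service (the service name is reused from the hypothetical example earlier; a FAILOVER_MODE entry in tnsnames.ora is the equivalent client-side route):

    -- Hypothetical: let sessions of this service fail over to a surviving
    -- instance and replay in-flight SELECTs (uncommitted changes are lost).
    BEGIN
      DBMS_SERVICE.MODIFY_SERVICE(
        service_name        => 'ais_oltp',
        aq_ha_notifications => TRUE,   -- FAN events for Fast Connection Failover
        failover_method     => DBMS_SERVICE.FAILOVER_METHOD_BASIC,
        failover_type       => DBMS_SERVICE.FAILOVER_TYPE_SELECT,
        failover_retries    => 180,
        failover_delay      => 5);
    END;
    /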
The Message to You
• RAC can scale, even for write-intensive applications
• As you have seen, RAC scalability tuning is challenging
• Create RAC-aware applications from the beginning
  – Application design is key
• The effort is paid back:
  – Better application performance
  – A highly available system
  – Flexibility and a lower hardware price
References
• "Pro Oracle Database 10g RAC on Linux", Julian Dyke and Steve Shaw
• "Connection management in RAC", James Morle
• "RAC awareness SQL", pythian.com
• "ETL 10gR2 loading", Tom Kyte
• CERN IT-DES group web site
Q&A