TESTING FAX USING SSS AND FDR DATASETS
2ND APRIL 2013
Ilija Vukotic [email protected] 2
DETAILS
Dataset: user.flegger.*.data12_8TeV.00212172.physics_Muons.merge.NTUP_SMWZ.f479_m1228_p1067_p1141_tid01007411_00
Size: 500 GB
WNs: UC3 and UCT3
Discovery: Global redirector
Running against: fax.mwt2.org
Ramp-up: 4 jobs a minute
Full data copy, split into 138 jobs for each site
Average input size: 3.62 GB
Duration does not include the time a job waits to start.
Duration does not include dq2-put time.
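As a quick consistency check on the numbers above, the job count, average input size and ramp-up rate can be combined as in the minimal sketch below. The function names are illustrative only and are not part of the test setup described on the slide.

```python
# Figures quoted on the DETAILS slide.
JOBS_PER_SITE = 138
AVG_INPUT_GB = 3.62
RAMP_UP_JOBS_PER_MIN = 4


def total_input_gb(jobs: int, avg_input_gb: float) -> float:
    """Total data volume read by all jobs at one site."""
    return jobs * avg_input_gb


def ramp_up_minutes(jobs: int, jobs_per_min: float) -> float:
    """Time needed to submit all jobs at a constant ramp-up rate."""
    return jobs / jobs_per_min


if __name__ == "__main__":
    print(f"total input: {total_input_gb(JOBS_PER_SITE, AVG_INPUT_GB):.2f} GB")
    print(f"ramp-up: {ramp_up_minutes(JOBS_PER_SITE, RAMP_UP_JOBS_PER_MIN):.1f} min")
```

With the slide's numbers this gives 499.56 GB, consistent with the quoted 500 GB, and a ramp-up of 34.5 minutes per site.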
JOBS
MWT2
- 2 jobs hanging: they finished with no error, but only the next day
- UCT3 shows the same efficiency as UC3
- Avg. CPU efficiency: 76.5%
- Avg. duration: 5 h 59 min
- Avg. rate: 290 kB/s
- Total rate: 39 MB/s
[Plot: job duration (h:mm:ss) vs. input size (GB). Linear fit: f(x) = 0.0417583087209933 x + 0.0981891437026868, R² = 0.0196108364468329]
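The fit lines and R² values quoted on these plots are ordinary least-squares results. The sketch below shows how such a fit can be computed from (input size, duration) pairs. It is an illustration only: the per-job data points are not in the slides, so the function is shown without them, and the name `linear_fit` is hypothetical.

```python
from typing import Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares fit y = a*x + b; returns (a, b, r_squared)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two points, with xs and ys of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0:
        raise ValueError("all x values are identical")
    a = sxy / sxx                 # slope
    b = mean_y - a * mean_x       # intercept
    # For a straight-line fit, R^2 equals the squared correlation coefficient.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return a, b, r_squared
```

An R² close to 0, as on the MWT2 plot, means that input size explains almost none of the variation in job duration; an R² close to 1 would mean duration scales almost linearly with size.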
AGLT2
- 4 jobs hanging: they finished with no error, but only the next day
- Avg. CPU efficiency: 70.5%
- Avg. duration: 6 h 14 min
- Avg. rate: 165 kB/s
- Total rate: 22 MB/s
[Plot: job duration (seconds) vs. input size (GB). Linear fit: f(x) = 0.0733543217178845 x − 0.00457386591299641, R² = 0.545737785138636]
BU
[Plot: job duration (h:mm:ss) vs. input size (GB). Linear fit: f(x) = 0.112782777608483 x + 0.052593461910591, R² = 0.26105314335286]
- 18 jobs hanging
- Avg. CPU efficiency: 35%
- Avg. dur. 11 h 2 min
- Avg. rate: 108 kB/s
- Total rate: 14 MB/s
MWT2 – 300 BRANCHES
[Plot: job duration (h:mm:ss) vs. input size (GB)]
- 48 jobs in parallel
- Avg. CPU efficiency: 17%
- Avg. dur. 3 h 20 min
- Avg. rate: 926 kB/s
- Total rate: 44 MB/s
CONCLUSION 1
Rechecked that dq2-put times were not included.
Times seem to be properly measured.
Need to solve the mystery of the huge CPU times.
• May have to move to the C++ version.
SSS DOING XRDCP
The same dataset, but doing a simple xrdcp to /dev/null.
Up to 290 jobs in parallel (UC3 and UCT3)
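A single copy-to-/dev/null measurement of this kind could look like the sketch below. It assumes the `xrdcp` client is on the PATH; the helper names and the file path in the usage note are hypothetical, and this is not the script actually used for the tests.

```python
import subprocess
import time
from typing import List, Tuple


def build_xrdcp_command(redirector: str, path: str) -> List[str]:
    """Build an xrdcp command that reads a remote file and discards the data."""
    # -f (force overwrite) is commonly passed when the destination is /dev/null.
    # `path` is expected to start with "/", giving the usual root://host//path form.
    return ["xrdcp", "-f", f"root://{redirector}/{path}", "/dev/null"]


def rate_mb_per_s(size_bytes: int, seconds: float) -> float:
    """Average transfer rate in MB/s (1 MB = 10^6 bytes)."""
    return size_bytes / seconds / 1e6


def timed_copy(redirector: str, path: str) -> Tuple[int, float]:
    """Run one copy; return (xrdcp exit code, wall-clock seconds)."""
    cmd = build_xrdcp_command(redirector, path)
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, time.monotonic() - start
```

For example, `timed_copy("glrd.usatlas.org", "/atlas/hypothetical/file.root")` would time one read through the global redirector; a non-zero exit code marks the transfer as failed rather than as a slow one.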
SSS DOING XRDCP
Wanted to test all sites that are in FAX and have the FDR dataset.
Most did not work:
• When asked through glrd.usatlas.org.
• Some of them even when asked directly.
• Some work for 5-10 files but then give up.
• Some work on repeated queries.
ML monitor not adequate anymore.
• CERN, some UK sites sending all traffic
• Something strange with AGLT2 numbers
• Something wrong with ML
SSS DOING XRDCP
Errors mostly:
Last server error 10000 ('') Error accessing path/file for … (BNL)
Very strange error in setting up the environment. Not FAX related.
Created //.asetup. Please look and (optional) edit it.
AtlasSetup(WARNING): Unable to write ${HOME} save file
mkdir: cannot create directory `//workarea': Permission denied
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/utilities/createUserASetup.sh: line 40: //.asetup: Permission denied
RESULTS
CONCLUSION 2
Automatic tests for SSB are not enough.
In the absence of users who would report problems, additional manual checks will be needed from time to time.
Monitoring needs to be validated end to end.
Huge differences in rates: a cost matrix is needed ASAP.
Rates observed sound reasonable.
Our understanding would hugely benefit from perfSonar tests over the same links.
TESTING FAX USING HC AND FDR DATASETS
2ND APRIL 2013
20019750
7 worked
3 did not start
4 failed
20019750
SWT2_CPB: Log put error: Error copying the file: 256, cp: cannot create regular file /xrd/atlasuserdisk/user.gangarbt.hc20019750.ANALY_SWT2_CPB.25/user.gangarbt.32893735._
SLAC: Put error: Error copying the file: 256, cp: accessing `/xrootd/atlas/atlasuserdisk/user.gangarbt.hc20019750.ANALY_SLAC.43/user.gangarbt.32887595.EXT0._00418.HWWSkimmedNTUP.root?oss.cgroup=ATLASUSERDISK': Transport endpoint is not connected
QMUL: Get error: Staging input file failed
MWT2: Download: 2444 seconds
ROMA1: Finished: 44, Timed out: 12
FZK: Finished: 4, Timed out: 46. Get error: Staging input file failed
ECDF: Finished: 36, Failed: 11. pilotErrorDiag: Too little space left on local disk to run job
CERN: Get error: Staging input file failed
BU: Finished: 23, Failed: 12. Not enough local space for staging input files and run the job
AGLT: Finished: 17
BNL: Finished: 231, Failed: 8 (lost heartbeat or unspecified)
OU_OCHEP_SWT2, JINR, FZU: did not start
20019750
20019749
The same idea as 20019750, but with many more sites and random files: user.flegger.*…
Did not work as I expected: each site always ran against a randomly chosen but fixed dataset.
20019749
CONCLUSION 3
While there are many failures, some seem easy to fix (not enough space on disk, etc.)
Some are the same ones observed in SSS based tests.
We need to look at performance. Often it is better to fail than have very low performance. How low is unacceptably low?
Need to start looking at sites that are not part of FAX.
DIRECT FDR HC JOBS
CONCLUSION
Testing:
• Need faster turnaround.
• Would it help:
  • Every 6 hours, one HC-submitted job at each ANALY queue
  • Against a very stable door
• With the tools we have now there is no way to precisely stress test sites.
• Fill up the table on slide 21. Make it green.
Monitoring:
• ML is almost useless now.
• Need full validation, especially of the CERN FAX dashboard.
SYSTEMATIC FDR LOAD TESTS IN PROGRESS
US cloud results. 10 jobs * 10 SMWZ files ~ 50GB
[Chart: XRDCP, rate (MB/s) by source. Axis sites: MWT2, BNL-ATLAS, AGLT2, BU_ATLAS_Tier2, WT2. Legend entries recovered: BNL-ATLAS, AGLT2, OU_OCHEP_SWT2]
[Chart: Read 10% ev. 30MB TTC, rate (MB/s) by source. Axis sites: MWT2, BNL-ATLAS, AGLT2, BU_ATLAS_Tier2, WT2. Legend entries recovered: BNL-ATLAS, AGLT2, OU_OCHEP_SWT2]
CPU limited
Factors affecting spreads: pair-wise network latency, throughput, storage “business”
SYSTEMATIC FDR LOAD TESTS IN PROGRESS
US cloud results
SYSTEMATIC FDR LOAD TESTS IN PROGRESS
EU cloud results
[Chart: XRDCP, rate (MB/s) by source. Axis sites: BNL-ATLAS, CERN-PROD, ECDF, ROMA1, QMUL. Legend: BNL-ATLAS, CERN-PROD, ECDF, DESY-HH, ROMA1, QMUL]
SYSTEMATIC FDR LOAD TESTS IN PROGRESS
EU cloud results
events/s (rows: source, columns: destination)
source       BNL-ATLAS   CERN-PROD   ECDF      ROMA1     QMUL
BNL-ATLAS    126.76      29.4        25.1      26.05     57.26
CERN-PROD    82.68       232.52      108.46    123.52    145.96
ECDF         80.68       56.06       252.39    62.83     145.18
ROMA1        32          73.66       23.95     197.01    49.72
QMUL         41.34       24.14       52.2      99.43     105.46

MB/s (rows: source, columns: destination)
source       BNL-ATLAS   CERN-PROD   ECDF      ROMA1     QMUL
BNL-ATLAS    13.07       3.03        2.61      2.65      5.84
CERN-PROD    8.36        23.26       11.02     12.71     14.68
ECDF         8.23        5.64        25.14     6.52      14.42
ROMA1        3.15        7.49        2.47      20.77     4.79
QMUL         4.26        2.6         5.33      9.65      10.38
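The MB/s table can be read as a first, rough cost matrix. The sketch below stores it as a source-by-destination matrix and picks the fastest remote source for a given destination. It assumes rows are sources and columns are destinations, as the table labels suggest; the function name is hypothetical and no such tool is described in the slides.

```python
from typing import List, Tuple

SITES: List[str] = ["BNL-ATLAS", "CERN-PROD", "ECDF", "ROMA1", "QMUL"]

# MB/s from the EU cloud table; assumed layout: rows = source, columns = destination.
RATE_MB_S: List[List[float]] = [
    [13.07, 3.03, 2.61, 2.65, 5.84],     # source BNL-ATLAS
    [8.36, 23.26, 11.02, 12.71, 14.68],  # source CERN-PROD
    [8.23, 5.64, 25.14, 6.52, 14.42],    # source ECDF
    [3.15, 7.49, 2.47, 20.77, 4.79],     # source ROMA1
    [4.26, 2.60, 5.33, 9.65, 10.38],     # source QMUL
]


def best_remote_source(destination: str) -> Tuple[str, float]:
    """Return (site, MB/s) of the fastest source other than the destination itself."""
    d = SITES.index(destination)
    candidates = [
        (RATE_MB_S[s][d], SITES[s]) for s in range(len(SITES)) if s != d
    ]
    rate, site = max(candidates)
    return site, rate


if __name__ == "__main__":
    for dest in SITES:
        site, rate = best_remote_source(dest)
        print(f"{dest}: best remote source {site} at {rate} MB/s")
```

In every row the diagonal (local) entry is the largest, so reading locally is always fastest in this data set; the off-diagonal spread, from about 2.5 to 14.7 MB/s, is what a cost matrix would need to capture.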
[Chart: Read 10% events 30MB TTC, rate (MB/s) by source. Axis sites: BNL-ATLAS, CERN-PROD, ECDF, ROMA1, QMUL. Legend: BNL-ATLAS, CERN-PROD, ECDF, ROMA1, QMUL]