19
Mining Supercomputer Jobs' I/O Behavior from System Logs Xiaosong Ma

Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

MiningSupercomputerJobs'I/OBehaviorfromSystemLogs

Xiaosong Ma

Page 2: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

2

Rhea512 node

DevelopmentCluster

Eos736 Node

Cray XC30Cluster

Atlas1 Atlas2

Scalable IO Network (SION) - Infiniband

OSS

144 OSS Servers

OSSOSSOSS

OSSOSS

OSSOSS

1008 OST(LUN)

OSSOSS

OSSOSS

OSSOSS

OSSOSS

144 OSS Servers

OLCF Architecture Overview

1008 OST(LUN)

Page 3: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

3

Rhea512 node

DevelopmentCluster

Eos736 Node

Cray XC30Cluster

Atlas1

MySQLdatabase

Atlas2

Scalable IO Network (SION) - Infiniband

OSS

144 OSS Servers

OSSOSSOSS

OSSOSS

OSSOSS

1008 OST(LUN)

Per-OST I/O throughput

OSSOSS

OSSOSS

OSSOSS

OSSOSS

144 OSS Servers

OLCF Architecture Overview

Monitoring tool

1008 OST(LUN)Server-side I/O throughput logs

Page 4: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

4

Server-side I/O Throughput Logs

RAID controllerCoarse-granule logging

Page 5: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

5

Server-side I/O Throughput Logs

RAID controllerCoarse-granule logging

Page 6: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

6

I/O throughput logs

Server-side I/O Throughput Logs

RAID controllerCoarse-granule logging

Zero overhead

No user effort

No impact on user IO

Page 7: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

7

I/O throughput logs

Server-side I/O Throughput Logs

RAID controllerCoarse-granule logging

Zero overhead

No user effort

Mixed I/O traffic

No impact on user IO

Page 8: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s)

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s)

Prior Work: IOSI WorkflowTarget App

(User ID + App ID) Throughput logsJob scheduler logs

Start_time End_time2011-10-16 00:00 2011-10-16 02:012011-10-17 01:00 2011-10-17 04:002011-10-18 05:10 2011-10-18 07:20

Sample set

Per-sample wavelet

transform

Cross-sample I/O burst

identificationData

preprocessingIOSI

IOSI Input

IOSI Output

8

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s

)

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s

)

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s)

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s)

0 100 200 300 400 5000

0.5

1

1.5

2

2.5

3

3.5

Time (s)

Write

(GB/s

)

8

IOSI paper: “Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces”, FAST '14

Page 9: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s)

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s)

Prior Work: IOSI WorkflowTarget App

(User ID + App ID) Throughput logsJob scheduler logs

Start_time End_time2011-10-16 00:00 2011-10-16 02:012011-10-17 01:00 2011-10-17 04:002011-10-18 05:10 2011-10-18 07:20

Sample set

Per-sample wavelet

transform

Cross-sample I/O burst

identificationData

preprocessingIOSI

IOSI Input

IOSI Output

9

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s

)

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s

)

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s)

0 200 400 600 800 10000

1

2

3

4

5

Time (s)

Writ

e (G

B/s)

0 100 200 300 400 5000

0.5

1

1.5

2

2.5

3

3.5

Time (s)

Write

(GB/s

)

9

IOSI paper: “Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces”, FAST '14

Strong assumption: identical runs of app.

Page 10: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

10

AID: Automatic I/O Diverter

Job 1 Job 2 Job 3App1

App2

Job 1 Job 2 Job 3 Job 4App3

App4

App5

App6

Job 1 Job 4Job 3Job 2 Job 5

Job 1 Job 2 Job 3 Job 4 Job 5

Job 1 Job 2 Job 4 Job 5 Job 6Job 3

Time

Start_time End_time2015-10-16 00:00 2015-10-16 02:012015-10-17 01:00 2015-10-17 04:002015-10-18 05:10 2015-10-18 07:20

Job 1 Job 2 Job 4Job 3

Job 5

Scheduling suggestion

Automatically identifying I/O-heavy apps(No prior knowledge, no user involvement)

Page 11: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

11

AID: Automatic I/O Diverter

Job 1 Job 2 Job 3App1

App2

Job 1 Job 2 Job 3 Job 4App3

App4

App5

App6

Job 1 Job 4Job 3Job 2 Job 5

Job 1 Job 2 Job 3 Job 4 Job 5

Job 1 Job 2 Job 4 Job 5 Job 6Job 3

Time

Start_time End_time2015-10-16 00:00 2015-10-16 02:012015-10-17 01:00 2015-10-17 04:002015-10-18 05:10 2015-10-18 07:20

Job 1 Job 2 Job 4Job 3

Job 5

Scheduling suggestion

SC|16 Tech paper presentation:Thursday 2pm, 355D

Automatically identifying I/O-heavy apps(No prior knowledge, no user involvement)

Page 12: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

Application I/O Characterization Results

12

Name Value

Total number of logged jobs 181,969

Unique applications identified 9,998

Initial I/O-intensive candidates 95

Candidates passing scope checking 67

Candidates passing minimum support 42

User-verfied candidates 8

Result from 5 months’ Titan I/O traffic and job logs(User verification by email)

Page 13: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

Application I/O Characterization Results

13

Name Value

Total number of logged jobs 181,969

Unique applications identified 9,998

Initial I/O-intensive candidates 95

Candidates passing scope checking 67

Candidates passing minimum support 42

User-verfied candidates 8

ID Node Time(m) OST App. Domain

1 8192 1440 64 Geo-sciences

2 250 6-60 1008 Combustion

3 2048 30-185 1008 Astrophysics

4 1760 720 180 Combustion

5 1024 110-230 1008 Systems research

6 200 30-190 1008 Combustion

7 1008 13-17 1008 Computer Science

8 16388 43-310 800 Environmental

User-verified I/O-intensive applications

Page 14: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

Application I/O Characterization Results

14

Name Value

Total number of logged jobs 181,969

Unique applications identified 9,998

Initial I/O-intensive candidates 95

Candidates passing scope checking 67

Candidates passing minimum support 42

User-verfied candidates 8

Page 15: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

Application I/O Characterization Results

15

Name Value

Total number of logged jobs 181,969

Unique applications identified 9,998

Initial I/O-intensive candidates 95

Candidates passing scope checking 67

Candidates passing minimum support 42

User-verfied candidates 8

Page 16: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

Application I/O Characterization Results

16

Name Value

Total number of logged jobs 181,969

Unique applications identified 9,998

Initial I/O-intensive candidates 95

Candidates passing scope checking 67

Candidates passing minimum support 42

User-verfied candidates 8

Applications not using parallel I/O systems well!• Similar finding as Huong 2015 HPDC work (Darshan)• Motivates better I/O performance data analysis• Connecting programs to systems

Page 17: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

Questions?

Xiaosong [email protected]

Qatar Computing Research Institute, Hamad Bin Khalifa University

17

Page 18: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

I/O Contention on Large-Scale HPC Systems

• 27.1 PF Peak performance• 18,688 compute nodes

• 16-core AMD Opteron• Nvidia Tesla GPU• 32 + 6 GB memory

• 3-D Torus interconnect

ORNL’sTitan(World’s#3Supercomputer)

18

Performance variance on HPC• Shared parallel file system• I/O-heavy jobs collision -> I/O

performance degradation

I/O performance variance on Titan with IOR [6]

Page 19: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16

CDF of per-OST I/O throughput

19

88.4%time<1%capacity(5MB/s)

98.5%time<5%capacity(25MB/s)

99.6%time<20%capacity(100MB/s)