Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop...

Preview:

Citation preview

Understanding Issue Correlations: A Case Study of the Hadoop System

Jian Huang

Xuechen Zhang† Karsten Schwan †

2

Why Issue Study Matters?

Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System

2

Why Issue Study Matters?

Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System Error-prone

+

2

Why Issue Study Matters?

Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System Error-prone

+ Hard to Debug

+

2

Why Issue Study Matters?

Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System

Issue Study

Issue Pattern

Error-prone

+ Hard to Debug

+

2

Why Issue Study Matters?

Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System

Issue Study

Issue Pattern

Error-prone

+ Hard to Debug

+

Better Software & Debugging Tools

+

3

Hadoop: A Representative Distributed System

3

Hadoop: A Representative Distributed System

0

2

4

6

8

10

2008 2009 2010 2011 2012 2013 2014 2015

Numb

er of

Rep

orted

Issu

es

(x100

0)

The Evolution of Apache Hadoop

HDFS (Storage) MapRedue (Computation)

3

Hadoop: A Representative Distributed System

0

2

4

6

8

10

2008 2009 2010 2011 2012 2013 2014 2015

Numb

er of

Rep

orted

Issu

es

(x100

0)

The Evolution of Apache Hadoop

HDFS (Storage) MapRedue (Computation)

……

3

Hadoop: A Representative Distributed System

Learn from issues – more than 6 years of experience.

0

2

4

6

8

10

2008 2009 2010 2011 2012 2013 2014 2015

Numb

er of

Rep

orted

Issu

es

(x100

0)

The Evolution of Apache Hadoop

HDFS (Storage) MapRedue (Computation)

……

4

What Can We Learn From Issues?

[Gunawi et al., SoCC’14] What Bugs Live in the Cloud?

[Lu et al., FAST’13] A Study of Linux File System Evolution ……

Related Work

4

What Can We Learn From Issues?

[Gunawi et al., SoCC’14] What Bugs Live in the Cloud?

[Lu et al., FAST’13] A Study of Linux File System Evolution ……

Related Work

Our Focus: Issue Correlations

Tools

Programming

Systems

5

Our Findings

•  Half of the issues are independent •  MapReduce issues tend to relate to YARN •  One third of the issues have similar causes •  ......

5

Our Findings

•  Half of the issues are independent •  MapReduce issues tend to relate to YARN •  One third of the issues have similar causes •  ......

Tools

Programming

Systems

•  Memory: GC is still the No. 1 concern •  Storage: “99.99% of data reliability” is challenged •  Programming: one third of them relate to interfaces •  Tools: the logging in Hadoop is error-prone •  ......

6

Methodology Used in Our Study …

HDFS

HBase

HCatalog Mahout

MapReduce

Cascading

Hive Pig Flume …

Hadoop Ecosystem

6

Methodology Used in Our Study

Computation

Storage

HDFS

HBase

HCatalog Mahout

MapReduce

Cascading

Hive Pig Flume …

Hadoop Ecosystem

6

Methodology Used in Our Study

Computation

Storage

HDFS

HBase

HCatalog Mahout

MapReduce

Cascading

Hive Pig Flume …

Closed Issues

Examined Issues 2180 2038 2359 2340

Hadoop Ecosystem

6

Methodology Used in Our Study

Computation

Storage

HDFS

HBase

HCatalog Mahout

MapReduce

Cascading

Hive Pig Flume …

Closed Issues

Examined Issues 2180 2038 2359 2340 Sampling Rate

89.8%

Hadoop Ecosystem

6

Methodology Used in Our Study

Computation

Storage

HDFS

HBase

HCatalog Mahout

MapReduce

Cascading

Hive Pig Flume …

Closed Issues

Examined Issues 2180 2038 2359 2340

Sampling Period ~6 years 5 years

Sampling Rate 89.8%

Hadoop Ecosystem

6

Methodology Used in Our Study Issues

Description Patches Follow-up Discussions

Source Code Analysis

6

Methodology Used in Our Study Issues

Description Patches Follow-up Discussions

Source Code Analysis

IssueID

Create/Commit Time Subcomponent Type Causes

CorrelatedIssueID …… HPatchDB

Labeling

7

Where Are the Correlated Issues From?

Do you know where I’m from?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

External Correlation correlated issues appear in other systems Internal Correlation correlated issues appear in the same system

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

A significant number of issues are independent.

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

Half of them are from YARN.

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

Half of them are independent.

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

8

How the Issues Are Correlated?

Do you know our relationship?

8

How the Issues Are Correlated?

Similar Causes Issues have similar causes Blocking Other Issues Issues need to be fixed before fixing other issues Fix on Fix Issues are caused by fixing other issues

8

How the Issues Are Correlated?

0

10

20

30

40

HDFS MapReduce

Perce

ntage

(%)

Similar Causes Blocking Other Issues Fix on Fix

26-33% of the issues have similar causes.

8

How the Issues Are Correlated?

0

10

20

30

40

HDFS MapReduce

Perce

ntage

(%)

Similar Causes Blocking Other Issues Fix on Fix

These issues that block others appear more frequently in HDFS.

8

How the Issues Are Correlated?

0

10

20

30

40

HDFS MapReduce

Perce

ntage

(%)

Similar Causes Blocking Other Issues Fix on Fix

Mostly due to functional dependency.

9

Tools

Programming

Systems

On the Issue Correlations with System Characteristics

9

Tools

Programming

Systems

47%

27%

26%

On the Issue Correlations with System Characteristics

10

How Issues Relate to Systems?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

security networking storage file system memory cache 10

How Issues Relate to Systems?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

security networking storage file system memory cache 10

How Issues Relate to Systems?

•  LightWeightGSet Vs. java.util structure

•  Object cache for long lived object:

ReplicasMap, ReplicasInfo

GC is still the No.1 concern, memory-friendly objects are preferred.

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

security networking storage file system memory cache 10

How Issues Relate to Systems?

File system semantic: namespace management, file permission,

consistency (e.g., fsck), etc.

Many issues happened in file system like EXT4 appear in Hadoop.

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

security networking storage file system memory cache 10

How Issues Relate to Systems?

Issues in rack placement policy:

0.16% of blocks and their replicas are in the same rack upon system upgrade.

The statement of the 99.99% of data reliability in cloud storage is challenged.

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

security networking storage file system memory cache 10

How Issues Relate to Systems?

One quarter of networking issues cause resource wastage.

Read a block:

Peer peer = newTcpPeer(dnAddr); -  return newBlockReader(…) + try{ + reader = newBlockReader(…) + return reader + } catch (IOException ex) { + throw ex; + } finally { + if(reader == null) closeQuietly(peer); + }

Socket leak !

11

How Issues Relate to Programming?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

typo lock interface maintenance

11

How Issues Relate to Programming?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

typo lock interface maintenance

Half of them relate to code maintenance.

11

How Issues Relate to Programming?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

typo lock interface maintenance

Mainly caused by interface changes.

11

How Issues Relate to Programming?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

typo lock interface maintenance

5.6% of programming issues are caused by typos !

A fsimage cannot be accessed due to:

-  elif [ “COMMAND” = “oiv_legacy” ] then + elif [ “$COMMAND” = “oiv_legacy” ] then

12

How Issues Relate to Tools?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

configuration debugging documents testing

12

How Issues Relate to Tools?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

configuration debugging documents testing

Logs are misleading: incorrect, incomplete, indistinct output.

12

How Issues Relate to Tools?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

configuration debugging documents testing

Logs are misleading: incorrect, incomplete, indistinct output.

Accessing a non-exist file via WebHDFS, FileNotFoundException is expected, but we get this

Logs

12

How Issues Relate to Tools?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

configuration debugging documents testing

A majority of configuration issues are related to system performance.

59% of the 219 configuration parameters in MapReduce are performance related.

13

Conclusion

2

Correlations Between Issues Issues are independent; 33% of issues have similar causes, etc.

Correlations With System Characteristics More efforts are required to achieve highly reliable distributed system

1

Tools

Programming

Systems

Jian Huang jian.huang@gatech.edu

Xuechen Zhang† Karsten Schwan

Thanks!

Q&A

Recommended