52
Understanding Issue Correlations: A Case Study of the Hadoop System Jian Huang Xuechen Zhang Karsten Schwan

Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

Understanding Issue Correlations: A Case Study of the Hadoop System

Jian Huang

Xuechen Zhang† Karsten Schwan †

Page 2: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

2

Why Issue Study Matters?

Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System

Page 3: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

2

Why Issue Study Matters?

Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System Error-prone

+

Page 4: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

2

Why Issue Study Matters?

Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System Error-prone

+ Hard to Debug

+

Page 5: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

2

Why Issue Study Matters?

Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System

Issue Study

Issue Pattern

Error-prone

+ Hard to Debug

+

Page 6: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

2

Why Issue Study Matters?

Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System

Issue Study

Issue Pattern

Error-prone

+ Hard to Debug

+

Better Software & Debugging Tools

+

Page 7: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

3

Hadoop: A Representative Distributed System

Page 8: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

3

Hadoop: A Representative Distributed System

0

2

4

6

8

10

2008 2009 2010 2011 2012 2013 2014 2015

Numb

er of

Rep

orted

Issu

es

(x100

0)

The Evolution of Apache Hadoop

HDFS (Storage) MapRedue (Computation)

Page 9: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

3

Hadoop: A Representative Distributed System

0

2

4

6

8

10

2008 2009 2010 2011 2012 2013 2014 2015

Numb

er of

Rep

orted

Issu

es

(x100

0)

The Evolution of Apache Hadoop

HDFS (Storage) MapRedue (Computation)

……

Page 10: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

3

Hadoop: A Representative Distributed System

Learn from issues – more than 6 years of experience.

0

2

4

6

8

10

2008 2009 2010 2011 2012 2013 2014 2015

Numb

er of

Rep

orted

Issu

es

(x100

0)

The Evolution of Apache Hadoop

HDFS (Storage) MapRedue (Computation)

……

Page 11: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

4

What Can We Learn From Issues?

[Gunawi et al., SoCC’14] What Bugs Live in the Cloud?

[Lu et al., FAST’13] A Study of Linux File System Evolution ……

Related Work

Page 12: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

4

What Can We Learn From Issues?

[Gunawi et al., SoCC’14] What Bugs Live in the Cloud?

[Lu et al., FAST’13] A Study of Linux File System Evolution ……

Related Work

Our Focus: Issue Correlations

Tools

Programming

Systems

Page 13: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

5

Our Findings

•  Half of the issues are independent •  MapReduce issues tend to relate to YARN •  One third of the issues have similar causes •  ......

Page 14: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

5

Our Findings

•  Half of the issues are independent •  MapReduce issues tend to relate to YARN •  One third of the issues have similar causes •  ......

Tools

Programming

Systems

•  Memory: GC is still the No. 1 concern •  Storage: “99.99% of data reliability” is challenged •  Programming: one third of them relate to interfaces •  Tools: the logging in Hadoop is error-prone •  ......

Page 15: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

6

Methodology Used in Our Study …

HDFS

HBase

HCatalog Mahout

MapReduce

Cascading

Hive Pig Flume …

Hadoop Ecosystem

Page 16: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

6

Methodology Used in Our Study

Computation

Storage

HDFS

HBase

HCatalog Mahout

MapReduce

Cascading

Hive Pig Flume …

Hadoop Ecosystem

Page 17: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

6

Methodology Used in Our Study

Computation

Storage

HDFS

HBase

HCatalog Mahout

MapReduce

Cascading

Hive Pig Flume …

Closed Issues

Examined Issues 2180 2038 2359 2340

Hadoop Ecosystem

Page 18: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

6

Methodology Used in Our Study

Computation

Storage

HDFS

HBase

HCatalog Mahout

MapReduce

Cascading

Hive Pig Flume …

Closed Issues

Examined Issues 2180 2038 2359 2340 Sampling Rate

89.8%

Hadoop Ecosystem

Page 19: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

6

Methodology Used in Our Study

Computation

Storage

HDFS

HBase

HCatalog Mahout

MapReduce

Cascading

Hive Pig Flume …

Closed Issues

Examined Issues 2180 2038 2359 2340

Sampling Period ~6 years 5 years

Sampling Rate 89.8%

Hadoop Ecosystem

Page 20: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

6

Methodology Used in Our Study Issues

Description Patches Follow-up Discussions

Source Code Analysis

Page 21: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

6

Methodology Used in Our Study Issues

Description Patches Follow-up Discussions

Source Code Analysis

IssueID

Create/Commit Time Subcomponent Type Causes

CorrelatedIssueID …… HPatchDB

Labeling

Page 22: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

7

Where Are the Correlated Issues From?

Do you know where I’m from?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

Page 23: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

External Correlation correlated issues appear in other systems Internal Correlation correlated issues appear in the same system

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

Page 24: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

Page 25: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

A significant number of issues are independent.

Page 26: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

Half of them are from YARN.

Page 27: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

Page 28: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

Half of them are independent.

Page 29: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

A

B C

7

Where Are the Correlated Issues From?

#Correlated Issues 0 1 2 3 >=4

External HDFS 94.7% 4.8% 0.5% - -

MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%

Internal HDFS 52.7% 32.8% 9.1% 3.1% 2.3%

MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%

Page 30: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

8

How the Issues Are Correlated?

Do you know our relationship?

Page 31: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

8

How the Issues Are Correlated?

Similar Causes Issues have similar causes Blocking Other Issues Issues need to be fixed before fixing other issues Fix on Fix Issues are caused by fixing other issues

Page 32: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

8

How the Issues Are Correlated?

0

10

20

30

40

HDFS MapReduce

Perce

ntage

(%)

Similar Causes Blocking Other Issues Fix on Fix

26-33% of the issues have similar causes.

Page 33: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

8

How the Issues Are Correlated?

0

10

20

30

40

HDFS MapReduce

Perce

ntage

(%)

Similar Causes Blocking Other Issues Fix on Fix

These issues that block others appear more frequently in HDFS.

Page 34: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

8

How the Issues Are Correlated?

0

10

20

30

40

HDFS MapReduce

Perce

ntage

(%)

Similar Causes Blocking Other Issues Fix on Fix

Mostly due to functional dependency.

Page 35: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

9

Tools

Programming

Systems

On the Issue Correlations with System Characteristics

Page 36: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

9

Tools

Programming

Systems

47%

27%

26%

On the Issue Correlations with System Characteristics

Page 37: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

10

How Issues Relate to Systems?

Page 38: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

security networking storage file system memory cache 10

How Issues Relate to Systems?

Page 39: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

security networking storage file system memory cache 10

How Issues Relate to Systems?

•  LightWeightGSet Vs. java.util structure

•  Object cache for long lived object:

ReplicasMap, ReplicasInfo

GC is still the No.1 concern, memory-friendly objects are preferred.

Page 40: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

security networking storage file system memory cache 10

How Issues Relate to Systems?

File system semantic: namespace management, file permission,

consistency (e.g., fsck), etc.

Many issues happened in file system like EXT4 appear in Hadoop.

Page 41: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

security networking storage file system memory cache 10

How Issues Relate to Systems?

Issues in rack placement policy:

0.16% of blocks and their replicas are in the same rack upon system upgrade.

The statement of the 99.99% of data reliability in cloud storage is challenged.

Page 42: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

security networking storage file system memory cache 10

How Issues Relate to Systems?

One quarter of networking issues cause resource wastage.

Read a block:

Peer peer = newTcpPeer(dnAddr); -  return newBlockReader(…) + try{ + reader = newBlockReader(…) + return reader + } catch (IOException ex) { + throw ex; + } finally { + if(reader == null) closeQuietly(peer); + }

Socket leak !

Page 43: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

11

How Issues Relate to Programming?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

typo lock interface maintenance

Page 44: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

11

How Issues Relate to Programming?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

typo lock interface maintenance

Half of them relate to code maintenance.

Page 45: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

11

How Issues Relate to Programming?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

typo lock interface maintenance

Mainly caused by interface changes.

Page 46: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

11

How Issues Relate to Programming?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

typo lock interface maintenance

5.6% of programming issues are caused by typos !

A fsimage cannot be accessed due to:

-  elif [ “COMMAND” = “oiv_legacy” ] then + elif [ “$COMMAND” = “oiv_legacy” ] then

Page 47: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

12

How Issues Relate to Tools?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

configuration debugging documents testing

Page 48: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

12

How Issues Relate to Tools?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

configuration debugging documents testing

Logs are misleading: incorrect, incomplete, indistinct output.

Page 49: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

12

How Issues Relate to Tools?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

configuration debugging documents testing

Logs are misleading: incorrect, incomplete, indistinct output.

Accessing a non-exist file via WebHDFS, FileNotFoundException is expected, but we get this

Logs

Page 50: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

12

How Issues Relate to Tools?

0 10 20 30 40 50 60 70 80 90

100

HDFS MapReduce

Perce

ntage

(%)

configuration debugging documents testing

A majority of configuration issues are related to system performance.

59% of the 219 configuration parameters in MapReduce are performance related.

Page 51: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

13

Conclusion

2

Correlations Between Issues Issues are independent; 33% of issues have similar causes, etc.

Correlations With System Characteristics More efforts are required to achieve highly reliable distributed system

1

Tools

Programming

Systems

Page 52: Understanding Issue Correlations: A Case Study of the ...jhuang95/papers/SoCC15-Jian.pdfHadoop Ecosystem . 6 Methodology Used in Our Study Computation Storage … HDFS HBase HCatalog

Jian Huang [email protected]

Xuechen Zhang† Karsten Schwan

Thanks!

Q&A