EXPANDING IDENTIFIERS TO NORMALIZE SOURCE CODE VOCABULARY. PRESENTED BY DAWN LAWRIE, LOYOLA UNIVERSITY MARYLAND, IN COLLABORATION WITH DAVE BINKLEY. Friday, October 7, 11

Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

DESCRIPTION

Paper: Expanding Identifiers to Normalize Source Code Vocabulary

Authors: Dave Binkley and Dawn Lawrie

Session: Research Track 4: Natural Language Analysis

Page 1: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

EXPANDING IDENTIFIERS TO NORMALIZE SOURCE CODE VOCABULARY

PRESENTED BY DAWN LAWRIE

LOYOLA UNIVERSITY MARYLAND

IN COLLABORATION WITH DAVE BINKLEY

Friday, October 7, 11

Page 2: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

VOCABULARY MISMATCH

DIFFERENT VOCABULARY IN SOURCE CODE AND OTHER SOFTWARE ARTIFACTS

EXAMPLE

REQUIREMENT - “FEATURE LOCATION”

SOURCE CODE - “FEATURELOCATION”

OR WORSE “FLOC”

Page 3: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

PURPOSE OF NORMALIZE

COPE WITH VOCABULARY MISMATCH BETWEEN SOURCE CODE AND OTHER SOFTWARE DOCUMENTS

Page 4: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

EXAMPLE PROBLEMS

CONSIDER IDENTIFIERS

FEATURELOCATION -> FEATURE LOCATION (SPLITTING PROBLEM)

FLOC -> F LOC -> FEATURE LOCATION (SPLITTING AND EXPANSION PROBLEM)

Page 8: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

WHY NORMALIZE?

MANY SE PROBLEMS CAN BE ADDRESSED USING INFORMATION RETRIEVAL (IR) TECHNIQUES

UN-NORMALIZED CODE LEADS TO AN UNDERESTIMATE OF THE IMPORTANCE OF CRUCIAL WORDS
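
A small sketch of why un-normalized identifiers lead to an underestimate: term counts for a word such as "feature" stay at zero until identifiers are split and expanded. The token list and the `NORMALIZED` lookup table are hypothetical stand-ins for real source code and for the output of a normalizer.

```python
from collections import Counter

# Hypothetical tokens extracted from source code.
raw_tokens = ["featurelocation", "floc", "run"]

# Hypothetical normalization table (stand-in for a real splitter/expander).
NORMALIZED = {
    "featurelocation": ["feature", "location"],
    "floc": ["feature", "location"],
}

def normalize(tokens):
    """Replace each token by its normalized words, if any are known."""
    out = []
    for token in tokens:
        out.extend(NORMALIZED.get(token, [token]))
    return out

raw_tf = Counter(raw_tokens)
norm_tf = Counter(normalize(raw_tokens))

# The word "feature" is invisible to an IR model before normalization.
print(raw_tf["feature"], norm_tf["feature"])
```

With these toy tokens the count for "feature" goes from 0 to 2, so a query containing "feature" can only match the code after normalization.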

Page 9: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

NORMALIZE PROBLEM STATEMENT

FIND THE BEST EXPANSION OVER ALL POSSIBLE SPLITS

FLOC -> FEATURE LOCATION

Page 10: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

NORMALIZE ALGORITHM

TERMINOLOGY

HARD-WORD - WHITEHOUSE_LAWN (2 HARD WORDS: WHITEHOUSE, LAWN)

SOFT-WORD - WHITE-HOUSE_LAWN (3 SOFT WORDS: WHITE, HOUSE, LAWN)
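
The hard-word step can be sketched as follows, assuming hard words are delimited by underscores and camel-case boundaries. The function name `hard_words` is hypothetical and not taken from the talk. Soft words such as WHITE and HOUSE carry no explicit marker, so this step alone cannot find them.

```python
import re

def hard_words(identifier):
    """Split an identifier at explicit markers: underscores and camel-case humps."""
    words = []
    for chunk in identifier.split("_"):
        # Break before an uppercase letter that follows a lowercase letter or digit.
        words.extend(re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", chunk).split())
    return [w.lower() for w in words]

print(hard_words("whitehouse_lawn"))
print(hard_words("featureLocation"))
```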

Page 13: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

NORMALIZE ALGORITHM

STRLEN -> STRING LENGTH

Page 15: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

MACHINE TRANSLATION APPROACH

EL PAPA VISITA LA IGLESIA

EL -> THE

PAPA -> FATHER, POTATO, POPE

VISITA -> VISITS, VISITOR, HIT

LA IGLESIA -> THE CHURCH

STRONG COHESION (E.G., POPE, VISITS, CHURCH)

Page 20: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

NORMALIZE ALGORITHM

STRLEN

CANDIDATE SPLITS

STRLEN, S-TRLEN, ST-RLEN, STR-LEN, STRL_EN, STRLE_N,

S_T_RLEN, S-TR-LEN, S_TRL_EN, S_TRLE_N, ST_R_LEN, ST_RL_EN, ST_RLE_N, STR_L_EN, STR_LE_N, STRL_E_N,

S_T_R_LEN, S_T_RL_EN, S_T_RLE_N, S_TR_L_EN, S_TR_LE_N, S_TRL_E_N, ST_R_L_EN, ST_R_LE_N, ST_RL_E_N, STR_L_E_N,

S_T_R_L_EN, S_T_R_LE_N, S_TR_L_E_N, ST_R_L_E_N, S-T-R-L-E-N

E(ST) = {SET, STOP, STRING}

E(RLEN) = {RIFLEMEN}

WILDCARD EXPANSION: R*L*E*N*

E(STR) = {STEER, STRING}

E(LEN) = {LENDER, LENGTH}
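
A minimal sketch of the two ingredients on this slide: enumerating splits of a hard word and expanding a soft word with a wildcard pattern such as R*L*E*N*. The function names and the toy vocabulary are assumptions for illustration. The expansion sets on the slide come from the authors' own vocabulary and rules, which this sketch does not reproduce.

```python
import re
from itertools import combinations

def all_splits(word):
    """Yield every way to cut `word` into contiguous pieces (2**(n-1) splits)."""
    n = len(word)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]

def wildcard_expansions(soft_word, vocabulary):
    """Return vocabulary words matching the pattern c1*c2*...cn* built from soft_word."""
    pattern = re.compile("".join(re.escape(c) + ".*" for c in soft_word) + "$")
    return sorted(w for w in vocabulary if pattern.match(w))

# Toy vocabulary (hypothetical); a real run would use a large word list.
VOCAB = {"set", "stop", "string", "steer", "lender", "length", "riflemen"}

splits = list(all_splits("strlen"))
print(len(splits))                         # number of candidate splits
print(wildcard_expansions("rlen", VOCAB))  # words matching r*l*e*n*
print(wildcard_expansions("len", VOCAB))   # words matching l*e*n*
```

A six-letter word has five possible cut points, hence 2^5 = 32 splits in total, which is why the search space needs a cohesion score to pick among candidates.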

Page 26: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

NORMALIZE ALGORITHM PART I

STR: STRING VS STEER

COHESION A = (STRING, LENDER) + (STRING, LENGTH)

COHESION B = (STEER, LENDER) + (STEER, LENGTH)

1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS

2. SELECT EXPANSION THAT MAXIMIZES COHESION

STR -> STRING
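
The two steps of Part I can be sketched as below. The pair probabilities are invented for illustration, the `FLOOR` smoothing value for unseen pairs is an assumption, and the function names are hypothetical. The talk does not specify how probabilities are estimated beyond naming the co-occurrence data set.

```python
import math

# Hypothetical co-occurrence probabilities for word pairs (illustrative values only).
PAIR_PROB = {
    ("string", "length"): 1e-4,
    ("string", "lender"): 1e-9,
    ("steer", "length"): 1e-8,
    ("steer", "lender"): 1e-9,
}
FLOOR = 1e-12  # assumed smoothing value for pairs never seen together

def pair_log_prob(a, b):
    """Log probability of a word pair, looked up in either order."""
    return math.log(PAIR_PROB.get((a, b), PAIR_PROB.get((b, a), FLOOR)))

def cohesion(candidate, other_expansions):
    """Step 1: sum the log probabilities of the candidate with each other expansion."""
    return sum(pair_log_prob(candidate, other) for other in other_expansions)

def best_expansion(candidates, other_expansions):
    """Step 2: select the candidate expansion that maximizes cohesion."""
    return max(candidates, key=lambda c: cohesion(c, other_expansions))

print(best_expansion(["string", "steer"], ["lender", "length"]))
```

Summing logs rather than multiplying raw probabilities avoids numeric underflow when many very small probabilities are combined.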

Page 33: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

NORMALIZE ALGORITHM PART II

STR-LEN VS ST-RLEN

STR-LEN -> STRING LENGTH

ST-RLEN -> STOP RIFLEMEN

1. FIND COHESION OVER EXPANSIONS

2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION

STRLEN -> STRING LENGTH
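
A sketch of Part II under the same kind of assumptions: each candidate split arrives with the expansion already chosen for it, and the split whose expansion has the highest cohesion wins. The probabilities, the `FLOOR` value, and all names are hypothetical.

```python
import math

# Hypothetical pair probabilities (illustrative values only).
PAIR_PROB = {("string", "length"): 1e-4, ("stop", "riflemen"): 1e-10}
FLOOR = 1e-12  # assumed smoothing value for unseen pairs

def cohesion(words):
    """Sum of log probabilities over all pairs of expanded words."""
    total = 0.0
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            total += math.log(PAIR_PROB.get((a, b), PAIR_PROB.get((b, a), FLOOR)))
    return total

# Each split mapped to the expansion selected for it beforehand.
candidates = {
    "str-len": ["string", "length"],
    "st-rlen": ["stop", "riflemen"],
}

# Select the split whose expansion maximizes cohesion.
best_split = max(candidates, key=lambda s: cohesion(candidates[s]))
print(best_split, candidates[best_split])
```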

Page 39: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

ADDING CONTEXT

DIR

E(DIR) = {DIRECTION, DIRECTORY}

CONTEXT = {FORWARD, BACKWARD}

FIND COHESION WITH CONTEXT WORDS IN ADDITION TO EXPANSIONS OF OTHER SOFT WORDS

USED IN BOTH PART I AND PART II
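
How context can disambiguate a lone soft word such as DIR is sketched below. The probabilities are invented, `FLOOR` is an assumed smoothing value, and `cohesion_with_context` is a hypothetical name. How the context words are collected is not stated on the slide and is not modeled here.

```python
import math

# Hypothetical co-occurrence probabilities (illustrative values only).
PAIR_PROB = {
    ("direction", "forward"): 1e-5,
    ("direction", "backward"): 1e-5,
    ("directory", "forward"): 1e-8,
    ("directory", "backward"): 1e-9,
}
FLOOR = 1e-12  # assumed smoothing value for unseen pairs

def cohesion_with_context(candidate, context, other_expansions=()):
    """Sum log probabilities of the candidate with context words and other expansions."""
    words = list(context) + list(other_expansions)
    return sum(math.log(PAIR_PROB.get((candidate, w), FLOOR)) for w in words)

context = ["forward", "backward"]
choice = max(["direction", "directory"],
             key=lambda c: cohesion_with_context(c, context))
print(choice)
```

With no other soft words in the identifier, the context words are the only evidence, and here they favor "direction" over "directory".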

Page 44: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

NORMALIZE IMPLEMENTATION

USES GenTest TO SPLIT IDENTIFIERS

RETURNS MULTIPLE SPLITS

GOOGLE 5-GRAM DATASET

Page 45: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

EVALUATION

Program       LoC       SLoC      Unique Ids
which-2.20     3,670     2,293       487
a2ps-4.14     62,347    38,436     4,393

Program       Selected Ids   Hard Words   Soft Words
which-2.20    487            903          1,214
a2ps-4.14     211            459            618

Page 46: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

EVALUATION

THREE GROUPS OF IDENTIFIERS

STANDARD LIBRARY CALLS

NAMES FROM STANDARD HEADER FILES / KEYWORDS

DOMAIN NAMES

Program       Filtered Ids   Reported Ids
which-2.20    152            335
a2ps-4.14      46            166

Page 49: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

EXAMPLE EXPANSIONS

id          Top 10 Expansion    Top Expansion
nextchar    next_character      next_character
indfound    index_found_need    index_found
optarg      option_are_g        optarg
itemno      i_them_not          itemno

Page 50: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

RESEARCH QUESTIONS

WHAT IS THE OVERALL ACCURACY OF NORMALIZE?

DOES THE VOCABULARY USED HAVE A SIGNIFICANT IMPACT ON THE EXPANSION’S ACCURACY?

CAN THE EXPANDER INFORM THE SPLITTER?

CAN THE SPLITTER INFORM THE EXPANDER?

Page 51: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

ACCURACY ON DOMAIN IDS

Page 52: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

SOURCE OF EXPANSION WORDS

SOURCE CODE

INTERNAL DOCUMENTATION

MANUAL

Page 53: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

BEST VOCABULARY SOURCE?

Page 54: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

FUTURE WORK

EXPLORING DIFFERENT SOURCES OF CO-OCCURRENCE DATA

EXPLORING DIFFERENT WAYS OF CALCULATING PROBABILITIES

EXAMINING NORMALIZATION IN CONTEXT OF AN INFORMATION RETRIEVAL TASK

Page 55: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

SUMMARY

IDENTIFIERS ARE WRITTEN DIFFERENTLY THAN OTHER SOFTWARE DOCUMENTS

DEGRADES PERFORMANCE OF IR TECHNIQUES

NORMALIZE CURRENTLY EXPANDS ABOUT HALF OF SOFT WORDS CORRECTLY

Page 56: Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

QUESTIONS?

Need an identifier split? GenTest Splitter available at

splitit.cs.loyola.edu