

    Requirements Validation via Automated Natural Language Parsing

    SASTRY NANDURI AND SPENCER RUGABER

    SASTRY NANDURI was a graduate student in the College of Computing at Georgia Tech when the work described in this paper was performed. He has since graduated with a master's degree in computer science.

    SPENCER RUGABER is a Senior Research Scientist in the College of Computing at the Georgia Institute of Technology. Since receiving his Ph.D. from Yale University in 1978, Dr. Rugaber has worked in industry where he was responsible for the development of programming language tools, TEN/PLUS, an integrated environment for interacting with the UNIX operating system, and environments for improving software productivity. Since January 1988, Dr. Rugaber has been at Georgia Tech where he teaches courses in programming language and systems and software engineering. His research has concentrated on software engineering, specifically program understanding and user interface design. He is currently Vice Chair of the Reverse Engineering committee of the IEEE Technical Council on Software Engineering.

    ABSTRACT: Object-oriented analysis (OOA) has become a popular method for analyzing system requirements. Unfortunately, however, none of the current versions of OOA has included a validation technique tailored to the object-oriented approach. Most, instead, merely recommend document reviews without specifying the kinds of problems to look for. This paper explores the question by applying a natural language parser to a requirements document, extracting candidate objects, methods, and associations, composing them into an object model diagram, and then comparing the results with those determined by manual OOA. To do this, we have adapted an automated natural language parser and used it to examine several high-level system descriptions. The results indicate that, with a modest amount of effort, the technique can give valuable feedback to the analyst.

    KEY WORDS AND PHRASES: link grammars, natural language processing, object-oriented analysis, systems analysis.

    THE FIRST STEP IN MOST OBJECT-ORIENTED ANALYSIS METHODS is the construction of an object model. Many guidelines exist for identifying object classes, their relationships, and their attributes from a problem statement. Are these guidelines comprehensive enough for automating the construction process? Probably not, but can we instead use automatic analysis to generate a model against which a manually generated analysis can be validated? The best way to answer these questions is by actually trying to implement a program for generating an object diagram from a system description.

    Acknowledgment. The authors wish to thank BNR, Inc., for their gift in support of this research.

    Journal of Management Information Systems / Winter 1995-96, Vol. 12, No. 3, pp. 9-19. Copyright © 1996, M.E. Sharpe, Inc.

    There are several papers that discuss the possibility of automatic construction of an object model. Honiden et al. [6] developed a standardized, formal OOA specification process as a precursor to automation. Seki et al. [11] describe a process for incrementally deriving a formal specification from an informal specification. Abbott [1] details a method for generating a program design from an informal English description. All these papers only give suggestions as to how to automate the analysis process. We could find no description in the published literature of the actual implementation of an object model constructor. The purpose of this paper is to describe one such implementation.

    Traditional approaches, such as Structured Analysis, focus mainly on the functionality of a system. OOA, on the other hand, focuses on the objects or static entities of the system and the associations among them. A main reason for its popularity is the fact that a system designed around static entities is less affected by subsequent changes in the requirements than one organized functionally.

    Booch [2] was the first to formalize the object-oriented approach. Now there are several popular object-oriented methods such as OOA by Coad and Yourdon [4], object-oriented design (OOD) by Booch [3], and object modeling technique (OMT) by Rumbaugh et al. [10]. The methods have much in common. They all start with the detection of objects in the system by textual analysis of a specification document. After objects are identified, the system is modeled in terms of their attributes and the interactions among them. As originally proposed, nouns in the specification document are good indicators of objects. Similarly, verbs and adjectives are good signals for the associations among and attributes of the objects.

    We chose OMT because of its popularity and extensive documentation. Of the three OMT models (object, dynamic, and functional), we are only concerned with the object model since it is the most conducive to textual analysis. The approach recommended by Rumbaugh et al. for object modeling has the following steps:

    1. Identify objects and classes (nouns);
    2. Identify associations between objects (verb phrases);
    3. Identify attributes of objects and associations (adjectives);
    4. Identify operations (verbs and adjectives);
    5. Organize and simplify object classes using inheritance (see Note 1);
    6. Iterate and refine the model.

    Approach

    THE APPROACH WE TOOK TO AUTOMATING THE ANALYSIS PROCESS is as follows. We first collected a list of guidelines on object modeling from Rumbaugh et al. [10]. We expressed the guidelines in terms of the parsing rules for a publicly available natural language parser. We then applied these guidelines to the parser output of several high-level system descriptions and analyzed the results. After refining the parsing rules, we repeated the process for several other documents. Finally, we implemented a program that used the refined guidelines and a graph drawing tool to display object diagrams for the systems analyzed.

    More precisely, the steps we followed in building our analyzer were:

    1. We gathered guidelines for creating an object model and for identifying nouns, verbs, and the like, from the OMT text.

    2. We used a publicly available natural language parser to parse a system description document. We chose a link grammar-based parser because it was easily available, because it was able to parse a wide range of English sentences, and because its dictionary was easily extended.

    3. We wrote a rule-based postprocessor. These rules are nothing but the guidelines gathered in step 1 expressed in terms of the parser's links.

    4. We extended our tool to accumulate knowledge between sentences. The parser we used parsed each sentence of the input independently. So, for the construction of an object model from a multisentence document, we had to accumulate knowledge gained from each of the sentences. Since there is no foolproof way of connecting the pronouns in a sentence with the appropriate antecedent in the previous sentence, we also had to apply some ad hoc rules for doing this.

    5. Finally, we built and displayed the object model. For displaying the object diagram graphically, we used a publicly available graph drawing tool called EDGE [8]. EDGE is an easy to use, extendible graph editor. Graphs can be constructed either interactively or by creating an input file consisting of a list of nodes and edges. The program we developed for step 3 took the latter course and wrote the accumulated knowledge about the objects and associations into a file in a format understandable by EDGE.
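    Step 4 can be illustrated with a minimal Python sketch. The paper does not spell out its ad hoc pronoun rules, so the rule used here (a pronoun subject refers to the most recent non-pronoun subject) is only an assumption, as are the names PRONOUNS and accumulate and the representation of each sentence as a list of (subject, verb, object) triples.

```python
# Hypothetical pronoun list; the paper does not say which pronouns it handled.
PRONOUNS = {"he", "she", "it", "they"}


def accumulate(sentence_triples):
    """Merge per-sentence (subject, verb, object) triples into one model.

    Assumed ad hoc rule: a pronoun in subject position stands for the most
    recent non-pronoun subject seen in this or an earlier sentence.
    """
    classes = set()
    associations = []
    last_subject = None
    for triples in sentence_triples:
        for subj, verb, obj in triples:
            if subj in PRONOUNS:
                if last_subject is not None:
                    subj = last_subject
            else:
                last_subject = subj
            classes.update((subj, obj))
            associations.append((subj, verb, obj))
    return classes, associations


if __name__ == "__main__":
    sentences = [
        [("person", "becomes", "candidate")],
        [("he", "loses", "status")],
    ]
    print(accumulate(sentences))
```

    A rule this simple will mis-resolve pronouns whenever the antecedent is not the previous subject, which is consistent with the paper's remark that no foolproof method exists.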

    Natural Language Parsing Using Link Grammars

    The parser we used was developed by Daniel D. Sleator and Davy Temperley at Carnegie Mellon University [12]. This parser is based on the theory of link grammars. A link grammar consists of a set of words (the terminal symbols of the grammar), each of which has one or more linking requirements. A sequence of words is a sentence of the language defined by the grammar if there is a way to assign links among the words so as to satisfy the following three conditions:

    1. Planarity: the links do not cross;
    2. Connectivity: the links suffice to connect all the words of the sequence together;
    3. Satisfaction: the links satisfy the linking requirements of each word in the sequence.
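    The first two conditions can be checked mechanically once a linkage is written as pairs of word positions. The Python sketch below is illustrative only and is not the parser's own code; the function names and the (left, right) index-pair representation are assumptions made for the example.

```python
from itertools import combinations


def is_planar(links):
    """Return True if no two links cross when drawn above the sentence."""
    for (a, b), (c, d) in combinations(links, 2):
        a, b = sorted((a, b))
        c, d = sorted((c, d))
        # Two arcs cross when exactly one end of one arc lies strictly
        # inside the span of the other.
        if a < c < b < d or c < a < d < b:
            return False
    return True


def is_connected(num_words, links):
    """Return True if the links join all word positions into one component."""
    if num_words == 0:
        return True
    neighbours = {i: set() for i in range(num_words)}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == num_words


if __name__ == "__main__":
    # Word positions for "the company pays the employees": 0..4
    linkage = [(0, 1), (1, 2), (2, 4), (3, 4)]
    print(is_planar(linkage), is_connected(5, linkage))
```

    The third condition, satisfaction, depends on the dictionary entry of each word and is not covered by this sketch.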

    The linking requirements of each word are contained in a dictionary where they are expressed as a formula involving the operators and and or, parentheses, and connector names. The + or - suffix on a connector name indicates the direction relative to the word being defined in which the matching connector must lie. For example, the linking requirements for the words "Mary" and "ran" are shown below:

    Mary: O- or S+
    ran: S-

    That is, "Mary" can act as a direct object if a corresponding verb is on the left or as a subject if a corresponding verb is on the right. "Ran" expects a subject on its left. The linking requirements are satisfied in the sentence "Mary ran" but not in the sentence "ran Mary." Hence, the latter is not accepted by the parser.
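    The directional matching in this two-word example can be sketched in Python as follows. This is a deliberate simplification: each entry is treated as a list of alternative single connectors, whereas real link grammar formulas also combine connectors with "and". The names DICTIONARY and find_link are hypothetical.

```python
# Simplified dictionary: each word maps to a list of alternative connectors.
DICTIONARY = {
    "Mary": ["O-", "S+"],  # Mary: O- or S+
    "ran": ["S-"],         # ran: S-
}


def find_link(left_word, right_word, dictionary=DICTIONARY):
    """Return the connector name linking two adjacent words, or None."""
    for left_conn in dictionary[left_word]:
        for right_conn in dictionary[right_word]:
            # A '+' connector on the left word must meet a '-' connector
            # of the same name on the right word.
            if (left_conn.endswith("+") and right_conn.endswith("-")
                    and left_conn[:-1] == right_conn[:-1]):
                return left_conn[:-1]
    return None


if __name__ == "__main__":
    print(find_link("Mary", "ran"))  # the words link through S
    print(find_link("ran", "Mary"))  # no link: "ran" offers no '+' connector
```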

    The output of the parser consists of all the words along with the links satisfying the linking requirements. We found that the connectors used to express the linking requirements (O and S in the example above) are useful in identifying the nouns and verb phrases of a sentence and, thereby, the object classes and their associations. For example, the output upon running the parser on the sentence "Mary ran" consists of the following connectors:

    +--S--+
    |     |
    Mary  ran

    This output indicates that the word "Mary," which is on the left side of the link, has the connector S+ and that the word "ran" has the connector S-. We can gather by looking at the S+ connector that "Mary" is the subject in the above sentence. Likewise, "ran" is identified as a verb because it has the connector S-.

    For a better understanding of how the output of the parser can be used to identify the classes, relationships, and so on, consider the following sentence: "The company pays the employees." The parser's output for this sentence is:

                      +------O-----+
    +--D--+-----S-----+      +--D--+
    |     |           |      |     |
    the   company.n   pays   the   employees.n

    "Company" is recognized as a noun (the ".n" suffix); "the" is a determiner; "pays" is a verb expecting a subject and a direct object, with "the employees" serving the latter role.

    By using the guideline that any sentence of the form Subject -S- verb -O- Object implies that the classes Subject and Object have the association named by the verb, we can infer that company and employees are classes and that pays is an association between them. Most of the guidelines for object identification can be expressed in terms of links in this way.
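    The Subject -S- verb -O- Object guideline can be sketched as a small Python function. The representation of the parser output as (left word, link label, right word) triples is an assumption made for illustration, as are the function names; the authors' postprocessor was written in C and worked on the parser's internal structures.

```python
def strip_tag(word):
    """Drop a part-of-speech suffix such as '.n' from a parser word."""
    return word.split(".")[0]


def svo_associations(links):
    """Return (subject, verb, object) for every verb with an S and an O link.

    Each link is a (left_word, label, right_word) triple. Words are used as
    keys, so a word that occurs twice in a sentence is not distinguished;
    a fuller version would key on word positions instead.
    """
    subjects = {}  # verb -> subject (S link: subject on the left of the verb)
    objects = {}   # verb -> object  (O link: object on the right of the verb)
    for left, label, right in links:
        if label == "S":
            subjects[right] = left
        elif label == "O":
            objects[left] = right
    return [
        (strip_tag(subjects[verb]), strip_tag(verb), strip_tag(objects[verb]))
        for verb in subjects
        if verb in objects
    ]


if __name__ == "__main__":
    links = [
        ("the", "D", "company.n"),
        ("company.n", "S", "pays"),
        ("pays", "O", "employees.n"),
        ("the", "D", "employees.n"),
    ]
    print(svo_associations(links))
```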

    Object and Association Detection Guidelines

    Rumbaugh et al. [10] give some practical tips on how to find classes, associations, and attributes in a problem statement. They also suggest how to eliminate bad classes, associations, and the like. Both these and the other guidelines in the literature are very general. A guideline of the form "names that primarily describe individual objects should be restated as attributes," though helpful for a human designer, cannot easily be incorporated into program logic. The main problem when trying to incorporate such knowledge is how the program can find out which names describe individual objects. Most of our effort in working on this project was spent on expressing the existing guidelines in terms of the connectors. That is, we transformed the guidelines into precise rules in terms of the parser's output.

    Examples of the rules that we used to implement our program are given below. For every word in the input sentence, the program checks if any of the guidelines are satisfied. If this is the case, then the corresponding inference is made.

    1. If a word is "with" and if it has J (preposition) and M (participle) connectors, then the class described by the word with the M connector is an aggregation (see Note 2) of the class described by the word with the J connector. Example: A sentence containing "building with floors . . ." indicates that building is an aggregation of floors.

    2. For every verb, if there is an EV (verb used with prepositional phrases) connector, get the J connector of the EV connector, if it exists. Or if there is a V connector, and the V connector has an I connector and the I (indirect object) connector has a TO (infinitive) connector and the TO connector has an S connector, get the final S connector. If either of the above-mentioned connectors exists, the two words are candidate classes, and they have an association named by the verb. Example: A sentence of the form "A system is to be installed in a building" indicates that system and building have an association installed.

    3. If a verb has a V connector followed by an S connector, get the S connector. Also get the J connector following the EV connector. These words are classes, and they have the association given by the verb. Example: A sentence containing "candidate is fired by the company" indicates that there is an association fired between candidate and company.

    4. If a verb is "becomes," then the O connector is a state (attribute) of the S connector. Example: A sentence containing "person becomes candidate" indicates that candidate (candidacy) is an attribute of the class person.
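    Two of the rules above (the "with" rule and the "becomes" rule) might be expressed over link triples as in the Python sketch below. It assumes the connector names J, M, S, and O, a (left word, label, right word) representation of links, and the hypothetical function name apply_rules; it is not a transcription of the authors' C postprocessor.

```python
def apply_rules(links):
    """Apply two sample guidelines to (left_word, label, right_word) links."""
    inferences = []
    for left, label, right in links:
        # "with" rule: <whole> -M- with -J- <part> suggests an aggregation.
        if label == "M" and right == "with":
            parts = [b for a, lab, b in links if lab == "J" and a == "with"]
            inferences += [f"{left} is an aggregation of {p}" for p in parts]
        # "becomes" rule: <class> -S- becomes -O- <state> suggests a state
        # (attribute value) of the subject class.
        if label == "S" and right == "becomes":
            states = [b for a, lab, b in links if lab == "O" and a == "becomes"]
            inferences += [
                f"{s} is an attribute value of class {left}" for s in states
            ]
    return inferences


if __name__ == "__main__":
    print(apply_rules([("building", "M", "with"), ("with", "J", "floors")]))
    print(apply_rules([("person", "S", "becomes"), ("becomes", "O", "candidate")]))
```

    Keeping each guideline as an independent test on the links makes it easy to add or refine rules one at a time, which matches the iterative refinement described in the Approach section.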

    The Postprocessor

    The postprocessor that we developed applies the guidelines to the parser output and produces the list of objects, their attributes, and the associations among them. The postprocessor is written in the C programming language and is integrated into the parser code. We took this approach because the number of guidelines that we had was not very large. If the number of guidelines becomes larger, it may prove advantageous to separate the postprocessor code from the parser code.


    The postprocessor also makes use of a file of synonyms while processing the output of the parser. The synonyms are used for two purposes:

    • To avoid creating redundant objects for synonymous nouns in a document; and
    • To recognize plurals (e.g., "companies," "company") and different tenses of verbs ("fire," "fires," "fired," etc.) as the same (see Note 3).

    After all the sentences of the input document are parsed and processed, the postprocessor writes the gathered information into two files: Output_file and "graph." Output_file, the name of which is specified by the user, contains the results in English. That is, it will have statements like "building is an aggregation of floors." The file with the name "graph" has the same information written in a form understandable by EDGE.
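    The role of the synonym file can be illustrated with a short Python sketch. The layout of the real synonym file is not described in the paper, so the in-memory mapping SYNONYMS and the helper names below are hypothetical; the "word (base)" rendering follows the style of Table 1.

```python
# Hypothetical synonym table: surface form -> canonical form.
SYNONYMS = {
    "companies": "company",
    "persons": "person",
    "fires": "fire",
    "fired": "fire",
}


def canonical(word, synonyms=SYNONYMS):
    """Map a plural, tense variant, or synonym onto one canonical form."""
    lowered = word.lower()
    return synonyms.get(lowered, lowered)


def describe(word, synonyms=SYNONYMS):
    """Render a word with its canonical form in parentheses when they differ."""
    base = canonical(word, synonyms)
    return word if base == word.lower() else f"{word} ({base})"


if __name__ == "__main__":
    print(describe("companies"))
    print(canonical("fired") == canonical("fires"))
```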

    Example

    THIS SECTION DESCRIBES THE ACTION OF THE SYSTEM in analyzing a description of a small database system for an employment agency. It is taken verbatim from [7].

    The Original Description

    Persons apply for positions, companies subscribe by offering positions, and companies hire candidates or fire employees. A person may apply only once, thus becoming a candidate, losing this status when hired by a company but regaining it if fired; a company may subscribe several times, the positive number of offerings being added up; finally, only persons that are currently candidates may be hired, and only by companies that have vacant positions. There are queries to check whether a person is a candidate, for finding out the company a person works for (provided that the person is not a candidate), and for finding out the number of vacant positions a company still has (provided that the company has ever subscribed). Initially, no person is a candidate and no company has subscribed.

    Modified System Description

    The link grammar parser has difficulty parsing certain constructs, requiring manual modification of the input text. Our intent is to only make modifications dependent on the parser and not on domain knowledge required to understand the text. After modification, the text reads as follows:

    Persons apply for positions, companies subscribe by offering positions, and companies hire candidates or fire employees. A person may apply only one time. The person becomes a candidate when he applies. A person loses his status as a candidate when hired by a company. A person becomes a candidate again if he is fired by the company. A company may subscribe several times. Only persons that are still candidates can be hired by companies. Only companies that have vacant positions can hire candidates. There is a query to check if a person is a candidate. There is a query to find the company a person works for. There is a query to find out the number of vacant positions at a company. Initially, no person is a candidate and no company has subscribed.


    Program Output

    The program generates the output shown in Table 1, where associations are indicated by triples of the form Object Class - Association - Object Class; operations and attributes are indicated by stating their name and the class to which they belong; inheritance and aggregation associations are suggested; and synonyms are placed in parentheses. Note also that the parser may produce duplicate suggestions, which have been manually removed from Table 1.
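    The duplicate removal, which the authors did by hand, could plausibly be automated. The Python sketch below is not part of the original tool; it keeps the first occurrence of each suggestion and preserves the order in which suggestions were produced. The function name is hypothetical.

```python
def remove_duplicates(suggestions):
    """Drop repeated suggestions while keeping first-seen order.

    Suggestions may be strings or tuples; they only need to be hashable.
    """
    # dict preserves insertion order, so its keys are the unique items
    # in the order they first appeared.
    return list(dict.fromkeys(suggestions))


if __name__ == "__main__":
    raw = [
        ("person", "apply", "time"),
        ("company", "hire", "person"),
        ("person", "apply", "time"),
    ]
    print(remove_duplicates(raw))
```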

    Object Diagram

    The results of running the parser are placed in a file that is then fed to the EDGE graph drawing tool. For the input text given above, the diagram shown in Figure 1 is produced. Classes are contained in rectangles; arcs denote associations. Classes contain three parts: the class name, its attributes, and its operations. Note that there are several ways in which the diagram could be easily improved. For example, an association that holds between one class and a second as well as between the first class and a subclass of the second could be replaced by a single association. Also, situations where an operation and an association on a class have the same name could be detected and resolved.
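    The first suggested improvement, replacing an association to a class and to one of its subclasses by a single association, might look like the following Python sketch. It was not implemented in the tool described here; the function name, the triple representation, and the superclass_of mapping are assumptions, and only the target side of each association is examined.

```python
def collapse_subclass_associations(associations, superclass_of):
    """Drop an association to a subclass when the same association already
    holds with that subclass's superclass.

    associations: list of (source, name, target) triples.
    superclass_of: mapping from a subclass name to its direct superclass.
    """
    present = set(associations)
    kept = []
    for source, name, target in associations:
        parent = superclass_of.get(target)
        if parent is not None and (source, name, parent) in present:
            continue  # covered by the association with the superclass
        kept.append((source, name, target))
    return kept


if __name__ == "__main__":
    associations = [
        ("company", "hire", "person"),
        ("company", "hire", "candidate"),
        ("person", "apply", "position"),
    ]
    print(collapse_subclass_associations(associations, {"candidate": "person"}))
```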

    Results

    WE APPLIED THE PARSER TO FOUR HIGH-LEVEL SYSTEM DESCRIPTIONS commonly chosen as examples in software engineering textbooks.

    Helicopter Landing

    The helicopter system description has been taken from [5]. Before running the parser on it, we had to rephrase some of the sentences as simple sentences. There was not much information that we could gather from the parser's output. The main problem was with the system description itself, which described the history of the problem rather than stating the requirements. We feel that even manual construction of an object diagram from this description is not possible.

    Automatic Teller Machine (ATM)

    We obtained another system description from chapter 8 of reference [10]. Here, the problem was stated very clearly, probably because it was used to illustrate the construction of an object model. The result of applying our rules to this description was encouraging. The resulting object diagram was reasonably close to that produced by hand. There were some differences, but these were all minor. For example, several extra classes such as system and cost were suggested. Our approach produced them because recognizing these classes as bad classes required domain knowledge, which neither the parser nor the postprocessor had.


    Table 1. Parser Output for Employment Database Example

    persons (person) - apply - positions
    companies (company) - hire - candidates (candidate)
    companies (company) - fire - employees (employee)
    person - apply - time
    candidate is an attribute value of class person
    becomes (become) is an operation of class person
    applies (apply) is an operation of class he (person)
    person - loses (lose) - status
    loses (lose) is an operation of class person
    company - hired (hire) - person
    candidate is an attribute value of class person
    he (person) - fired (fire) - company
    company - subscribe - times
    still is an attribute value of class candidates (candidate)
    hired (hire) is an operation of class persons (person)
    vacant is an attribute value of class positions
    hire is an operation of class companies (company)
    class candidate can be a subclass of class person
    subscribed (subscribe) is an operation of class company

    The Lift System

    The lift system description is taken from [9]. Most of the differences between the object diagram produced using our approach and that produced manually were due to the inadequacy of the parser in capturing some aspects of English grammar, and the rules that we used were not powerful enough to offset the parser's weakness. Here is an example of the type of problem we experienced. In the sentence, "Each lift has a set of buttons, one for each floor," ideally the parser should have recognized that the word "one" refers to a button. From the parser's output, we could not derive a general rule for recognizing the classes and associations correctly in a sentence of this form.

    Employment Database

    The results of applying our approach to this system description were also encouraging. We realized, however, that the rules approach to finding objects and associations has some drawbacks. We found that some guidelines cannot easily be expressed with rules. For example, from the sentence, "there is a query to check if a person is a candidate," our program identified "query" as an object instead of an operation, which would be desirable if we were building a general database management system but is suboptimal for a small, special-purpose program. Furthermore, because of the way the rule is defined, our program would have identified "way" as an object in the following sentence: "There is a way to check if a person is a candidate."

    Figure 1. Object Diagram for Employment Database Example (diagram not reproduced; legible labels: company, subscribe, hire, apply, become, lose, time, status)

    Conclusions and Future Work

    WITH A RELATIVELY SMALL AMOUNT OF WORK (about three weeks and under 800 lines of code), we were able to build a tool capable of producing object diagrams that could be compared with those produced by hand. Among the uses of the resulting diagram would be detection of missing classes, suggestion of alternative modeling choices (attribute versus class or operation versus association), and discovery of missing associations.
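    The comparison between a generated and a hand-built model comes down to set differences over classes (or associations). The Python sketch below shows the idea; the function name and the sample class sets are illustrative only and do not reproduce the models from the paper.

```python
def compare_models(generated, manual):
    """Report items found in only one of two sets of classes or associations."""
    generated, manual = set(generated), set(manual)
    return {
        "only_in_generated": sorted(generated - manual),
        "only_in_manual": sorted(manual - generated),
        "in_both": sorted(generated & manual),
    }


if __name__ == "__main__":
    # Illustrative class names only.
    generated_classes = {"bank", "atm", "system", "cost"}
    manual_classes = {"bank", "atm", "account"}
    for key, items in compare_models(generated_classes, manual_classes).items():
        print(key, items)
```

    Items found only in the generated model are candidates for extra or bad classes; items found only in the manual model point to knowledge that the text did not state explicitly.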


    Limitations

    As described in the previous section, the object diagrams generated by our approach were not completely satisfactory. The failure in producing completely acceptable object diagrams is due to the following factors:

    • Parser inadequacy: While the breadth of English text that the parser accepts is quite impressive, it still cannot handle many sentences. For example, the parser does not accept hyphenated words, idiomatic expressions, or quotation marks, and it cannot connect a pronoun with the corresponding noun. We have overcome these problems to some extent by rephrasing some input sentences in a form acceptable to the parser. For the process to be completely automated, the parser should be powerful enough to accept all types of sentences.

    • Ambiguous or incomplete descriptions: Sentences of the form "ATM accepts cash cards" can be difficult to deal with when the reader is a computer. Where does the ATM accept the cash card from? While an intelligent human understands this from the context, it is difficult for a program to do the same.

    • Lack of domain knowledge: In the ATM document a lot of domain knowledge was required to generate the object diagram. Knowledge such as "bank holds account" and "customers have cash cards" was assumed and not explicitly mentioned in the text. A practical automated analysis tool will need domain knowledge and common sense. There are two approaches to the solution of this problem. We can incorporate the domain knowledge in the parser, or we can write the system description without assuming any domain knowledge on the part of the user. Both approaches have practical difficulties.

    • Inadequacy of guidelines: Most of the rules we derived work for general cases, but they cannot handle special situations. For example, consider the following sentence from the employment database system description: "There is a query to check if a person is a candidate." We can make query a class or an operation of the class person. A human needs to look at the overall structure of the object diagram and use his or her experience to decide whether to make query an object or an operation. Since all our rules are based on the structure of the sentence, such decisions cannot be made by the program.

    Future Work

    As noted earlier, a program that will automatically produce a perfect object diagram from a system description document is not currently possible. But our program can be used to validate a manually produced object diagram and to generate a preliminary object diagram that can be refined by human designers. The following enhancements suggest themselves as future directions for our research:

    • Testing the tool on real-life (lower-level) documents, particularly those with a specialized domain vocabulary.

    • Comparing the results of our semiautomatic validation with those produced in a design review, both for thoroughness and cost-effectiveness.

    • Testing the program with a wide range of documents and refining the guidelines appropriately.

    • Using a parser other than the link grammar parser to see if some of the parsing limitations can be overcome.

    NOTES

    1. Inheritance is a special kind of association indicating that one object class is a special case of another.

    2. Aggregation is a special kind of association that indicates that objects in one class can be considered as part of objects in the other.

    3. An alternative is a root word extractor such as that used by the UNIX "spell" command.

    REFERENCES

    1. Abbott, R.J. Program design by informal English descriptions. Communications of the ACM, 26, 11 (November 1983), 882-894.

    2. Booch, G. Object-oriented development. IEEE Transactions on Software Engineering, 12, 2 (February 1986).

    3. Booch, G. Object-Oriented Design with Applications. Redwood City, CA: Benjamin/Cummings, 1991.

    4. Coad, P., and Yourdon, E. Object-Oriented Analysis, 2d ed. Englewood Cliffs, NJ: Yourdon Press, 1991.

    5. Davis, A.M. Software Requirements: Objects, Functions, and States. Englewood Cliffs, NJ: Prentice-Hall, 1993.

    6. Honiden, S.; Kotaka, N.; and Kishimoto, Y. Formalizing specification modeling in OOA. IEEE Software, 10, 1 (January 1993), 54-66.

    7. Jackson, M.I. Developing Ada programs using the Vienna Development Method (VDM). Software-Practice and Experience, 15, 3 (March 1985), 305-318.

    8. Paulish, F.N., and Tichy, W. EDGE: an extendible graph editor. Software-Practice and Experience, 20, S1 (June 1990), 63-86.

    9. Proceedings of the Fourth International Workshop on Software Specification and Design. April 3-4, 1987, Monterey, California, IEEE Computer Society, 1993.

    10. Rumbaugh, J.; Blaha, M.; Premerlani, W.; Eddy, F.; and Lorensen, W. Object-Oriented Modeling and Design. Englewood Cliffs, NJ: Prentice-Hall, 1991.

    11. Seki, M.; Horai, H.; and Enomoto, H. Software development process from natural language specification. Eleventh International Conference on Software Engineering, May 1989.

    12. Sleator, D.D., and Temperley, D. Parsing English with a link grammar. Carnegie Mellon University, Department of Computer Science, March 1992.