Transcript
  • Contributions to AI4FM 2013, the 4th International Workshop on Artificial Intelligence for Formal Methods

    by Gudmund Grov, Ewen Maclean & Leo Freitas (Eds.)

  • Preface

    This report contains the contributions to the 4th International Workshop on Artificial Intelligence for Formal Methods (AI4FM 2013), held as a satellite event of the 4th conference on Interactive Theorem Proving (ITP 2013) in Rennes, France, on July 22nd 2013. Previous AI4FM workshops have been held in Newcastle (2010), Edinburgh (2011) and Schloss Dagstuhl (2012). The workshop consisted of one invited talk, Sharing the Burden of (Dis)proof with Nitpick and Sledgehammer by Jasmin Blanchette, and 9 regular talks. This report contains the abstract for each regular talk:

    • The social machine of mathematics by Ursula Martin

    • Verifying the heap: an AI4FM case study by Leo Freitas, Cliff Jones, Andrius Velykis and Iain Whiteside

    • Applying machine learning to setting time limits on decision procedure calls by Zongyan Huang and Lawrence Paulson

    • Arís: Analogical Reasoning for reuse of Implementation & Specification by Mihai Pitu, Daniela Grijincu, Peihan Li, Asif Saleem, Rosemary Monahan and Diarmuid O'Donoghue. This abstract was separated into 2 talks:

      • Source Code Matching for reuse of Formal Specifications by Daniela Grijincu

      • Source Code Retrieval using Case Based Reasoning by Mihai Pitu

    • Learning Domain-Specific Guidance for Theory Formation by Jeremy Gow

    • Drawing Proof Strategies by Gudmund Grov, Aleks Kissinger, Yuhui Lin, Ewen Maclean and Colin Farquhar

    • Analogical Lemma Speculation by Ewen Maclean

    • Theory Exploration for Interactive Theorem Proving by Moa Johansson

    Gudmund Grov, Ewen Maclean and Leo Freitas (workshop organisers)


  • Applying machine learning to setting time limits on decision procedure calls

    Zongyan Huang
    [email protected]

    Lawrence C. Paulson
    [email protected]

    The University of Cambridge Computer Laboratory, William Gates Building, 15 JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom

    Abstract: MetiTarski is an automatic theorem prover which can prove inequalities involving special functions. It combines a resolution theorem prover (Metis) with a set of axioms and a collection of decision procedures for the theory of real closed fields (RCF). During MetiTarski's proof search, it generates a series of RCF subproblems which are reduced to true or false using an RCF decision procedure. Many of the RCF subproblems do not contribute towards MetiTarski's final proof, and time spent deciding their satisfiability is wasted. Setting a time limit on RCF calls reduces the wasted time, but too tight a limit could affect MetiTarski's proof. Machine learning has been applied to the problem-dependent selection of the best time limit setting for RCF decision procedure calls in the theorem prover MetiTarski.

    1 Introduction

    MetiTarski [1] is designed to automatically prove universally quantified logical statements involving special functions such as logarithms, sines, cosines, etc. MetiTarski works by eliminating special functions, substituting rational function upper or lower bounds, transforming parts of the problem into polynomial inequalities, and finally applying a decision procedure for the theory of RCF. The theory of RCF concerns (possibly quantified) boolean combinations of polynomial equations and inequalities over the real numbers.

    In MetiTarski, RCF decision procedures are used to simplify clauses by deleting literals that are inconsistent with other algebraic facts. A clause is a disjunction of literals, where a literal is an atomic formula or its negation. RCF decision procedures are also used to discard redundant clauses that follow algebraically from other clauses [2]. In MetiTarski's proof search, the RCF tests typically dominate the overall processor time, while the time spent in resolution is much smaller. Only unsatisfiable RCF subproblems contribute to a MetiTarski proof. If the RCF time limit is too long, some problems may not be solved because the RCF decision procedure wastes too much time trying to prove the satisfiability of irrelevant RCF subproblems. However, if the RCF time limit is too short, important clause simplification steps may be missed when literal deletion RCF calls time out.

    Machine learning [3] uses statistical methods to infer information from training examples, which it can then apply to new problems. The Support Vector Machine (SVM) is a method used for classification or regression in the field of machine learning. SVMs are widely used because they can deal efficiently with high-dimensional data, and are flexible in modelling diverse sources of data [9].

    2 Methodology

    Two RCF decision procedures, Z3 [5] and Mathematica [10], were used in the experiment. Machine learning was applied to select which decision procedure to use and to set the best time limit on RCF calls for a given MetiTarski problem. In each individual run, MetiTarski called Z3 or Mathematica with various time limits on RCF calls: 0.1s, 1s, 10s, 100s. A time limit of 100 CPU seconds was set for each proof attempt.

    The experiment was done on 826 MetiTarski problems in a variant of the TPTP format. The data was randomly split into three subsets, with approximately half of the problems placed in a learning set, a quarter put in a validation set used for kernel selection and parameter optimization, and the final quarter retained in a test set used for judging the effectiveness of the learning. For each MetiTarski problem, the best setting to use was determined by running Z3 or Mathematica with various RCF time limits and choosing the one which gave the fastest overall runtime. There were three main tasks involved in the experiment.

    The first task was to identify features of the MetiTarski problems. Each MetiTarski problem was characterised by a vector of real numbers, or features. Each vector of features was associated with label +1 (positive examples) or -1 (negative examples), indicating in which of two classes it was placed. Taking the setting of Z3 with a 1s RCF time limit as an example, a corresponding learning set was derived with each problem labelled +1 if Z3 with a 1s RCF time limit found a proof and was the fastest to do so, or -1 if it failed to find a proof or was not the fastest. The features used for the current experiment were of two types. The first type relates to the various special functions: log, sin, cos, etc.; each such feature represents how complex the expressions containing that special function are. For example, the feature value relating to the log function was equal to 0 if log did not appear in the given problem, or equal to 0.5 if log appeared and was only applied to a linear expression. All other occurrences of log were regarded as complicated and the log feature value was set equal to 1. The other type of feature relates to the number of variables in the given MetiTarski problem, distinguishing whether this number was 0, 1, or 2 or more. Having more variables in a problem will generally increase the difficulty of the proof search.
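    As a hedged illustration of this feature scheme, the sketch below computes such a vector for a problem given as a plain string. The list of special functions and the "linear argument" test are simplifying assumptions for illustration only; the paper does not spell out its exact criteria.

    ```python
    import re

    # Hypothetical sketch of the two feature types described above; the
    # function list and linearity check are assumptions, not the paper's code.
    SPECIAL_FUNCTIONS = ["log", "sin", "cos", "exp", "tan"]

    def function_feature(problem: str, fn: str) -> float:
        """0 if fn is absent, 0.5 if only applied to linear arguments, else 1."""
        # Capture flat (non-nested) arguments of each call, e.g. log(2*x + 1).
        args = re.findall(rf"\b{fn}\(([^()]*)\)", problem)
        if not args:
            return 0.0
        # Crude stand-in for linearity: the argument contains no powers.
        linear = all("^" not in a and "**" not in a for a in args)
        return 0.5 if linear else 1.0

    def variable_feature(num_vars: int) -> float:
        """Bucket the variable count as 0, 1 or 2 (meaning 'two or more')."""
        return float(min(num_vars, 2))

    def feature_vector(problem: str, num_vars: int) -> list:
        return [function_feature(problem, f) for f in SPECIAL_FUNCTIONS] + \
               [variable_feature(num_vars)]
    ```

    A problem such as `log(2*x + 1) < x` with 3 variables would then map to the vector [0.5, 0, 0, 0, 0, 2].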

    The second task was to select the best kernel function and parameter values. SVM-Light [6] is an implementation of SVMs in C, and was used to do the classification in this work. The accuracy of its model is largely dependent on the selection of the kernel functions and parameter values. In machine learning, the F1-score [7] is often used to evaluate the performance of a classifier; it reaches its best value at 1 and its worst at 0. A number of different kernel functions, as well as variations of parameter values, were tested using the learning and validation sets. Following the completion of this task, the best kernel function and model parameters for F1-score maximization were selected for each classifier. The generated SVM models were applied to the test set to output the margin values [4], where the margin is a measure of the distance of the closest sample to the decision line.
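    For reference, the F1-score used above is the harmonic mean of precision and recall. A minimal stdlib sketch for the binary +1/-1 labelling used in this experiment:

    ```python
    # Minimal sketch of the F1-score for +1/-1 labels: the harmonic mean of
    # precision and recall, ranging from a worst value of 0 to a best of 1.
    def f1_score(truth, predicted):
        pairs = list(zip(truth, predicted))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
        fp = sum(1 for t, p in pairs if t == -1 and p == 1)  # false positives
        fn = sum(1 for t, p in pairs if t == 1 and p == -1)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)
    ```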

    The final task was to combine the models for the different RCF time settings and compare the margin values produced by their classifiers. The margin values for each classifier were compared to predict which setting is the best for each problem. In an ideal case, only one classifier would return a positive result for any problem, and selecting the best setting would just be a case of observing which classifier returns a positive result. However, in practice, more than one classifier will return a positive result for some problems, while no classifier may return a positive result for others. Thus, the relative magnitudes of the classifiers' margins were considered in the experiment: the classifier with the most positive (or least negative) margin was selected.
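    This combination step can be sketched as follows; the setting names and margin values are invented for illustration, not taken from the experiment.

    ```python
    # Sketch of the final task above: each (prover, RCF time limit) setting
    # has its own classifier, and the setting whose classifier reports the
    # most positive (or least negative) margin is predicted to be best.
    def select_setting(margins):
        """margins maps a setting name to the margin its classifier produced."""
        return max(margins, key=margins.get)

    # Illustrative margins for one problem: two classifiers are positive, so
    # the relative magnitudes decide between them.
    margins = {
        "Z3, 0.1s": -0.42,
        "Z3, 1s": 0.18,
        "Z3, 10s": 0.31,
        "Mathematica, 1s": -1.10,
    }
    best = select_setting(margins)
    ```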

    3 Results

    The experiment was done on 826 MetiTarski problems. The data was randomly split into a learning set (414 problems), a validation set (204 problems) and a test set (208 problems). The total number of problems proved out of the 208 testing problems was used to measure the efficacy of the machine-learned selection process. The learned selection was compared with fixed RCF time settings. The machine-learned selection algorithm performed better on our benchmark set than any of the individual fixed settings used in isolation. The selection results are as follows:

    Setting                              Number of problems proved (of 208)
    Machine learned selection            174
    Z3 (0.1s RCF time limit)             161
    Z3 (1s RCF time limit)               167
    Z3 (10s RCF time limit)              170
    Z3 (100s RCF time limit)             169
    Mathematica (0.1s RCF time limit)    138
    Mathematica (1s RCF time limit)      159
    Mathematica (10s RCF time limit)     164
    Mathematica (100s RCF time limit)    163

  • 4 Future work

    Future work remains to be undertaken concerning the extension of the heuristic selection to choosing option settings within the decision procedures. For example, Mathematica has many configurable options, which gives us many possible heuristics to investigate.

    Passmore et al. [8] have shown that, through detailed analysis of the RCF subproblems generated during MetiTarski's proof search, specialised variants of RCF decision procedures can greatly outperform general-purpose methods on restricted classes of formulas. It would also be interesting to use machine learning to select the best stored models for assisting the proof of new RCF subproblems.

    Exploring more feature types relevant to the classification processes is also essential for improving the efficacy of the machine-learned selection process. After having defined a set of features, it is instructive to perform feature selection by removing features that do not contribute to the accuracy of the classifier. The results of the experiment will be analysed in order to find which measured features make a significant contribution to the learning and classification processes. This analysis will help to find why some heuristics work better with some types of problems and would also help with heuristic development.

    References

    [1] Behzad Akbarpour and Lawrence Paulson. Extending a resolution prover for inequalities on elementary functions. In Logic for Programming, Artificial Intelligence, and Reasoning, pages 47–61. Springer, 2007.

    [2] Behzad Akbarpour and Lawrence C. Paulson. MetiTarski: An automatic theorem prover for real-valued special functions. J. Autom. Reasoning, 44(3):175–205, 2010.

    [3] Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2004.

    [4] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

    [5] Leonardo de Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008.

    [6] Thorsten Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.). MIT Press, 1999.

    [7] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, pages 377–384. ACM, 2005.

    [8] Grant Passmore, Lawrence Paulson, and Leonardo de Moura. Real algebraic strategies for MetiTarski proofs. In Intelligent Computer Mathematics, pages 358–370. Springer, 2012.

    [9] Bernhard Schölkopf, Koji Tsuda, and Jean-Philippe Vert. Kernel Methods in Computational Biology. MIT Press, 2004.

    [10] Stephen Wolfram. The Mathematica Book, Version 4. Cambridge University Press, 1999.

  • Arís¹: Analogical Reasoning for reuse of Implementation & Specification

    Pitu M., Grijincu D., Li P., Saleem A., Monahan R., O'Donoghue D.P.

    Department of Computer Science, NUI Maynooth, Co. Kildare, Ireland.

    Introduction: Formal methods and formal verification of source code have been used extensively in the past few years to create dependable software systems. However, although formal languages like Spec# or JML are quite popular, the set of verified implementations remains small. Our work aims to automate some of the steps involved in writing specifications and their implementations by reusing existing verified programs: for a given implementation, we aim to retrieve similar verified code and then reapply the missing specification that accompanies that code. Similarly, for a given specification, we aim to retrieve code with a similar specification and use its implementation to generate the missing implementation.

    Figure 1: Example: Similar implementations reusing specifications

    Reasoning by Analogical Comparison: One of the more successful disciplines in artificial intelligence in recent years has been Case-Based Reasoning (CBR). While CBR traditionally uses relatively straightforward "cases", retrieving similar implementations requires structure-rich "cases" and necessitates CBR's parent discipline of Analogical Reasoning (AR). Both AR and CBR solve problems not from first principles, but by using old solutions to solve new problems. We use these problem-solving disciplines to assist the reuse of verified programs.

    The canonical analogical algorithm is composed of the phases: Representation, Retrieval, Mapping, Validation and Induction. This paper focuses on Representation, Retrieval and Mapping. Currently in Arís, our model for verified program reuse, we focus on code retrieval. Representation depicts problems (and solutions) as static parse trees. We then retrieve similar parse trees for a given problem. Next, we identify the mapping between the two parse trees in order to generate the analogical inferences – transferring and adapting the retrieved specification to the given problem. Our parallel work on specifications explores specification matching and specification reuse. We briefly discuss this work later.

    Figure 2: The Case-Base contains Source code implementations and corresponding formal specifications

    Code Retrieval: When presented with a problem implementation, Arís begins by retrieving a similar implementation so as to re-use its specification. Each "case" is a software artefact, which varies in granularity from methods, to classes, to full applications. Our current model is implemented to perform retrieval using C# source code and corresponding Spec# (Leino & Müller, 2009) specifications. Retrieval ranks code artefacts by their similarity score with a given input artefact.

    ¹ Meaning "again" in the Irish language.

    public void Swap(int[] a, int i, int j) {
        int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }

    public void Swap(int[] a, int i, int j)
        requires 0 ...
    {
        ...

  • Figure 3: Two matched program graphs

    Mishne and De Rijke (2004) successfully used conceptual graphs as a basis for source code retrieval. A conceptual graph (Sowa, 1984) is a bipartite, directed and finite graph, in which each node has an associated type (it can be either a concept node or a relation node) and a referent value (the content inside the node). We use the concept nodes to model different structures and attributes from the source code (e.g. a Loop concept) and relation nodes to show how they relate to one another (e.g. a concept node of type Method may have an outgoing edge to a relation node of type Parameter). The advantage of using this kind of representation is that it offers the possibility of exploring the semantic content of the source code while also analyzing its structural properties using graph-based techniques. In order to create the conceptual graph we first parsed the source code into an Abstract Syntax Tree (using the Microsoft Roslyn API) and then transformed the result into the higher-level representation of a conceptual graph. Source code retrieval is performed using distinct semantic and structural characteristics of the source code.
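    A rough sketch of this representation is given below. The node types follow the examples above (Method, Parameter, Loop); the API itself is invented for illustration and is not Arís code.

    ```python
    from dataclasses import dataclass, field

    # Illustrative encoding of a conceptual graph: nodes are either concept
    # or relation nodes, each with a type and a referent, and directed edges
    # only ever join the two different kinds (bipartite).
    @dataclass
    class Node:
        kind: str      # "concept" or "relation"
        type: str      # e.g. "Method", "Parameter", "Loop"
        referent: str  # the content inside the node

    @dataclass
    class ConceptualGraph:
        nodes: list = field(default_factory=list)
        edges: list = field(default_factory=list)  # (src, dst) index pairs

        def add(self, node: Node) -> int:
            self.nodes.append(node)
            return len(self.nodes) - 1

        def connect(self, src: int, dst: int) -> None:
            # Enforce bipartiteness: a concept may only link to a relation.
            assert self.nodes[src].kind != self.nodes[dst].kind
            self.edges.append((src, dst))

    g = ConceptualGraph()
    m = g.add(Node("concept", "Method", "Swap"))
    p = g.add(Node("relation", "Parameter", ""))
    v = g.add(Node("concept", "Variable", "a"))
    g.connect(m, p)  # Method node -> Parameter relation
    g.connect(p, v)  # Parameter relation -> Variable node
    ```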

    In the semantic retrieval process, we express the meaning of the code through the use of API calls. API calls have been used as semantic anchors in (McMillan, Grechanik, & Poshyvanyk, 2012) as a successful method of retrieving similar software applications, because they have precisely defined semantics – unlike the names of program variables, types and words that programmers use in comments (techniques used in the majority of existing source code retrieval systems). We further rely on the Vector Space Model (Salton, Wong, & Yang, 1975) to represent our documents (code artefacts) and use API calls as "words" in these documents to support semantic retrieval.
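    A minimal sketch of this idea: treat each code artefact as a "document" whose "words" are the API calls it makes, and rank candidates by cosine similarity in that vector space. The artefacts and call lists below are invented for illustration.

    ```python
    import math
    from collections import Counter

    # Vector Space Model over API calls: each artefact is a bag of API-call
    # "words", and similarity is the cosine between the count vectors.
    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def rank(query_calls, corpus):
        """Rank corpus artefacts by similarity to the query's API calls."""
        q = Counter(query_calls)
        scored = [(cosine(q, Counter(calls)), name)
                  for name, calls in corpus.items()]
        return [name for _, name in sorted(scored, reverse=True)]

    corpus = {
        "ArraySorter": ["Array.Copy", "Array.Sort", "Console.WriteLine"],
        "FileLogger": ["File.Open", "StreamWriter.Write", "Console.WriteLine"],
    }
    ranking = rank(["Array.Sort", "Array.Copy"], corpus)
    ```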

    In the structural retrieval process, we focus on source code topology as represented in these conceptual graphs, analysing metrics of these representations (for example, number of nodes, number of loops, loop size, and connectivity (O'Donoghue & Crean, 2002)). Based on the conceptual graphs, we extract content vectors (Gentner & Forbus, 1991) which express the structure of the code and the number of concepts (for example, loops, statements, variable declarations, etc.). Because our database of source code artefacts will potentially be very large, we need an efficient way of retrieving the most similar cases. We propose using the K-means clustering algorithm to create sub-groups of content vectors and perform K-nearest neighbours only on the sub-group "closest" to a given input content vector, thus speeding up the structural retrieval process. We then combine semantic retrieval with structural retrieval and impose different constraints depending on the type (method, class, application) of the input artefact (query) – retrieving only artefacts of the same type. The outcome of retrieval, then, is a ranked list with a small number of the most similar artefacts found in the repository.
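    The two-stage structural retrieval can be sketched as follows, under the assumption that content vectors are plain tuples of concept counts; the naive initialisation and tiny vectors are for illustration only.

    ```python
    import math

    # Cluster content vectors with k-means, then run nearest-neighbour search
    # only inside the cluster closest to the query vector.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def kmeans(vectors, k, iters=20):
        centroids = vectors[:k]  # naive initialisation for the sketch
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for v in vectors:
                groups[min(range(k), key=lambda i: dist(v, centroids[i]))].append(v)
            centroids = [
                tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
                for i, g in enumerate(groups)
            ]
        return centroids, groups

    def retrieve(query, vectors, k=2, n=1):
        """Return the n nearest vectors, searching only the closest cluster."""
        centroids, groups = kmeans(vectors, k)
        closest = groups[min(range(k), key=lambda i: dist(query, centroids[i]))]
        return sorted(closest, key=lambda v: dist(query, v))[:n]
    ```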

    Code Mapping: This mapping task is a precursor to solution generation, finding detailed correspondences between the old solution and the new problem. We use the term source to refer to each retrieved (candidate) solution in turn and the term target to refer to our unspecified problem code. The objective here is to identify the best mapping between two isomorphic graphs, where the two programs use different identifier names. But mapping should also cater for homomorphic graphs, allowing different structures to be mapped together.

    For code matching we propose to use an incremental matching algorithm based on the Incremental Analogy Machine (IAM) (Keane, Ledgeway, & Duff, 1994). Although methods for comparing conceptual graphs have been proposed before, many of them focus on matching identical graphs or subgraphs (like Sowa's set of projections and morphisms) or they rely on various parameters that have to be empirically determined (Mishne & De Rijke used parameters like concept and relation weights, matching depth, etc.). IAM begins by matching the two largest sub-trees within the source and target graphs. This forms a seed mapping; additional structures from the source and target are then added iteratively, forming a single mapping between the new code and the previously specified code.

    Mapping Constraints: Specific constraints guide the formation of this mapping. Gentner's (1982) 1-to-1 constraint ensures that the mapping remains consistent. We also impose further constraints on those concepts that are mapped between the source and target code: we might ensure, for example, that variables are matched with variables and loops with loops. Isomorphic subgroups that share the same structural properties are then formed based on one-to-one correspondences between the graph concepts. However, we plan to map graphs that don't share the exact same structure, but that may be related (see for example the two graphs in Figure 3). We are currently exploring mappings between two non-isomorphic (homomorphic) sub-graphs from the source and target parse trees.

    Finding the appropriate "root" concept (the most referenced node in the graph) from which to start the matching process is a challenging aspect of using the IAM algorithm. We will address this issue by assigning node ranks (using a graph metric akin to PageRank, like the one proposed in (Bhattacharya, Iliofotou, Neamtiu, & Faloutsos, 2012)) and seeing whether this leads us to conclusive mappings between the source and target domains. The generated mapping is then evaluated to decide whether or not it is near-optimal, and the algorithm may choose to backtrack to select alternative seed mappings – or the mapping may be abandoned if sufficient similarity cannot be found.

    Inference: Once IAM has found a suitable mapping, it generates the analogical inferences needed to produce the required specifications for the given code. Analogical inferences are generated using a surprisingly simple algorithm for pattern completion called CWSG – Copy With Substitution and Generation (Holyoak et al., 1994). CWSG transfers the additional specifications from the retrieved code and adds them to the target code, substituting source code items with their mapped equivalents. This should allow our target/problem code to be formally verified using the newly generated specification. Optionally, we can also retain the newly verified source code artefacts for further use.
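    The substitution step of CWSG can be sketched as follows; the mapping and the Spec#-style clause are invented for the example, and this is not Arís's actual implementation.

    ```python
    import re

    # Illustrative sketch of CWSG's copy-with-substitution step: copy the
    # retrieved specification, replacing each mapped source identifier with
    # its target equivalent.
    def cwsg(spec: str, mapping: dict) -> str:
        # Substitute whole identifiers only, so a mapping for "a" does not
        # clobber the "a" inside a longer name like "abs".
        return re.sub(r"[A-Za-z_]\w*",
                      lambda m: mapping.get(m.group(0), m.group(0)),
                      spec)

    source_spec = "requires 0 <= i && i < a.Length;"
    mapping = {"i": "index", "a": "items"}  # source -> target identifiers
    target_spec = cwsg(source_spec, mapping)
    ```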

    Parallel Research on Specification Matching and Reuse: In addition to our work on code reuse we are exploring the reuse of specifications. For a given specification, we aim to retrieve code with a similar specification and use the retrieved implementation to generate the “missing” implementation. The work here has two approaches: the matching of specifications based on the description of the program’s behaviour and the reuse of the same specifications for implementations that differ only in their use of data structure.

    The first approach focuses on the specification matching of software components (Zaremski & Wing 1997) using a hierarchy of definitions for precondition and postcondition matching. We apply these definitions to specifications of C# code that are written in Spec# (Barnett et al. 2005) using the underlying static verifier Boogie (Leino et al. 2005) and the SMT solver Z3 (De Moura and Bjørner, 2008) to determine matches and to verify the correctness of our specification implementation pairs. Results to date are promising with a mixture of method specification matches allowing the retrieval and verification of similar specifications and their associated implementations.

    Our second approach focuses on reuse of specifications and program verifications. Software clients should only be concerned with specifications and do not need to know details about the implementation and the verification process. This separation of concerns is achieved via data abstraction where we provide an abstract view of a program which we can provide to the client without exposing implementation details. As a result, each specification may have many implementations, each differing in terms of the underlying data structure used in their implementations. The theory of data refinement guarantees that this difference of data structure does not adversely affect the correctness of the programs with respect to their specifications. Our research here explores the reuse of specifications, and the automatic generation of the associated proof obligations, when an implementation is replaced by another implementation that is written in terms of an alternative data structure. We focus on the Dafny language (Leino, 2010), which is closely related to the Spec# language and is verified using the Boogie static verifier. The advantage of using Dafny is that it offers updatable ghost variables which can be used to verify the correctness of a data refinement.

    Future work: The basic assumption underlying our work is that similar implementations have similar specifications. Further work is ongoing to ascertain the veracity of this hypothesis – and if true, what degrees of “similarity” are required. At the moment our work is fragmented by slightly different tools and approaches. In the future, we hope to combine these approaches within Arís to achieve a fully integrated platform with analogical reasoning as its core, allowing for reuse of implementations, specifications and their verifications.

    References

    Barnett M., DeLine R., Jacobs B., Fähndrich M., Leino K. R. M., Schulte W. and Venter H. (2005). The Spec# programming system: Challenges and directions. VSTTE 2005.

    Bhattacharya P., Iliofotou M., Neamtiu I. and Faloutsos M. (2012). Graph-Based Analysis and Prediction for Software Evolution. Intl. Conf. on Software Engineering, pp. 419-429.

    Chu S. and Cesnik B. (2001). Knowledge representation and retrieval using conceptual graphs and free text-document self-organization techniques. Int. Journal of Medical Informatics, 121-133.

    De Moura L. and Bjørner N. (2008). Z3: An efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2008, ETAPS Proceedings, volume 4963 of Lecture Notes in Computer Science, pages 337-340. Springer.

    Gentner D. and Forbus K. (1991). MAC/FAC: A model of similarity-based retrieval. Proc. Cognitive Science Society.

    Holyoak K. J., Novick L. and Melz E. (1994). Component Processes in Analogical Transfer: Mapping, Pattern Completion and Adaptation. In Analogy, Metaphor and Reminding, Ablex, NJ.

    Keane M. T., Ledgeway T. and Duff S. (1994). Constraints on analogical mapping: A comparison of three models. Cognitive Science 18, pp. 387-438.

    Leino K. R. M. and Müller P. (2009). Using the Spec# Language, Methodology, and Tools to Write Bug-Free Programs. Microsoft.

    Leino K. R. M., Jacobs B., DeLine R., Chang B. E. and Barnett M. (2005). Boogie: A Modular Reusable Verifier for Object-Oriented Programs. FMCO 2005.

    Leino K. R. M. (2010). Dafny: An Automatic Program Verifier for Functional Correctness. LPAR-16, April 2010.

    McMillan C., Grechanik M. and Poshyvanyk D. (2012). Detecting similar software applications. Intl. Conf. on Software Engineering, pp. 364-374.

    Mishne G. and De Rijke M. (2004). Source Code Retrieval using Conceptual Similarity. Computer Assisted Information Retrieval (RIAO '04), pp. 539-554.

    Monahan R. and O'Donoghue D. P. (2012). Case Based Specifications – reusing specifications, programs and proofs. Dagstuhl Reports Vol. 2(7), pp. 20-21, AI meets Formal Software Development.

    Montes-y-Gomez M., Lopez A. and Gelbukh A. F. (2000). Information retrieval with conceptual graphs. Database and Expert Systems Applications, 312-321.

    O'Donoghue D. and Crean B. (2002). RADAR: Finding Analogies using Attributes of Structure. Proc. AICS, Limerick, Ireland.

    Ounis I. and Pasca M. (1998). Modeling, Indexing and Retrieving Images using Conceptual Graphs. Proc. 9th Intl. Conf. on Database and Expert Systems Applications, Springer.

    Salton G., Wong A. and Yang C. (1975). A Vector Space Model for Automatic Indexing. Communications of the ACM, 18, 613-620.

    Sowa J. F. (1984). Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley.

    Zaremski A. M. and Wing J. M. (1997). Specification Matching of Software Components. ACM Transactions on Software Engineering and Methodology, Vol. 6, No. 4, 333-369.

  • Learning Domain-Specific Guidance for Theory Formation

    Jeremy Gow

    Computational Creativity Group

    Goldsmiths, University of London

    [email protected]

    HR is a theory formation system which (semi-)automatically develops existing formal theories, adding new concepts, conjectures and examples, and which has numerous applications in formal methods and beyond [2]. However, obtaining reasonable results in a new domain often requires the system to be properly configured, either by an expert user or through considerable trial-and-error. This can be overwhelming for novices: HR's GUI has a "Search" panel with 159 parameters (checkboxes, text fields etc.), with even more features available to those willing to investigate the source code.

    We report here on some initial experiments on getting HR to automatically learn a simple form of domain-specific guidance, in the context of Event-B lemma generation. Llano et al. [3] showed how HR could be used to generate invariants (conjectures) that allowed Event-B proof obligations to be automatically discharged during the model refinement process. Below, we use one of her examples to illustrate how information about the utility of existing conjectures can be exploited to reduce further search in the same domain.

    The general mechanism is as follows: whenever a theoretical entity is found to be useful in a particular domain (e.g. a conjecture discharges a proof obligation), the concepts which led to its creation are added to a valued list V. We define D(V) as the list of theory steps used to derive the elements of V. For subsequent searches in that domain, HR's agenda of theory formation steps can be ordered by i) the maximum complexity of the concepts in each step and ii) the number of times the step's rule appears in D(V). Hence production rules found to be successful in a domain are promoted in the search. Following Llano et al., we use a breadth-first exploration of the theory space, though rule orderings could be used with other strategies, e.g. HR's interestingness-based heuristic search. Our approach is implemented in a modified version of HR, and demonstrates a simple form of domain-specific learning in theory formation, which we hope to develop in future work.
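    The agenda ordering just described can be sketched as a two-key sort; the step records and rule names below are illustrative stand-ins for HR's internal structures, not HR's actual code.

    ```python
    from collections import Counter

    # Order theory formation steps by i) the maximum complexity of the
    # concepts in the step and ii) how often the step's production rule
    # appears in D(V), the derivations of previously useful conjectures.
    def order_agenda(steps, d_of_v):
        freq = Counter(d_of_v)
        # Lower complexity first; among equals, rules more frequent in D(V) first.
        return sorted(steps, key=lambda s: (s["complexity"], -freq[s["rule"]]))

    steps = [
        {"rule": "negate",   "complexity": 2},
        {"rule": "compose",  "complexity": 2},
        {"rule": "disjunct", "complexity": 1},
    ]
    d_of_v = ["compose", "compose", "disjunct"]  # compose proved useful twice
    agenda = order_agenda(steps, d_of_v)
    ```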

    Mondex example

    Mondex is a card-based system for handling electronic cash transfers. Butler and Yadav present a formal development in Event-B which incrementally refines an abstract specification of the Mondex system into a detailed design [1]. Each refinement step results in a set of proof

    * This work was supported by the Dependable System Group at Heriot-Watt University, through EPSRC Platform Grant EP/J001058/1. Thanks in particular to Teresa Llano, Andrew Ireland and Simon Colton for their invaluable advice and encouragement.


    T1  state(a) ∧ ((TRANS(b) ∧ idleF(a, b)) ∨ (TRANS(b) ∧ epr(a, b)))
          → state(a) ∧ TRANS(b) ∧ idle(a, b)

    T2  state(a) ∧ TRANS(b) ∧ epv(a, b) ∧ ((TRANS(b) ∧ epa(a, b)) ∨ (TRANS(b) ∧ abortepa(a, b)))
          → state(a) ∧ TRANS(b) ∧ pending(a, b)

    T3  state(a) ∧ TRANS(b) ∧ epa(a, b) ∧ ((TRANS(b) ∧ epv(a, b)) ∨ (TRANS(b) ∧ abortepv(a, b)))
          → state(a) ∧ TRANS(b) ∧ pending(a, b)

    T4  state(a) ∧ TRANS(b) ∧ recover(a, b)
          ↔ state(a) ∧ TRANS(b) ∧ abortepv(a, b) ∧ abortepa(a, b)

    T5  ¬∃a∃b. state(a) ∧ TRANS(b) ∧ epv(a, b) ∧ abortepv(a, b)

    T6  ¬∃a∃b. state(a) ∧ TRANS(b) ∧ epa(a, b) ∧ abortepa(a, b)

    T7  ¬∃a∃b. state(a) ∧ TRANS(b) ∧ idleT(a, b) ∧ ((TRANS(b) ∧ epa(a, b)) ∨ (TRANS(b) ∧ abortepa(a, b)))

    T8  ¬∃a∃b. state(a) ∧ TRANS(b) ∧ idleF(a, b) ∧ epr(a, b)

    T9  ¬∃a∃b. state(a) ∧ TRANS(b) ∧ idleT(a, b) ∧ ((TRANS(b) ∧ epv(a, b)) ∨ (TRANS(b) ∧ abortepv(a, b)))

    Table 1: Target conjectures for Mondex Level 4.

    obligations: although most can be discharged using automatic tools, a few (17 of 679) required human guidance. More recently, Llano et al. looked at automating invariant discovery in this example (among others), and showed that HR could generate invariants that allowed many of these outstanding proofs to be fully automated [3].

    We illustrate our approach to learning in HR with Level 4 of the Mondex refinement: Table 1 shows the 9 invariants required at this level (Llano, personal communication), the target conjectures that HR needs to generate. Additionally, we added weaker variants of T2, T3, T7 and T9, formed by dropping one half of a disjunction, to make a total of 16 target conjectures. The target conjectures can all be generated from concepts that are either in the specification, or that are derived by one or two applications of the compose and disjunct production rules.

    Depending on which of these conjectures have already been generated and used, our learning mechanism could generate a number of rule orderings. For simplicity, we focus on the ordering compose > disjunct > other induced by the entire set of conjectures, which requires 8 compose steps and 5 disjunct steps to derive.

    We ran our modified version of HR using three configurations, each for 10,000 steps:

    Config A The hand-crafted macros used by Llano et al. (Llano, personal communication). Five production rules are enabled: exists, numrelation, compose, disjunct and negate.

    Config B Config A with all applicable production rules enabled. Experiments indicated that eight rules could potentially be applied in this theory: the five from Config A, plus size, arithmetic and forall.

    Config C Config B with the rule ordering compose > disjunct > other enabled.

    Config A found 7 of the 9 target conjectures, and generated stronger variants (equivalences) for the remaining two (T2 and T3). Each conjecture required between 60 (T5) and 5524 (T7)


  • [Figure: bar chart; x-axis: target conjectures T1–T9; y-axis: steps / (steps for Config A), scale 0–5; bars for Config B (no rule ordering) and Config C (rule ordering).]

    Figure 1: Theory formation steps for each target conjecture, for "all rules enabled" Configs B and C, relative to the steps required for the human-authored 5-rule Config A (dashed line).

    theory formation steps. Enabling all the production rules (Config B) led to more than twice the amount of search for six conjectures, and more than the step limit for the remaining three. Our rule ordering approach (Config C) demonstrated lower costs than B, allowing all 9 targets to be found even though all rules were enabled. Figure 1 shows the search costs for B and C, relative to A. For conjectures requiring only 1-step concepts, B and C both find the target, with C outperforming A and B, and B requiring about 2-3 times as much search as A. For more complex conjectures, requiring 2-step concepts, C finds the target at much higher cost than A, and B fails to find it within 10,000 steps. In summary, our learned Config C outperforms the naive Config B, though it is ultimately inferior to the hand-crafted Config A, which has the advantage of turning some useless rules off completely.

    In further work, we hope to test this approach more rigorously: we are currently working with Llano to recreate the rest of the target conjectures that were used (but not documented) in [3]. We also hope to explore more sophisticated forms of domain-specific knowledge, including: patterns of rule applications, e.g. the recurrence of the x ∧ (y ∨ z) concept structure in the Mondex examples; learning analogies between theory predicates, e.g. the similar roles played by abortepa and abortepv; and learning measure weights in heuristic search.

    References

    [1] Michael Butler and Divakar Yadav. An incremental development of the Mondex system in Event-B. Formal Aspects of Computing, 20(1):61–77, 2008.

    [2] Simon Colton. Automated Theory Formation in Pure Mathematics. Springer-Verlag, 2002.

    [3] Maria Teresa Llano, Andrew Ireland, and Alison Pease. Discovery of invariants through automated theory formation. In Proceedings of the 15th International Refinement Workshop, pages 1–19, 2011.


  • Drawing Proof Strategies – the PSGraph Language & Tool

    Gudmund Grov, Yuhui Lin & Colin Farquhar (Heriot-Watt University)

    Aleks Kissinger (University of Oxford)
    Ewen Maclean (University of Edinburgh)

    Abstract

    Complex automated proof strategies are often difficult to analyse, extract, visualise, modify, and debug. Traditional tactic languages, often based on stack-based goal propagation, make it easy to write proofs that obscure the flow of goals between tactics and are fragile to minor changes in input, proof structure or the tactics themselves. Here, we outline PSGraph [2], a graphical language for writing proof strategies. Strategies are constructed visually by "wiring together" collections of tactics and evaluated by propagating goal nodes through the diagram. The language is supported by the PSGraph tool. We argue that it combines procedural and declarative views of proofs, making it more amenable to analysis and generalisation (across similar proofs), and we briefly outline some techniques for such generalisations. More details of the work described here can be found in [2, 3, 1].

    1 Introduction

    Most tactic languages for interactive theorem provers have no means to distinguish goals. Thus, when composing tactics, we have no choice but to rely on the order in which goals arrive, making the compositions brittle to minor changes. For example, consider a case where we expect three outputs of tactic t1, where the first two are sent to t2 and the last to t3. A small improvement of t1 may result in only two sub-goals. This "improvement" causes t2 to be applied to the second goal when it should have been t3. The tactic t2 may then fail or create unexpected new sub-goals that cause some later tactic to fail.
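    The brittleness of positional composition is easy to reproduce in a toy model (pure illustration, not any real tactic language; all names are hypothetical):

```python
# Positional composition: "send the first two sub-goals to t2, the rest to t3".
def compose_by_position(t1, t2, t3, goal):
    subgoals = t1(goal)
    return [t2(g) for g in subgoals[:2]] + [t3(g) for g in subgoals[2:]]

# Toy tactics: t2 only closes equality goals, t3 only membership goals.
def t2(g):
    assert g.startswith("eq:"), f"t2 cannot handle {g}"
    return "closed"

def t3(g):
    assert g.startswith("mem:"), f"t3 cannot handle {g}"
    return "closed"

old_t1 = lambda g: ["eq:a", "eq:b", "mem:c"]  # three sub-goals, as expected
new_t1 = lambda g: ["eq:a", "mem:c"]          # "improved": one fewer sub-goal

print(compose_by_position(old_t1, t2, t3, "goal"))  # all three close
try:
    compose_by_position(new_t1, t2, t3, "goal")     # t2 now receives mem:c
except AssertionError as e:
    print("broken:", e)
```

    The improved t1 silently shifts the membership goal into t2's position, and the composition breaks even though every individual tactic is unchanged.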

    As a result: (1) it is often difficult to compose tactics in such a way that all sub-goals are sent to the correct target tactic, especially when different goals should be handled differently; (2) when a large tactic fails, it is hard to analyse where the failure occurred; and (3) the reliance on goal order means that learning new tactics from existing proofs has not been as successful as learning relevant hypotheses has been in automated theorem provers.

    Here, we introduce the PSGraph language for representing proof strategies [2]. We argue that thislanguage has three advantages over more traditional tactic languages:

    (i) to improve robustness of proof strategies with static goal typing and type-safe tactic "wirings"; (ii) to improve the ability to dynamically inspect, analyse, and modify strategies, especially when things go wrong; and (iii) to enable learning of new tactics from proofs.

    We then introduce the PSGraph tool, which implements the language. The implementation is generic across theorem provers, but here we focus on the instantiation in Isabelle. Details of this work can be found in [2]. At the end we briefly discuss one particular use of the language: to generalise proofs into proof strategies [3].

    2 The PSGraph Language

    Traditional tactics are essentially untyped: they are functions from a goal to a conjunction of possible sub-goals. In many programming languages, types are used statically to rule out many "obvious" errors. For example, in (typed) functional languages, a type error will occur when one tries to compose two functions which do not have a unifiable type. In an untyped tactic language, this kind of "round-peg-square-hole" situation will not manifest until run-time.



    For errors that cannot be found statically, it is very hard to inspect and analyse the failures during debugging. In the above example, if t2 creates sub-goals that tactics later in the proof do not expect, the error may be reported in a completely different place. Without a clear handle on the flow of goals through the proof, finding the real source of the error could be very difficult indeed.

    To make proofs more robust, structured proofs show each intermediate goal of a proof, and are thus more robust to changes, in the sense that one can see exactly at which point a failed proof breaks down. However, for two proofs in the same "family" the goals will vary, so a proof strategy cannot include the exact expected goal.

    To overcome this, PSGraph has a concept of goal types. A goal type is essentially a predicate on a (sub-)goal, and a (sub-)goal is of a given goal type if the associated predicate holds. The underlying theory is generic with respect to the goal type used, and we have currently developed two goal types:

    • in [2] a simple goal type is used, where the user has to provide the implementation of the predicate. A goal type there is simply a conjunction of such user-provided predicates.

    • in [3] a more complex goal type is used. This is used to highlight relevant hypotheses, and enables much more static analysis. It is more suitable for analysis, proof extraction and generalisations, but may be too complex when drawing strategies by hand.

    PSGraph is defined using the mathematical formalism of string diagrams. A string diagram consists of boxes and wires, where the wires are used to connect the boxes. Both boxes and wires can contain data, and data on the edges provides a type-safe mechanism for composing two graphs. Crucially, string diagrams allow dangling edges. If such an edge has no source, then it becomes an input for the graph; dually, if an edge has no destination, then it is an output of the graph.

    The edges in a PSGraph contain a goal type, while a box can be either a tactic or a (sub-)goal. There are two types of tactics. The first type is known as a graph tactic, which is simply a node holding one or more graphs which can be unfolded. This is used to introduce hierarchies to enhance readability. A second usage, in the case where it holds more than one child graph, is to represent branching in the search space, as there are multiple ways of unfolding such boxes. The other type of tactic is an atomic tactic. This corresponds to a tactic of the underlying theorem prover.

    Evaluation is achieved by sending goals down wires, and it is completed when all the goals are on the output edges of the graph. For this, the (sub-)goal node is used. One step of evaluation for a tactic (box) t, with an input wire containing a goal node g, works as follows: t is applied to g and g is removed from the graph. Each sub-goal produced is then matched against each output edge. If it succeeds then it is added to the graph. Note that if a sub-goal does not match any of the output edges then the step fails, and if it matches multiple outputs, then this will introduce branching in the search space (one branch for each match).
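    One such evaluation step can be sketched as follows (a simplified model with hypothetical names; in PSGraph the goal types are richer and the goals live in a real prover):

```python
def eval_step(tactic, goal, out_edges):
    """One evaluation step for a tactic box, per the scheme above.

    tactic: function from a goal to a list of sub-goals.
    out_edges: list of (edge_name, goaltype_predicate) pairs.
    Returns a list of branches; each branch assigns every sub-goal to
    one matching output edge. An empty result means the step failed.
    """
    branches = [[]]
    for sub in tactic(goal):
        matches = [name for name, pred in out_edges if pred(sub)]
        if not matches:          # no matching output edge: the step fails
            return []
        # one branch per matching edge for this sub-goal
        branches = [b + [(sub, m)] for b in branches for m in matches]
    return branches

# Toy goal types: sub-goals tagged as equalities or memberships.
edges = [("to_simp", lambda g: g.startswith("eq:")),
         ("to_auto", lambda g: g.startswith("mem:"))]
tac = lambda g: ["eq:x", "mem:y"]
print(eval_step(tac, "goal", edges))
```

    Here each sub-goal matches exactly one edge, so a single branch results; had a sub-goal matched both edges, two branches would be returned, modelling the branching in the search space.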

    3 The PSGraph Tool

    The PSGraph language has been implemented in the PSGraph tool¹. It is implemented on top of Quantomatic², and supports both Isabelle and, to some extent, ProofPower, as well as both the goal types discussed above. Figure 1 illustrates one usage of the tool: as an Isabelle method. Here, the ipsgraph (interactive psgraph) method is called from the Isabelle/PIDE interface. This communicates with the graphical interface of PSGraph, which enables one to step through the proof. There is also an automated version of this method called psgraph. [1] discusses generation of structured proof scripts in Isabelle/Isar from an application of a psgraph method.

    ¹See https://github.com/ggrov/psgraph/.
    ²See https://sites.google.com/site/quantomatic/.



    Figure 1: Screenshot of the PSGraph tool.

    4 The use of AI for Generalising PSGraph Strategies

    One of our key motivations for this language is to support extraction of proof strategies from exemplar proofs, and this is where the AI comes into play. In [3] we have made a start on this by developing methods for, in particular, generalising goal types and tactic boxes, as well as loop discovery. The latter is particularly interesting, as we achieve it by a combination of goal type generalisation, tactic generalisation and graph rewriting. An important property of this is that we can (statically) discover the termination condition for a loop. Developing further techniques for such generalisations is the PhD topic of one of the authors (Farquhar).

    In the future we will also integrate the work discussed here with two other papers submitted to the AI4FM 2013 workshop: (i) 'Analogical Lemma Speculation' (Maclean), which we believe can be applied to discover related lemmas in cases of cut rules we were not able to generalise properly; and (ii) 'Verifying the heap: an AI4FM case study' (Whiteside et al.), where we plan to represent the captured strategies in PSGraph, and utilise the proof process discussed there to guide generalisations.

    Acknowledgements. Thanks to members of the AI4FM project, Lucas Dixon and Alex Merry. This work has been supported by EPSRC grants EP/H023852, EP/H024204 and EP/J001058, the John Templeton Foundation, and the Office of Naval Research.

    References

    [1] Colin Farquhar and Gudmund Grov. Structured proofs from a graphical proof strategy language. In Proc. of ARW 2013.

    [2] Gudmund Grov and Aleks Kissinger. A graphical language for proof strategies. arXiv preprint arXiv:1302.6890, 2013.

    [3] Gudmund Grov and Ewen Maclean. Towards automated proof strategy generalisation. arXiv preprint arXiv:1303.2975, 2013.


  • Analogical Lemma Speculation

    Ewen Maclean

    July 18, 2013

    1 Motivation

    One of the main ideas that underlies the work discussed here is that the strategy for completing a proof can be explicit or implicit:

    Explicit strategies are those where each step of the proof is prescribed exactly.

    Implicit strategies refer to those where the power of the prover and its search strategy means that the choice of rules to apply determines the success or failure of a proof attempt.

    Even in the implicit case, some form of strategy is employed but the precise order of application of rulesis not of interest.

    This paper concerns the implicit case, where we are concerned more with trying to find the right rules to allow the proof to succeed. We present an idea for lemma generation which combines the notions of analogy and theory formation.

    The motivation for lemma generation via analogy comes from the main hypothesis of the AI4FM project, combined with an as yet unresolved problem with combinatorial explosion in the generation of lemmas.

    In the AI4FM project, the hypothesis is stated as:

    we believe that it is possible to (devise a high-level strategy language for proofs and)extract strategies from successful proofs that will facilitate automatic proofs of related proofobligations.

    Currently there is an implementation of a "Strategy Language" in Isabelle, which allows proofs to be annotated with "features" and for strategies to be extracted from example proofs [1]. The idea is that the extracted strategy should be able to be "replayed" on other proofs.

    The intuition is that during a mechanised proof of a large problem, which involves many smaller proofs, if an expert shows how one example can be proved, a strategy can be extracted from this which accounts for a good percentage of the remaining proofs from the problem.

    In this note we translate this idea to the notion of implicit strategies by attempting to discover lemmas by analogy. We attempt to construct lemmas from one template example.

    2 Terminology

    We work within the context of an Automated Theorem Prover, where a current theorem is being studied and a proof attempted. In the case where it is blocked, the system searches for a similar "analogous" theorem and gleans information from its proof. We use the following terminology:

    Target Theorem This is the theorem which is to be proved.

    Source Theorem This is a similar theorem which has been proved and acts as an analogous example to the Target Theorem.

    Source Lemma This is a user-defined lemma which was employed in the proof of the Source Theorem.

    Target Lemma This is a lemma which is to be discovered and should be analogous to the Source Lemma.

    For this presentation we do not describe in detail the way in which the Source Lemma and Source Theorem are chosen. The work follows the ideas of [3], which provides a way to determine similarity between proofs. The lemmas we prove in this exposition are all non-conditional, which is a restriction of the current system. Dealing with conditionals is further work.


  • ∗ → ∗
    fact → fib
    qfact → qfib

    Table 1: Equivalences between functions of the Source and Target Lemmas

    3 Lemma Analogy

    In order to elucidate our approach we use a running example taken from the work on formalising the Java Virtual Machine in ACL2 [2]. The examples given here are provided by a tool written specifically for ACL2 which incorporates Statistical Machine Learning. For example, we take our Source Theorem to be

    fact(n) = qfact(n, 1)

    where fact is a primitively recursive factorial function, and qfact is a tail recursive factorial function where the accumulator argument is here set to 1. The proof of this theorem employs a user-defined generalisation lemma, the Source Lemma:

    fact(n) ∗ a = qfact(n, a)
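    The Source Theorem and Source Lemma can be checked on small inputs; a minimal Python sketch, assuming the standard definitions that the text omits:

```python
# Primitively recursive factorial.
def fact(n):
    return 1 if n == 0 else n * fact(n - 1)

# Tail-recursive factorial with accumulator a.
def qfact(n, a):
    return a if n == 0 else qfact(n - 1, n * a)

# Source Theorem: fact(n) = qfact(n, 1)
assert all(fact(n) == qfact(n, 1) for n in range(10))

# Source Lemma (the generalisation): fact(n) * a = qfact(n, a)
assert all(fact(n) * a == qfact(n, a)
           for n in range(10) for a in range(1, 10))
print("Source Theorem and Source Lemma hold on all tested values")
```

    The lemma generalises the theorem precisely because the accumulator a is abstracted: instantiating a = 1 recovers the Source Theorem.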

    Now we are faced with trying to prove the Target Theorem:

    qfib(n, 0, 1) = fib(n)

    where fib(n) gives the nth number of the Fibonacci sequence, and qfib is a tail recursive version with two accumulator variables set to 0 and 1.

    We omit the definitions of the defined functions for space reasons as they are standard in the literature. The Source Lemma is the key to proving the Source Theorem: it is a typical generalisation used in proving equivalences between tail recursive and primitively recursive functions. The goal is to efficiently generate the analogical lemma for the Target Theorem.

    3.1 Function symbols

    Firstly we identify the symbols common to the theory of naturals; these are

    +  ∗  −  1  0

    then we attempt to create equivalences between the Source and Target Theorem symbols. A set of tables of equivalences such as this is generated in order to construct the analogical lemma. Common function symbols are constrained to be equivalent to common function symbols, so ∗ cannot be replaced with fib, for example. Table 1 shows just one possible equivalence.

    3.2 Tree recomposition

    The term tree of the Source Lemma is decomposed and recomposed with the function symbols from the equivalence table shown in Table 1. Where the arity is different, a new variable is inserted at each possible location of the correct type (in this case just natural numbers), and all permutations are generated. Where the lemma is an equality with one function symbol on one side, no permutation is calculated as this produces duplicates up to alpha-equivalence. The left side of Figure 1 shows a first attempt at tree recomposition. Each candidate lemma is passed through a simple counter-example checker which attributes values to each variable and checks whether the lemma is invalid. In the case of the lemma on the left hand side of Figure 1:

    fib(n) ∗ a = qfib(n, m, a)

    the candidate lemma can immediately be ruled out due to the existence of a single variable which does not appear on the other side of the equality. If no potentially valid candidate lemma is found then a more complicated tree recomposition (called mutation) takes place.


  • 3.3 Tree mutation

    If no generated lemma passes the testing phase, then a secondary, more involved process takes place where the tree is "mutated" by complicating it to add more structure to the nodes. The process expands each leaf node of the tree where a variable is not found, and then in a final phase expands the top node to introduce extra structure to account for theorems such as common distributivity laws. Such a tree reconstruction is shown on the right side of Figure 1. This leads to the correct lemma to be used:

    qfib(n, m, a) = fib(n) ∗ a + fib(n − 1) ∗ m

    Figure 1: A correct and incorrect recomposition of the Source Lemma using symbols from the Target Theorem
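    Both the refuted candidate and the mutated lemma can be checked by testing; a minimal sketch, again assuming standard definitions of fib and qfib (the text omits them):

```python
# Primitively recursive Fibonacci.
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Tail-recursive Fibonacci with two accumulators; qfib(n, 0, 1) = fib(n).
def qfib(n, m, a):
    return m if n == 0 else qfib(n - 1, a, m + a)

# Target Theorem: qfib(n, 0, 1) = fib(n)
assert all(qfib(n, 0, 1) == fib(n) for n in range(10))

# The first recomposition attempt fib(n) * a = qfib(n, m, a) is also
# refuted by a small counterexample: fib(2) * 1 = 1 but qfib(2, 1, 1) = 2.
assert fib(2) * 1 != qfib(2, 1, 1)

# The mutated lemma holds: qfib(n, m, a) = fib(n)*a + fib(n-1)*m
assert all(qfib(n, m, a) == fib(n) * a + fib(n - 1) * m
           for n in range(1, 8) for m in range(4) for a in range(4))
print("candidate refuted; mutated lemma passes testing")
```

    Instantiating m = 0, a = 1 in the mutated lemma recovers the Target Theorem, mirroring how the Source Lemma generalises the Source Theorem.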

    4 Results

    We have tested the system on a set of theorems from the ACL2 implementation of the Java Virtual Machine:

    qfact(n, 1) = fact(n)

    qpower(n, 1) = power(n)

    qexpt(n, m, 1) = expt(n, m)

    qmul(n, m, 1) = mul(n, m)

    qsumsquare(n, 0) = sumsquare(n)

    qsum(n, 0) = sum(n)

    qfib(n, 0, 1) = fib(n)

    By choosing all possible Source and Target combinations, the correct lemmas are suggested immediately

    except in the case where the Source Lemma has a deeper term tree than the Target Lemma. This is a failing of the system. However, the process has discovered some interesting results, which are unexpected, such as

    qfact(n, m) = (fact(n − 1) ∗ n) + (fact(n) ∗ (m − 1))
    qpower(n, m) = (power(n − 1) ∗ m) + (power(n − 1) ∗ m)

    where power(n) denotes 2^n and qpower is a tail-recursive version.

    References

    [1] Gudmund Grov and Ewen Maclean. Towards automated proof strategy generalisation. CoRR, abs/1303.2975, 2013.

    [2] M. Kaufmann, P. Manolios, and J S. Moore. Computer-Aided Reasoning: An Approach. Kluwer Academic Publishers, 2000.

    [3] E. Komendantskaya, J. Heras, and G. Grov. Machine Learning for Proof General: interfacing interfaces. Electronic Proceedings in Theoretical Computer Science, 118:15–41, 2013.


  • Theory Exploration for Interactive Theorem Proving

    Moa Johansson

    Chalmers University of Technology

    Abstract

    Theory exploration is an automated reasoning technique for discovering and proving interesting properties about some given set of functions, constants and datatypes. In this note we describe ongoing work on integrating the HipSpec theory exploration system with the interactive prover Isabelle. We believe that such an integration would be beneficial for several reasons. In an interactive proof attempt, a natural application would be to allow the user to ask for suggestions of new lemmas that might help the current proof development. Theory exploration may also be used to automatically generate and prove some basic lemmas as a first step in a new theory development. Furthermore, when the theory exploration system is used as a stand-alone system, it should output checkable proofs, for instance for Isabelle, so that sessions can be saved for future use.

    1 Introduction

    Theory exploration is an automated reasoning technique for discovering and proving interesting properties about some given set of functions, constants and datatypes. In the context of theorem proving, theory exploration systems are typically used to generate candidate conjectures satisfying some interestingness criteria, using automated testing or counter-example finding to produce candidates likely to be theorems.

    Many theory formation systems, such as MATHsAiD [6], IsaCoSy [5] and IsaScheme [7], are automatic and attempt to produce a background theory with 'interesting' lemmas which are likely to be useful in later proof attempts. However, while both IsaCoSy and IsaScheme are built on top of the interactive prover Isabelle [8], neither can be called from an interactive proof attempt. This is due to long runtimes: IsaCoSy and IsaScheme are generally too slow to wait for in an interactive proof attempt. We have recently developed a new theory exploration system called HipSpec, which discovers and proves properties about Haskell programs by induction [2]. HipSpec is considerably faster than previous systems, and thus a candidate for integration in an interactive system.

    Large theories with many functions and datatypes are a major challenge for automated theory exploration systems. Dealing with theories with many hundreds, or even thousands, of functions requires some form of modular exploration and division into smaller sub-theories. In the interactive scenario we can to some extent circumvent this problem. When the user asks the system for suggestions of lemmas applicable to the current subgoal, the input to the theory exploration system can be directly specified by the user or limited to functions relevant to that subgoal.

    In this note, we describe a new project aiming at integrating theory exploration and interactive theorem provers:

    • Theory exploration has so far mainly been used in conjunction with automated theorem provers. Many proof developments are however too complex to be fully automated. By integrating theory exploration with an interactive theorem prover we make theory exploration available to a wider audience.

    • We aim to develop a tool integrated with the interactive proof assistant Isabelle which allows a user who is stuck in a proof attempt to ask for candidate lemmas which could help the current proof attempt. It is crucial that the system responds quickly and does not produce too many (or too few) candidates.

    • We would like theory exploration systems such as HipSpec to produce a sensible output format, so that proofs can be independently checked and lemmas recorded in a library. Producing Isabelle theory files is one option.


  • 2 Background: The HipSpec System

    HipSpec is an inductive theorem prover and theory exploration system for discovering and proving properties about Haskell programs. HipSpec takes a Haskell program annotated with properties which the user wishes to prove, and first generates and proves a set of equational theorems about the functions and datatypes present. These are added to its background theory, and the user-specified properties are then proved in this richer theory. Experimental results have been encouraging: HipSpec performs very well and proved more theorems automatically than other state-of-the-art inductive theorem provers [2].

    [Figure: the Haskell source is translated (Hip) into a first-order theory; QuickSpec generates conjectures, to which induction (Hip) is applied before they are sent to a theorem prover; a successful proof yields a theorem that extends the theory, a timeout leaves an open conjecture.]

    Figure 1: Architecture of the HipSpec system.

    HipSpec combines the inductive prover Hip [10] with the QuickSpec system [3]. An overview of HipSpec is shown in Figure 1. HipSpec first translates the function definitions and datatypes from the Haskell program into first-order logic to form an initial theory about the program. QuickSpec also reads in the available function symbols and datatypes and proceeds to generate new terms from these. Using the automated testing framework from the QuickCheck system [1], QuickSpec divides these terms into equivalence classes, from which equational conjectures are extracted. These conjectures are passed back to Hip, which applies a suitable induction scheme and feeds the resulting proof obligations, along with the first-order background theory, to an automated first-order prover, for instance Z3 [4]. If the proof succeeds, the new theorem is added to the background theory, and may be used as a lemma in subsequent proof attempts. If a proof fails, HipSpec tries other conjectures and may return to open ones later, when more lemmas have been added to the background theory.
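    The QuickSpec step, dividing candidate terms into equivalence classes by testing, can be sketched as follows (a toy over unary arithmetic terms with hypothetical names; the real system generates the terms itself and works over arbitrary typed signatures):

```python
import random
from collections import defaultdict

def conjecture_by_testing(terms, n_tests=200, seed=0):
    """Evaluate each candidate term on random inputs and group terms
    whose result vectors agree; equations between members of the same
    class become candidate conjectures for the prover.

    terms: list of (name, unary_function) pairs.
    """
    rng = random.Random(seed)
    inputs = [rng.randrange(100) for _ in range(n_tests)]
    classes = defaultdict(list)
    for name, f in terms:
        # terms with identical results on every test land in one class
        classes[tuple(f(x) for x in inputs)].append(name)
    return [cls for cls in classes.values() if len(cls) > 1]

terms = [("x+x", lambda x: x + x),
         ("2*x", lambda x: 2 * x),
         ("x*x", lambda x: x * x)]
print(conjecture_by_testing(terms))  # groups x+x with 2*x, not x*x
```

    Only terms that agree on every random input are conjectured equal, which is why conjectures reaching the prover have already survived a large number of tests.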

    3 Integrating HipSpec and Isabelle

    Isabelle's higher-order logic is essentially a (terminating and finite) subset of the functional programming language Haskell, and Isabelle's code generator can translate Isabelle theories into Haskell programs. HipSpec then needs to monomorphise any polymorphic types in order to translate the result into first-order logic, after which it can generate lemmas about the functions corresponding to the Isabelle theory. A very early prototype allowing HipSpec to be called from Isabelle has been implemented, but a lot of implementation work remains, in particular to import the discovered lemmas back into Isabelle.

    HipSpec does not produce proofs, as it relies on external provers as black boxes. We have however experimented with using Z3's capability to produce unsatisfiable cores in order to report back which lemmas were used in a proof. Using this information, we plan to experiment with a lightweight inductive tactic for Isabelle, which should apply induction and simplification using the relevant lemmas, thus verifying the proof in Isabelle as an added soundness check. Given the right lemmas, we expect such a simple induction tactic would succeed in proving many of the conjectures which HipSpec has discovered. This is similar to how the Sledgehammer system works [9]: it sends a conjecture from Isabelle to external provers and then replays the proof using Isabelle's internal prover Metis. Once we have an Isabelle tactic for HipSpec we can use it in several ways:

    Automated induction: HipSpec takes the current goal and applies theory exploration and induction. If it succeeds, it reports which induction scheme and lemmas were used to its corresponding tactic, which replays the proof in Isabelle. If any new lemmas were found, information to replay their proofs should also be produced.


    Generate background lemmas: The user specifies a set of Isabelle functions and datatypes which are of interest. HipSpec applies theory exploration to these and generates a set of interesting lemmas along with instructions to verify the proofs inside Isabelle. This could be done as a first step in a new theory development, to automatically generate many basic lemmas before the user tackles more complicated theorems. Alternatively, the user may call theory exploration to help in an ongoing proof attempt where the user is stuck. In this case, theory exploration is even more restricted: it should only generate lemmas which really do apply to the goal; anything else is uninteresting. The user should be presented with a short list of options to choose from.

    HipSpec uses random testing to generate conjectures, hence all conjectures have passed a large number of tests before being submitted to the prover. In an interactive setting, even conjectures HipSpec has failed to prove automatically can be of interest, with the user interactively supplying the proof, should they find the conjecture useful.

    Produce Isabelle theories about Haskell programs: Ultimately we would like HipSpec to be useful for verifying real Haskell programs. As mentioned, HipSpec currently uses a black-box external prover and does not output any proofs. For verification purposes we may want some additional reassurance. Therefore, we suggest that HipSpec should have the option to output its results as Isabelle theories. The user can then inspect the Isabelle theory, and recheck the proofs independently if required. Lemmas about libraries can also more easily be imported if required. Translating from Haskell to Isabelle is however not as straightforward as the other way around: Haskell is a lazy language and also supports infinite datatypes and non-terminating functions, which Isabelle/HOL does not. Hence, in the first instance we will be limited to the finite and terminating subset of Haskell.

    4 Summary

    We believe that theory exploration would also be very useful in an interactive setting. It could assist the user with suggestions of lemmas relevant to the current proof attempt, or simply generate basic lemmas in a new theory. By letting the user specify the functions and datatypes passed to theory exploration, we can control the search space and explore larger and more complex theories that are currently beyond the scope of the fully automatic version. We are currently working on integrating the theory exploration system HipSpec with the interactive prover Isabelle.



