
Università degli Studi di Trieste
Facoltà di Scienze Matematiche, Fisiche e Naturali

Master's degree thesis in MATHEMATICS

Numerical errors in the VBL (Virtual Biophysics Lab) simulation program: Newton's method solution of high-dimensional systems of nonlinear equations from implicit integration steps.

Errori numerici nel programma di simulazione VBL (laboratorio virtuale di biofisica): soluzione tramite il metodo di Newton per sistemi di equazioni nonlineari di grandi dimensioni derivanti da passi di integrazione implicita.

Candidate: Nicole Cusimano
Supervisor: Prof. Edoardo Milotti
Co-supervisor: Prof. Marino Zennaro

Academic Year 2009–2010

Acknowledgments

The first thank-you, heartfelt and due, goes to Prof. Edoardo Milotti who, with his enormous enthusiasm and experience, involved me in his project and always gave me the right stimulus to carry on this work. Juggling his thousand and one academic (and other) commitments with extreme skill, he always found a way to dedicate part of his time to teaching me something new, or simply to a kind word.

A truly sincere thank-you then goes to Prof. Marino Zennaro, who agreed to act as "supervisor" of the mathematical part of the thesis. With great care and attention he always read and promptly corrected my material, helping me especially in the simulation phase and in the interpretation of the results.

Thanks also to Prof. Igor Moret, who gave me useful advice and valuable bibliographic references concerning some of the numerical methods I studied for this thesis.

Now something less formal. This is the moment for a really special person who helped me a lot with the writing... I hope she will enjoy these "special thanks" in English (even if they are probably full of grammatical and syntactical mistakes!). I want to thank her very much for her help and the patience she had with me. Thanks a lot, auntie Deisa! You have said that, with this thesis, you went back to the past but, unfortunately, no longer have the same dexterity with translations. It doesn't matter: for me you still do a great job!!! A big big kiss!

Much less formal... Now I must thank the four people who, for better or worse, have had to put up with me every day, from the beginning to the end of this thesis and beyond... To my Davidone, who comes to see me often and gladly while I study and says "Hi, how are you?", even though I know that what he really wants is to watch episodes of The Big Bang Theory on my computer; to Andry, who keeps "cheering me up" with his electric bass at full volume from the floor above while singing "Through the Fire and Flames" at the top of his lungs; to Dad, who calls me "his Poochie" and always makes me smile (even though everyone knows his repertoire of jokes by heart); and to Mum, who has always backed me up with her "appetizing breaks" to recharge me while studying: thank you, thank you, thank you, thank you! It is also thanks to you that I have made it this far.

Contents

1 Introduction

2 Background material in numerical analysis
  2.1 Numerical solution of differential equations
    2.1.1 Euler's methods
  2.2 Runge-Kutta methods
    2.2.1 Methods with step-size control
  2.3 Numerical tests
    2.3.1 The stiff case
  2.4 Iterative methods
    2.4.1 Iteration matrices and convergence

3 Newton's method
  3.1 Newton's sequence
    3.1.1 Local convergence theory
    3.1.2 Termination of the iteration
  3.2 Approximations to the Jacobian
  3.3 Errors in the function and the derivative
    3.3.1 The modified Newton method
  3.4 Newton iteration error
    3.4.1 Liniger's pioneering work
    3.4.2 Analogous estimates by Sugiura and Torii
  3.5 Spijker's criticism on the stiff case
    3.5.1 Spijker's paper
    3.5.2 Local and global stopping error
  3.6 Example

4 Krylov subspace methods
  4.1 Projection methods
    4.1.1 Two optimality results
  4.2 Krylov subspaces
    4.2.1 Arnoldi's method
    4.2.2 Arnoldi's method for linear systems
  4.3 GMRES
    4.3.1 Practical implementation issues
    4.3.2 Restarted GMRES
    4.3.3 Convergence of GMRES
  4.4 Preconditioning

5 Jacobian-free Newton-Krylov methods
  5.1 Inexact Newton methods
  5.2 Jacobian-vector product approximation
    5.2.1 How to choose ν
  5.3 Newton-Arnoldi and Newton-GMRES
  5.4 Preconditioned JFNK

6 Broyden's method
    6.0.1 Convergence theory
  6.1 Implementation of Broyden's method

7 Applications
  7.1 VBL
    7.1.1 Structure of the simulation program
    7.1.2 The structure of the equations
  7.2 Toy model
    7.2.1 Numerical parameter values
  7.3 Linear consumption
    7.3.1 Numerical simulation
  7.4 Michaelis-Menten's metabolic function
    7.4.1 Numerical simulation
  7.5 Reaction-diffusion of two substances
    7.5.1 Numerical simulation

8 Appendix
  8.1 A: Reaction kinetics
    8.1.1 Michaelis-Menten equation
    8.1.2 The equation in the common approach
  8.2 B: Matrices and properties
  8.3 C: Some Octave codes

Introduction

VBL (Virtual Biophysics Lab) is a computational project whose objective is the development of a numerical model for the simulation of the evolution of tumor spheroids, from their birth to the moment when the spheroid reaches approximately 1 mm in diameter.

The part of the simulation I deal with in this thesis is the one related to the diffusion and to the absorption and/or production of molecules in the system of cells. This part is deterministic and is described by a complex system of differential equations. In this part of the simulation the number of cells is fixed (this number is subsequently modified in the cellular-events part) and the contact interactions between cells are defined by a module of the program which computes, time after time, the topology of the set of cells. This module is executed during initialization and then repeatedly during the main loop of the program. In the model currently implemented in the simulation program there are 25 differential equations per cell. These are nonlinear differential equations, coupled to one another and to those of the adjacent cells.

As the simulation advances, the number of cells present in the spheroid varies and, when the spheroid reaches approximately 1 mm in diameter, it contains approximately one million cells. To proceed, therefore, the simulator must integrate a growing number of coupled nonlinear differential equations, eventually having to solve systems of some tens of millions of equations.

Because of cell proliferation and death, the number of equations to integrate changes from step to step. Moreover, the movements of the cells lead to proximity relations which are themselves continuously changing: these changes in the topology of the system produce continuous changes in the structure of the equations. For these reasons the algorithms used require an ad hoc analysis of their stability, convergence and accuracy properties.

At present, the integration of the differential system is carried out by means of the implicit Euler method. At each time step (considered constant in this simulation), the nonlinear equation arising from the use of implicit Euler is solved approximately by suitably combining Newton's method and Jacobi's method.

For a very large number of cells (i.e., when the dimension of the system to be integrated is very large), this combination turns out to be computationally very expensive. The difficulty may be due to the long times required by the approximate method to reach convergence, or to the computing time needed to construct, store and process the vectors and matrices required for the solution of the problem.

The objective of this thesis is to survey the numerical techniques currently used to solve analogous problems efficiently, and to provide sound suggestions for a possible improvement of the simulation program in terms of computational cost and speed of convergence.

For completeness, I have therefore decided to include a rather broad theoretical part on the various possible approaches to the problem and on the various results present in the literature. Only afterwards do I deal with a particular, simplified practical example.

Within the simulations I have also proposed a variant with control of the integration time step.

Contents of the thesis

Chapter 2 is essentially devoted to reviewing basic notions and results of numerical analysis concerning the integration of ordinary differential equations and iterative methods for the solution of linear systems.

In Chapters 3-6 I survey methods for solving systems of nonlinear equations, with a view to their possible application to the VBL system.

In Chapter 3 I introduce Newton's method as the reference tool for the approximate solution of nonlinear equations in the generic form F(x) = 0. I then present some variants of the method and the corresponding convergence results. Finally, I consider the effect of applying Newton's method to the equation arising from the use of a generic Runge-Kutta method for the integration of a differential initial value problem.

A particular class of projection methods, widely used in the numerical solution of large-scale linear systems, is presented in Chapter 4. Within this class, particular prominence is given to the GMRES method which, combined with suitable preconditioning techniques, turns out to be a valuable tool in practical applications. At the end of the chapter I outline the main preconditioning techniques.

In the following two chapters I analyze two different approaches for the approximate solution of the nonlinear equation F(x) = 0. In Chapter 5 I treat a particular class of inexact Newton methods, the Newton-Krylov methods, and the possibility of implementing them in a "matrix-free" fashion. In Chapter 6 I consider instead the quasi-Newton approach and, from this class of methods, I present Broyden's method.

The last chapter of the thesis, Chapter 7, concerns the VBL simulator and some simplified models. These models have been used to perform numerical tests and have made it possible to verify some of the theoretical results highlighted in the course of the thesis.

Chapter 1

    Introduction

VBL (Virtual Biophysics Lab) is a computational project with the objective of developing a numerical model for the simulation of the evolution of tumor spheroids, from their origin to the moment when the spheroid reaches approximately 1 mm in diameter.

The part treated in this thesis is that related to the diffusion and the absorption and/or production of molecules in the system of cells. This part is deterministic and is described by a complex system of differential equations. In this part of the simulation, the number of cells is fixed (this number is subsequently modified in the part of cellular events) and the contact interactions between cells are defined in a module of the program which computes, time after time, the topology of the cluster of cells. Such a module is executed during the initialization and then repeatedly during the main loop of the program. In the model currently implemented in the simulation program, there are 25 differential equations per cell. These equations are nonlinear and coupled to each other and to those relative to the adjacent cells.

As the simulation proceeds, the number of cells present in the spheroid varies and, when the spheroid reaches approximately 1 mm in diameter, it contains approximately one million cells. Consequently the simulator must integrate an increasing number of coupled nonlinear differential equations, eventually having to solve systems of tens of millions of equations.

Due to the proliferation and death of the cells, the number of equations to integrate changes from step to step. Moreover, the cellular movements lead to proximity relations which are in turn continuously changing. These changes in the topology of the system produce continuous changes in the structure of the equations. For these reasons the algorithms used require an ad hoc analysis of their stability, convergence and precision properties.


At present the integration of the differential system is accomplished by means of the implicit Euler method. At each time step (considered constant in the simulation), the nonlinear equation deriving from the use of implicit Euler is approximately solved by appropriately combining Newton's and Jacobi's methods.

The use of such a combination, for a very high number of cells (i.e., when the dimension of the system to be integrated is very large), becomes very expensive from the computational point of view. This difficulty may be due to the long times required by the approximate method to reach convergence, or to the CPU time necessary to construct, store and process the vectors and matrices needed for the solution of the problem.

The objective of this thesis is to survey the numerical techniques currently used to solve analogous problems efficiently, and to provide sound suggestions for a possible improvement of the simulation program in terms of computational cost and speed of convergence.

For the sake of completeness, I have therefore decided to include a rather broad theoretical part concerning the various possibilities of approaching the problem and the different results present in the literature. Subsequently, I have dealt with a particular, simplified practical example.

In the simulations I have also proposed a variant with control of the integration time step.

    Contents of the thesis

Chapter 2 is essentially dedicated to reviewing definitions and basic results of numerical analysis concerning the integration of ordinary differential equations and the iterative methods for the solution of linear systems.

In Chapters 3-6, I examine methods for the solution of systems of nonlinear equations, aiming at their possible application to the VBL system.

In Chapter 3, I introduce Newton's method as the reference tool for the approximate solution of nonlinear equations in the generic form F(x) = 0. Subsequently, I present some variants of the method and the respective convergence results. Finally, I consider the effect of applying Newton's method to the equation obtained by using a generic Runge-Kutta method for the integration of a differential initial value problem.

A particular class of projection methods, widely used for the numerical solution of large-scale linear systems, is presented in Chapter 4. In this class, the GMRES method is particularly relevant because, combined with suitable preconditioning techniques, it turns out to be a valuable tool in practical applications. At the end of the chapter, I outline the main preconditioning techniques.

In the subsequent two chapters, I analyse two different approaches for the approximate solution of the nonlinear equation F(x) = 0. In Chapter 5, I treat a particular class of inexact Newton methods, namely the Newton-Krylov methods, and the possibility of implementing these methods in a "matrix-free" fashion. In Chapter 6, instead, I consider the quasi-Newton approach and, from this class of methods, I introduce Broyden's method.

The last chapter of the thesis, Chapter 7, deals with the VBL simulator and some simplified models. These models have been used to perform numerical tests and have allowed the verification of some theoretical results highlighted in the course of the thesis.


Chapter 2

Background material in numerical analysis

2.1 Numerical solution of differential equations

We now briefly recall the fundamental definitions which are used when we have to approximate numerically the solution of differential equations, together with some theoretical results used in the particular cases dealt with in this thesis. This Chapter is, however, a specialized treatment, oriented towards analyzing cases inherent to the specific problem to be solved.

Consider the Cauchy differential problem

    y′(t) = f(t, y(t)),   t0 ≤ t ≤ tf (tf possibly +∞),
    y(t0) = y0,                                                          (2.1)

where f : [t0, tf] × Rs → Rs is a continuous function. Given a norm ‖·‖ in Rs, suppose that f is globally Lipschitz continuous in the second variable, i.e., assume that there exists L > 0 such that

    ‖f(t, y) − f(t, z)‖ ≤ L ‖y − z‖                                      (2.2)

∀t ∈ [t0, tf] and ∀y, z ∈ Rs. L is called the classical Lipschitz constant.

Under the previous assumptions, the condition of continuous dependence on the initial data holds, i.e.,

    ‖y(t) − z(t)‖ ≤ e^(L(t−t0)) ‖y0 − z0‖,                               (2.3)


where z(t) is the solution to the Cauchy problem (2.1) with initial condition z(t0) = z0.

If in Rs we consider the norm induced by the inner product ⟨·, ·⟩ and we verify that there exists M ∈ R such that

    ⟨f(t, y) − f(t, z), y − z⟩ ≤ M ‖y − z‖²                              (2.4)

∀t ∈ [t0, tf] and ∀y, z ∈ Rs, then the constant M is referred to as the one-sided Lipschitz constant.

Assuming hypothesis (2.4), we can prove a relation similar to that of the continuous dependence on the initial data, in which the classical Lipschitz constant is replaced by the one-sided Lipschitz constant, i.e.,

    ‖y(t) − z(t)‖ ≤ e^(M(t−t0)) ‖y0 − z0‖.                               (2.5)

Differently from L, the constant M may also be negative. If M ≤ 0, the problem under consideration is called dissipative. It holds that |M| ≤ L.

If f ∈ C¹([t0, tf] × Rs), a classical Lipschitz constant can be expressed in terms of the Jacobian matrix ∂f/∂y as follows:

    L = sup{t∈[t0,tf], y∈Rs} ‖(∂f/∂y)(t, y)‖.                            (2.6)

In general, it is clear that the constant defined by (2.6) is a classical Lipschitz constant, but we cannot say that it is the "best one", i.e., the smallest constant such that relation (2.3) holds.

In the particular case in which the Cauchy problem under consideration is scalar (s = 1), thanks to the mean value theorem of Lagrange, we obtain that the L defined by (2.6) is the best classical Lipschitz constant (in the above sense).

When we have a dissipative problem and, furthermore, L ≫ 0 (i.e., when the classical Lipschitz constant is "very large"), we refer to (2.1) as a stiff problem¹. The essence of stiffness is as follows: the solution we want to approximate varies slowly, but perturbations exist which are rapidly damped (as in the case of strongly dissipative systems). See Subsection 2.1.1 for some useful remarks on this situation.

¹Clearly the concept of "very large" must be related to the behaviour of the solution we want to approximate, and cannot be quantified exactly.


In this thesis we consider only one-step methods for the numerical integration of the differential equation (2.1). The temporal interval [t0, tf] is divided into subintervals by a grid (or mesh)

    {t0 < t1 < · · · < tn < tn+1 < · · · < tN = tf},

and the difference between two subsequent nodes of this grid is called the time step. The time step may be variable, if hn+1 = tn+1 − tn depends on n, or constant, when hn = h ∀n. In one-step methods for the numerical integration of initial value problems, one computes the approximation of the solution at tn+1 (denoted by yn+1) just using yn and an angular coefficient which is a suitable function of tn, yn, hn+1 and f:

    yn+1 = yn + hn+1 Φ(tn, yn, hn+1, f).                                 (2.7)

Given a method for computing the numerical solution of the problem (2.1), we define the global error at the point tn as the quantity

    en = y(tn) − yn.                                                     (2.8)

The method is called convergent if

    lim(h→0⁺) max(0≤n≤N) ‖en‖ = 0.                                       (2.9)

In particular, the method is convergent of order p ≥ 1 if, ∀f ∈ Cᵖ,

    max(0≤n≤N) ‖en‖ = O(h^p).                                            (2.10)

2.1.1 Euler's methods

Consider the case of a grid with constant step-size h = (tf − t0)/N. Consequently, the nodes of the grid are tn = t0 + nh. The following formulas describe the explicit Euler method

    yn+1 = yn + h f(tn, yn),                                             (2.11)

and the implicit Euler method

    yn+1 = yn + h f(tn+1, yn+1).                                         (2.12)

Both methods are convergent of order p = 1. From each of these two methods, we can obtain a method of order two by exploiting the Richardson extrapolation process, which, in general, allows us to obtain the formula of a method of order p + 1 starting from a method of order p. This formula is obtained by combining the method of order p applied with time step h and with time step h/2 in the following way:

    (2^p F(h/2) − F(h)) / (2^p − 1),                                     (2.13)

where F represents the formula of the considered method of order p.

Clearly, to prove the order of the above formulas it is necessary to exploit properties of the Taylor series expansion of the exact solution, which must thus be sufficiently regular.

Obviously, when we apply formula (2.13) to an explicit method, we obtain as a result another explicit method. On the other hand, when we apply the same formula to an implicit method, the result is still an implicit method.
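As an illustration, here is a minimal Octave sketch of one step of each method and of the order-two combination (2.13) with p = 1 (the function names are our own, and Octave's fsolve is used for the implicit step in place of the Newton/Jacobi combination discussed later in the thesis):

    1;  % script file: illustrative one-step routines

    function y1 = euler_explicit(f, t, y, h)
      % one step of the explicit Euler method (2.11)
      y1 = y + h * f(t, y);
    end

    function y1 = euler_implicit(f, t, y, h)
      % one step of the implicit Euler method (2.12):
      % solve z = y + h*f(t+h, z) for z (fsolve used for simplicity)
      y1 = fsolve(@(z) z - y - h * f(t + h, z), y);
    end

    function y1 = euler_richardson(f, t, y, h)
      % order-two combination (2.13) with p = 1: 2*F(h/2) - F(h)
      yh = euler_explicit(f, t, y, h/2);
      yh = euler_explicit(f, t + h/2, yh, h/2);
      y1 = 2 * yh - euler_explicit(f, t, y, h);
    end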

The implicit Euler method, like all other implicit methods, is much more expensive from a computational point of view than the corresponding explicit method. Indeed, at each step, one must solve an equation (such as (2.12)) which, in general, is nonlinear.

However, the need to introduce the implicit Euler method (and the implicit combination of higher order) is justified by the advantages that the class of implicit methods offers in the treatment of stiff problems.

Stiff problems are characterized by a high level of stability, and this quickly drives the numerical solution obtained with explicit methods away from the exact solution: the next value of the approximation is located on the trajectory of another solution (of the same problem with a different initial value) which has a very steep behaviour. This distancing is therefore accentuated at each time step and the approximation obtained is, consequently, unacceptable.

Wishing to avoid the effect due to the "excessive" stability of these systems, with an explicit method one would have to choose an excessively small step-size. Implicit methods, on the contrary, allow the stability property of the problem to be better exploited and give accurate results with a much larger step-size.


2.2 Runge-Kutta methods

Consider the Cauchy problem

    y′(t) = f(t, y(t)),   t0 ≤ t ≤ tf,
    y(t0) = y0,                                                          (2.14)

where y0 ∈ Rs is given and f : [t0, tf] × D → Rs, with s ≥ 1 and D ⊂ Rs open and convex. Suppose that f possesses continuous partial derivatives, up to the second order, in its domain.

We want to find an approximation of the solution y(t) : [t0, tf] → Rs on a discrete set of points in [t0, tf]. Therefore, we divide the interval [t0, tf] into subintervals whose extremes constitute the reference grid or mesh.

Let N ∈ N; we consider a mesh with constant step-size

    h = (tf − t0)/N > 0.                                                 (2.15)

Hence, the nodes of the mesh are defined by {tn = t0 + nh | n = 0, . . . , N}.

Starting from the initial condition y0, the approximations yn provided by the generic Runge-Kutta method at each node tn of the grid are given by the iterative one-step formula

    yn+1 = yn + h Σ(i=1..m) bi f(tn + ci h, ξi),                         (2.16)

where every ξi ∈ D and the following relation holds:

    −ξi + yn + h Σ(k=1..m) aik f(tn + ck h, ξk) = 0   for i = 1, . . . , m.   (2.17)

The idea behind Runge-Kutta methods is that of considering the integral equation equivalent to the initial value problem (2.14) and approximating the integral by using interpolatory quadrature formulas with suitable weights (bi and aik). The parameters ci represent the nodes of such quadrature formulas. The parameters aik, bi and ci characterize the particular Runge-Kutta method we are using and are assigned so that

    Σ(i=1..m) bi = 1   and   Σ(k=1..m) aik = ci   for i = 1, . . . , m   (2.18)

hold.


These conditions are imposed a priori on the parameters, and thanks to them the interpolatory quadrature formulas are at least of order one (they are exact when the function f on the right side of equation (2.1) is constant and, hence, the solution is a polynomial of degree one). m is referred to as the number of levels of the method.

A useful way of representing a Runge-Kutta method is by means of the corresponding Butcher scheme, i.e.,

    c | A           c1 | a11 . . . a1m
    --+-----   =     : |  :         :
      | b^T         cm | am1 . . . amm
                    ---+--------------
                       | b1  . . . bm                                    (2.19)

The implicit Euler method, particularly significant in this thesis, is an example of a Runge-Kutta method with one level. Its Butcher scheme is

    1 | 1
    --+--
      | 1                                                                (2.20)

Given a Runge-Kutta method, the stability function of the method is the function φ : C → C defined by

    φ(ζ) = 1 + ζ b^T (I − ζA)^(−1) u,                                    (2.21)

where u = (1, . . . , 1)^T. Its region of absolute stability (or, simply, A-stability) is the set

    SA = {ζ ∈ C | det(I − ζA) ≠ 0 and |φ(ζ)| ≤ 1}.                       (2.22)

The Runge-Kutta method considered is called A-stable if

    SA ⊇ C⁻ = {ζ ∈ C | Re(ζ) ≤ 0}.                                       (2.23)
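The stability function can be checked numerically from the Butcher tableau; the following small Octave sketch (with the tableau of implicit Euler hard-coded as an assumed input) evaluates (2.21) at a point of the left half-plane:

    A = 1;  b = 1;  m = 1;  u = ones(m, 1);    % Butcher tableau (2.20)
    phi = @(zeta) 1 + zeta * (b' * ((eye(m) - zeta * A) \ u));   % (2.21)
    abs(phi(-1 + 2i))   % for implicit Euler phi(zeta) = 1/(1 - zeta),
                        % so this gives about 0.3536 < 1: A-stable there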

2.2.1 Methods with step-size control

Given the initial value problem (2.1), let zn+1(t) be the solution with initial condition yn, i.e.,

    z′n+1(t) = f(t, zn+1(t)),
    zn+1(tn) = yn.                                                       (2.24)

The function zn+1(t) is called the local solution, whereas y(t) is called the global solution of (2.1).


We define the local error as the quantity

    σn+1 = zn+1(tn+1) − yn+1.

Note that

    en+1 = y(tn+1) − yn+1
         = [y(tn+1) − zn+1(tn+1)] + [zn+1(tn+1) − yn+1]
         = [y(tn+1) − zn+1(tn+1)] + σn+1,                                (2.25)

where y(tn+1) is the exact solution with initial condition y(tn), while zn+1(tn+1) is the exact solution with initial condition yn. From (2.25), exploiting the relation of continuous dependence on the initial data, we obtain the estimate

    ‖en+1‖ ≤ e^(M hn+1) ‖en‖ + ‖σn+1‖.                                   (2.26)

Choosing hn+1 such that, ∀n,

    ‖σn+1‖ ≤ ε hn+1,                                                     (2.27)

where ε > 0 is a fixed quantity (referred to as the per unit step tolerance), since e0 = 0, we have the estimate

    ‖eN‖ ≤ ε Σ(n=1..N) e^(M(tf−tn)) hn,   with tf = tN.                  (2.28)

In (2.28), the sum is nothing but a rectangle quadrature formula for the function e^(M(tf−t)) in the interval [t0, tf]. Note that

    ∫(t0..tf) e^(M(tf−t)) dt = (1/M) (e^(M(tf−t0)) − 1).                 (2.29)

The right member of this equation depends substantially on the one-sided Lipschitz constant. Hence, to a first approximation, we can say that there exists K > 0 such that

    ‖eN‖ ≤ ε K.                                                          (2.30)

How can we choose hn+1, and how can we estimate σn+1, in such a way that relation (2.27) always holds?

Let us assume that σn+1 is known. If p is the order of the method we choose for the approximation, then we have

    σn+1 = c(tn, yn) hn+1^(p+1) + O(hn+1^(p+2)),                         (2.31)

where the function c(tn, yn) is independent of h.


If at the previous time step we obtained an estimate of hn+1 such that relation (2.27) does not hold, omitting higher order terms, this means that

    ‖σn+1‖ = ‖c(tn, yn)‖ hn+1^(p+1) > ε hn+1.                            (2.32)

Hence, we must consider hnew such that (2.27) holds, i.e.,

    ‖c(tn, yn)‖ hnew^(p+1) ≤ ε hnew.                                     (2.33)

For good measure we impose

    ‖c(tn, yn)‖ hnew^(p+1) = (1/2) ε hnew.

Consequently, exploiting (2.32), which is true to a first approximation, for hnew we obtain the following estimate:

    hnew = hn+1 (ε hn+1 / (2 ‖σn+1‖))^(1/p).                             (2.34)

If ‖σn+1‖ ≫ ε hn+1, the equality in (2.32) may be too rough and, consequently, hnew could turn out to be unnecessarily small. As a safeguard from below, we impose

    hnew ≥ (1/2) hn+1.                                                   (2.35)

Now we define hn+1 = hnew and redo the test (2.27). In case of failure, we repeat the procedure, imposing a maximum number of possible consecutive refusals (for example, two). After hn+1 has been accepted, we continue. Since

    σn+2 = c(tn+1, yn+1) hn+2^(p+1) + O(hn+2^(p+2)),
    with c(tn+1, yn+1) = c(tn, yn) + O(hn+1),                            (2.36)

we obtain that, to a first approximation,

    σn+2 = c(tn, yn) hn+2^(p+1).                                         (2.37)

With the same techniques used above, we impose

    ‖c(tn, yn)‖ hn+2^(p+1) = (1/2) ε hn+2,                               (2.38)

and obtain

    hn+2 = hnew.                                                         (2.39)


Here we fix an upper bound to prevent anomalous amplifications of the step:

    hn+2 ≤ 2 hn+1.                                                       (2.40)

Summing up and putting together (2.34), (2.35), (2.39) and (2.40), we compute

    hnew = hn+1 min{2, max{1/2, R}},   with   R = (ε hn+1 / (2 ‖σn+1‖))^(1/p),

and we set hn+1 = hnew, if we need to repeat the procedure, or hn+2 = hnew, if we can continue.
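The whole acceptance/adjustment rule can be summarized in a short Octave sketch (illustrative only: step is a hypothetical routine returning the approximation of order p together with the local error estimate sigma, and eps_tol denotes the per unit step tolerance ε):

    function [t, y, h] = advance(step, f, t, y, h, eps_tol, p)
      for attempt = 1:3                     % at most two consecutive refusals
        [ynew, sigma] = step(f, t, y, h);   % hypothetical method of order p
        R = (eps_tol * h / (2 * norm(sigma)))^(1 / p);
        hnew = h * min(2, max(0.5, R));     % the rule combining (2.34)-(2.40)
        if norm(sigma) <= eps_tol * h       % test (2.27): the step is accepted
          t = t + h;  y = ynew;  h = hnew;  % hnew is proposed for the next step
          return;
        end
        h = hnew;                           % the step is refused: retry
      end
      error("step-size control: too many consecutive refusals");
    end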

How do we choose the initial step?

Usually, for h1 one proceeds as follows:

• choose an arbitrary step;

• repeat the procedure h1 = hnew until the relation (1/4) ε h1 ≤ ‖σ1‖ ≤ ε h1 is satisfied, where the first inequality prevents an excessively small step-size from being taken.

To avoid too many iterations when h1 is very imprecise, we can replace the limitation

    (1/2) h1 ≤ hnew ≤ 2 h1

by

    (1/5) h1 ≤ hnew ≤ 5 h1.

What remains is to understand how to obtain an estimate of the local error σn+1. We have several possibilities. Here we focus on two different strategies, which underlie the class of Runge-Kutta-Fehlberg methods and the class of Dormand-Prince methods, respectively.

In the first case (RKF), we consider a method of order p + 1 to obtain at each step a better approximation of the local solution zn+1(t) at tn+1. Let y*n+1 be such an approximation. Therefore, as estimate for σn+1 we take the quantity

    σ*n+1 = y*n+1 − yn+1,

which differs from the exact σn+1 by terms of order p + 2, i.e., we have

    σn+1 = σ*n+1 + O(hn+1^(p+2)).

To sum up, in the RKF approach, the method of order p is used to approximate the solution, whereas the method of order p + 1 is needed to estimate the local error.

However, a different strategy (DoPri) exists, which proposes to use both the methods of order p and p + 1 in a more efficient way. Indeed, the idea is not to "waste" the method of higher order just to estimate the local error. Suppose we proceed with the method of higher order. We have

    zn+1(tn+1) − y*n+1 = d(tn, y*n) hn+1^(p+2) + O(hn+1^(p+3)),          (2.41)
    zn+1(tn+1) − yn+1 = c(tn, y*n) hn+1^(p+1) + O(hn+1^(p+2)),           (2.42)

and, in particular, from equation (2.42), omitting the higher order terms, we obtain the already known relation

    ‖zn+1(tn+1) − yn+1‖ = ‖y*n+1 − yn+1‖,                                (2.43)

which is valid to a first approximation.

It is now necessary to introduce a technical hypothesis in order to continue with the analytical treatment (even if this assumption is almost impossible to verify in practice). We suppose that, in a neighbourhood of the solution, we have

    c(t, y) ≠ 0   ∀t, y.                                                 (2.44)

Since the method is convergent, we get c(tn, y*n) ≠ 0 ∀n if all the steps accepted by the algorithm are sufficiently small. Hence, we can write, always to a first approximation,

    ‖zn+1(tn+1) − y*n+1‖ = (‖d(tn, y*n)‖ / ‖c(tn, y*n)‖) ‖y*n+1 − yn+1‖ hn+1.   (2.45)

At this point, we define e*n = y(tn) − y*n as the global error of the method of order p + 1. The condition

    ‖y*n+1 − yn+1‖ ≤ ε                                                   (2.46)

with a fixed ε > 0 implies

    ‖zn+1(tn+1) − y*n+1‖ ≤ ε (‖d(tn, y*n)‖ / ‖c(tn, y*n)‖) hn+1.         (2.47)


Similarly to what we have already seen,

    ‖e*N‖ ≤ ε Σ(n=1..N) (‖d(tn−1, y*n−1)‖ / ‖c(tn−1, y*n−1)‖) e^(M(tf−tn)) hn,

where the sum is an approximation of

    ∫(t0..tf) (‖d(t, y(t))‖ / ‖c(t, y(t))‖) e^(M(tf−t)) dt.

Once again we obtain the existence of a constant K* > 0 such that, if max(1≤n≤N) {hn} is sufficiently small, then

    ‖e*N‖ ≤ K* ε.                                                        (2.48)

Notice that here the test (2.46) is carried out using a per step tolerance instead of a per unit step tolerance. Moreover, besides depending on the one-sided Lipschitz constant, the constant K* also depends on the function ‖d(t, y(t))‖/‖c(t, y(t))‖. However, for practical purposes, the situation doesn't change (provided that the technical hypothesis (2.44) holds).

In this strategy we must change the estimate for hnew. Indeed, the factor R to be adopted in this case becomes

    R = (ε / (2 ‖σ*n+1‖))^(1/(p+1)).

2.3 Numerical tests

In this Section we consider a test equation (of which we already know the solution) in order to compare the two strategies (RKF and DoPri) and to verify some of the considerations made in Section 2.1, on the basis of numerical experiments performed using the Octave programming language.

The aim is to test, in some pathological cases, the method with step-size control which will be used below to solve numerically the differential problems relevant to this thesis.

For the sake of simplicity, and in order to have an immediate graphical visualization of the results, we consider a scalar differential equation as test problem. The generalization of the programs to more than one dimension is obtained by replacing absolute values with an appropriate vector norm.

As test differential equation, consider

    y′(t) = (y(t) − g(t))² + g′(t),   t ∈ [t0, tf],
    y(t0) = g(t0).                                                       (2.49)


It is immediate to prove that the above initial value problem has the solution y(t) = g(t). It is sufficient to assign the function g in order to know the exact solution.

We define the function g so that its behaviour motivates the choice of a numerical method with step-size control, i.e., we choose the exact solution so that it presents some smooth features but also some sharp oscillations. We consider as function g the sum of two Gaussian functions:

    g(t) = α1 e^(−((t−t1)/σ1)²) + α2 e^(−((t−t2)/σ2)²).                  (2.50)

The derivative g′ is then

    g′(t) = −2 [ (α1 (t−t1)/σ1²) e^(−((t−t1)/σ1)²) + (α2 (t−t2)/σ2²) e^(−((t−t2)/σ2)²) ].   (2.51)

In order to carry out the numerical tests, we assign the following values to the parameters of equation (2.50):

    α1 = 1,   σ1 = 1,     t1 = 0,
    α2 = 1,   σ2 = 0.5,   t2 = 5.                                        (2.52)

We deal with the numerical integration in the temporal interval [−4, 8], so as to focus on the behaviour of the approximations in the part where the exact solution is subject to oscillations.
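In Octave, the test problem (2.49) with the choices (2.50)-(2.52) can be set up as follows (a direct transcription of the formulas above; the variable names are ours):

    alpha1 = 1;  sigma1 = 1;    t1 = 0;     % parameter values (2.52)
    alpha2 = 1;  sigma2 = 0.5;  t2 = 5;
    g  = @(t) alpha1 * exp(-((t - t1) / sigma1).^2) ...
            + alpha2 * exp(-((t - t2) / sigma2).^2);            % (2.50)
    gp = @(t) -2 * (alpha1 * (t - t1) / sigma1^2 .* exp(-((t - t1) / sigma1).^2) ...
                  + alpha2 * (t - t2) / sigma2^2 .* exp(-((t - t2) / sigma2).^2));  % (2.51)
    f  = @(t, y) (y - g(t)).^2 + gp(t);     % right-hand side of (2.49)
    t0 = -4;  tf = 8;  y0 = g(t0);          % integration interval, initial value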

    In this interval, the exact solution has the behaviour shown in Figure 2.1.

    Figure 2.1: The exact solution.

At this point, we implement the program for the numerical integration with step-size control in both strategies previously analyzed. As the pair of methods, we choose Euler as method of order one and, as method of order two, the one resulting from the formula

    2 E(h/2) − E(h),                                                     (2.53)

i.e., the difference between the Euler method applied twice with step h/2 and the same method applied once with step h (see the general formula (2.13) in the case p = 1). Clearly, in order to obtain an explicit method as a result of formula (2.53), we must consider explicit Euler. On the other hand, if we want the corresponding implicit version, we have to implement implicit Euler.

In the following two tables, we report the maximum global error, emax, obtained in the numerical integration of equation (2.49) with the methods mentioned so far (implicit Euler, implicit combination of Euler's methods, explicit Euler and explicit combination of Euler's methods) with a constant step h. The four methods are denoted by Eimp, Combimp, Eexp and Combexp, respectively.

We use Newton's method with fixed precision equal to 10⁻⁵ for the solution of the nonlinear equations deriving from the use of the various implicit Euler methods. For more specific considerations on the error due to the use of Newton's method, see Chapter 3.

    method     h          emax        emax/h^p
    Eimp       0.5        2.56        5.1200
               0.25       0.236       0.9440
               0.125      0.116       0.9280
               0.0625     0.055       0.8800
               0.03125    0.0273      0.8736
               0.015625   0.0135      0.8640
               0.0078125  0.00673     0.8614
    Combimp    0.5        0.226       0.9040
               0.25       0.0264      0.4224
               0.125      0.00446     0.2854
               0.0625     0.00126     0.3226
               0.03125    0.000323    0.3308
               0.015625   0.0000812   0.3326
               0.0078125  0.0000203   0.3326

    method     h          emax        emax/h^p
    Eexp       0.5        0.832       1.6640
               0.25       0.23        0.9200
               0.125      0.116       0.9280
               0.0625     0.0555      0.8880
               0.03125    0.0273      0.8736
               0.015625   0.0135      0.8640
               0.0078125  0.00673     0.8614
    Combexp    0.5        0.116       0.4640
               0.25       0.0231      0.3696
               0.125      0.00537     0.3437
               0.0625     0.00131     0.3354
               0.03125    0.000326    0.3338
               0.015625   0.0000814   0.3334
               0.0078125  0.0000203   0.3326

In the fourth column of the tables we list the ratio between the maximum global error and h^p, where p is the order of the considered method.

From the theory we know that both the implicit and the explicit Euler methods are of order one, the two combinations are methods of order two and, in general, for a method of order p (omitting the infinitesimals of higher order) it holds that

    emax = c h^p,                                                        (2.54)

where c is a constant which depends on the considered method.

This is consistent with the fourth column of the tables: the ratios emax/h^p seem to tend, as h approaches zero, towards a constant value in all four cases. Furthermore, simply by observing the third column, we see that, halving the step in the methods of order one, the maximum global error is asymptotically halved, whereas, in the methods of order two, halving the step reduces the error by a factor of four.
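This check can be automated: if emax_h and emax_h2 denote the measured maximum global errors for steps h and h/2 (hypothetical variables holding two rows of the tables), the observed order is

    p_obs = log2(emax_h / emax_h2);   % close to 1 for Euler, 2 for the combinations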

For the sake of simplicity, we choose the Euler method as the basic method for the numerical integration of differential equations in this thesis. Numerous other possibilities exist (such as Runge-Kutta methods of higher order or multistep methods). However, since a high level of accuracy is not necessary for the problems dealt with in this thesis, we preferred the simplicity of Euler's method to other methods, which are more accurate but, at the same time, more expensive.


Comparison of some results obtained with the two strategies

We now report the results deriving from the application of the method with step-size control to equation (2.49), where g is given by relation (2.50) with parameter values equal to those in (2.52).

These results have been obtained by fixing h0 = 0.5 as initial step and 10⁻⁵ as precision of the Newton method used in the solution of the various implicit Euler methods present in the programs.

In the following table we denote by ε the per unit step tolerance, by emax the maximum global error and by µ the number of steps executed by the algorithm. In the last column we compute the increment of the number of steps when changing from a particular tolerance to the subsequent one, i.e., the ratio between the current number of steps and the previous value of the column "µ", denoted by µnew and µold, respectively.

• Runge-Kutta-Fehlberg strategy:

    method     ε        emax       µ        µnew/µold
    implicit   0.1      1.45       137
               0.01     0.0337     1039     7.5839
               0.001    0.00292    10305    9.9182
               0.0001   0.000322   102939   9.9892
    explicit   0.1      0.412      94
               0.01     0.0245     1033     10.9893
               0.001    0.00454    10289    9.9603
               0.0001   0.000317   102933   10.0042

  Diminishing the per unit step tolerance by a factor 10 corresponds to requiring an approximation which locally is 10 times better than the previous one. From the table we can see that the accuracy of both methods improves by roughly a factor 10 at each reduction of the tolerance.

  In the RKF approach, to obtain the approximation of the solution at the next point of the temporal grid, one proceeds with the method of lower order. In this case, the order is that of the Euler method, i.e., one, and, consequently, the local truncation error is O(h).

  Considering an average step, it is thus reasonable to expect that, reducing the per unit step tolerance by a factor 10, the number of steps executed by the algorithm globally increases approximately 10 times. This is the case in both the implicit and the explicit situation: indeed, in both cases the value of the fourth column is close to 10.

  Finally, note that the explicit and the implicit methods behave on the same quality level, i.e., they perfectly match the theoretical expectations, since the original problem is not stiff.

In the following table, as above, we denote by emax the maximum global error and by µ the number of steps executed by the algorithm. In the last column we still write the ratio between two subsequent values of the column "µ" (again µnew and µold), but here ε is the per step tolerance.

• Dormand-Prince strategy:

    method     ε        emax        µ      µnew/µold
    implicit   0.1      0.0192      36
               0.01     0.00285     89     2.4472
               0.001    0.000389    260    2.9213
               0.0001   0.0000476   800    3.0769
    explicit   0.1      0.046       33
               0.01     0.00502     87     2.6364
               0.001    0.00184     256    2.9425
               0.0001   0.0000851   797    3.1133

  First of all, we observe that, also with this strategy, the two methods behave on the same quality level, because the equation we want to integrate is nonstiff.

  Differently from the RKF case, here we proceed with the method of order two (the combination of Euler's methods) and, therefore, the local discretization error is O(h²). Again considering an average step, locally requiring an error 10 times smaller means reducing the step by about √10 ≈ 3.1623 times. The numerical tests reflect this situation: indeed, one sees that the values of the fourth column approach √10.

  From the above results it is evident that, with equal tolerance, the second strategy (DoPri) yields an even more precise approximate solution while performing a definitely lower number of steps with respect to the RKF case. This clearly follows from the fact that, in the DoPri approach, the approximations of the solution at each node of the mesh are provided by the method of higher order.


2.3.1 The stiff case

Consider the initial value problem

    y′(t) = λ (y(t) − g(t)) + g′(t),   t ∈ [t0, tf],
    y(t0) = g(t0),                                                       (2.55)

where the right-hand side is f(t, y) = λ(y − g(t)) + g′(t). Note that, for (2.55), we have

    sup{t∈[t0,tf], y∈R} ‖∂f/∂y‖ = |λ|.

Hence λ is the parameter on which "to play" in order to obtain a stiff problem (see the discussion at the beginning of Section 2.1).

It can be immediately proved that the Cauchy problem (2.55) also has the unique solution y(t) = g(t). For the sake of simplicity, and for the reasons already mentioned, we consider the same function g as before (with the same parameter values).

First of all, note that, with a linear problem such as (2.55), the precision of Newton's method, used to solve the equation resulting from the use of an implicit method, does not influence the result at all. Indeed, that equation, for linear problems, remains linear and consequently only one iteration of the method is sufficient to obtain the exact root. Actually, being able to write the approximation yn+1 explicitly in terms of the approximation yn, we don't need to use Newton's method at all.
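To make this explicit, for (2.55) the implicit Euler step (2.12) reads

    yn+1 = yn + h [λ (yn+1 − g(tn+1)) + g′(tn+1)],

and, the equation being linear in yn+1, it can be solved in closed form:

    yn+1 = ( yn + h (g′(tn+1) − λ g(tn+1)) ) / (1 − hλ).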

In the following table, where λ is the parameter which regulates the stiffness, we show the maximum global error obtained by applying the various methods considered, with a fixed step h. The notations are the same as before.

    λ       h      emax (Eimp)   emax (Eexp)     emax (Combimp)   emax (Combexp)
    −1      0.1    0.0895        0.112           0.003            0.00384
            0.01   0.00936       0.0111          0.0000345        0.0000353
            0.001  0.00094       0.0011          0.000000349      0.00000035
    −10     0.1    0.0334        0.169           0.00322          0.0174
            0.01   0.00359       0.0165          0.0000795        0.0000966
            0.001  0.000362      0.00164         0.000000903      0.000000921
    −100    0.1    0.00384       1.32 · 10^106   0.000297         8.44 · 10^148
            0.01   0.000399      0.0172          0.0000363        0.0002
            0.001  0.0000399     0.00171         0.000000867      0.00000106
    −1000   0.1    0.000391      1.34 · 10^230   0.0000256        4.67 · 10^307
            0.01   0.00004       ∞               0.000000245      ∞
            0.001  0.000004      0.00172         0.0000000364     0.000002

It is clear that, the more the stiffness of the problem grows, the more the difference between implicit and explicit methods increases. In particular, the two explicit methods, when stiffness is high, give results which are completely unacceptable, whereas the corresponding implicit methods perform well.

Note that, for the implicit methods and, in the nonstiff case, also for the explicit ones, the results obtained reflect the theoretical considerations made on the order of the methods at the beginning of Section 2.3. For the explicit methods, in the stiff case, this is not so evident because the chosen step-size is not sufficiently small to cut down the effect of stiffness. However, it is reasonable to suppose that, continuing to reduce the step, one obtains results consistent with the theory of orders even in the case of explicit methods.

Now we want to test both the strategies RKF and DoPri on a stiff problem, and we want to see how, within a particular approach, the pair of implicit methods differs from the corresponding explicit pair.

Following the notation of Section 2.3, in the next table we denote by ε the per unit step tolerance in the RKF case and the per step tolerance in the DoPri approach; emax is the maximum global error and µ is the number of steps executed by the algorithm.

Implementing both strategies, starting from the initial step h0 = 0.5, we get the following results.


• Runge-Kutta-Fehlberg strategy:

    λ       method     ε        emax        µ
    −1      implicit   0.1      0.0641      105
                       0.01     0.0054      1031
                       0.001    0.000477    10293
                       0.0001   0.0000492   102929
            explicit   0.1      0.0938      109
                       0.01     0.00515     1034
                       0.001    0.00189     10286
                       0.0001   0.0000492   102934
    −10     implicit   0.1      0.0372      6
                       0.01     0.00415     979
                       0.001    0.000703    10224
                       0.0001   0.0000423   102859
            explicit   0.1      0.0771      110
                       0.01     0.00414     1011
                       0.001    0.0012      10280
                       0.0001   0.00000891  102915
    −100    implicit   0.1      0.00388     6
                       0.01     0.00388     6
                       0.001    0.000398    9673
                       0.0001   0.0000765   102100
            explicit   0.1      0.0334      388
                       0.01     0.000945    1042
                       0.001    0.0000462   10189
                       0.0001   0.000022    102774
    −1000   implicit   0.1      0.00039     6
                       0.01     0.00039     6
                       0.001    0.00039     6
                       0.0001   0.000108    94242
            explicit   0.1      0.00355     3341
                       0.01     0.00355     3887
                       0.001    0.000971    11374
                       0.0001   0.00000849  101977

  When λ = −1, i.e., when the problem is nonstiff, we notice a behaviour similar to that seen in the nonstiff case previously analyzed. In fact, also in this case the implicit and the explicit methods behave in the same way. In particular, when the per unit step tolerance is reduced by a factor 10, the number of steps increases about 10 times. Furthermore, the accuracy reached by the two methods is roughly the same in both cases, and the error is diminished by approximately a factor 10 at each reduction of the tolerance.

  As the order of magnitude of the modulus of the coefficient λ grows, i.e., as stiffness increases, the difference between the explicit and the implicit method becomes more and more evident. Clearly, with equal per unit step tolerance, the method needing the smaller number of steps to terminate is the implicit one because, with this method, the approximations provided are more accurate and allow us to accept a much larger step-size. In particular, for the explicit method we see that even a rough tolerance requires a high number of steps. On the contrary, for the implicit method, which is highly stable, we observe that the stiffness helps to cut down the local error (as in the first row for λ = −10, in the first two rows for λ = −100 and in the first three rows for λ = −1000). However, this help becomes insufficient when too high precisions are required. In these cases we can see the role of accuracy.

  Therefore, as long as the stiffness has a strong effect, the difference between the number of steps required by the implicit method and that required by the corresponding explicit method is very high. When ε ≪ 1, instead, this difference is no longer significant.

  As far as accuracy is concerned, one observes that, even with few steps, the implicit method provides very good approximations of the solution at the considered nodes of the grid. However, we must take into account that the low number of mesh points prevents the reproduction of the overall behaviour of the exact solution outside the grid points, e.g., by an interpolatory procedure. Consequently, from this point of view, the method may provide a globally unsatisfactory approximation, even though it is very accurate at the mesh nodes. In this regard, see Figure 2.2, where we have performed a linear interpolation of the values obtained by the numerical approximation at the points of the grid.

Figure 2.2: RKF, λ = −10, implicit method, per unit step tolerance = 0.1.


• Dormand-Prince strategy:

    λ       method     ε        emax        µ
    −1      implicit   0.1      0.0112      30
                       0.01     0.00133     85
                       0.001    0.000187    255
                       0.0001   0.0000246   794
            explicit   0.1      0.0351      31
                       0.01     0.00136     89
                       0.001    0.00108     257
                       0.0001   0.0000291   798
    −10     implicit   0.1      0.0543      8
                       0.01     0.00162     57
                       0.001    0.000318    215
                       0.0001   0.0000192   745
            explicit   0.1      0.0376      56
                       0.01     0.00432     108
                       0.001    0.000521    276
                       0.0001   0.0000319   814
    −100    implicit   0.1      0.00176     6
                       0.01     0.00134     8
                       0.001    0.000181    88
                       0.0001   0.0000421   478
            explicit   0.1      1.32        548
                       0.01     0.0121      567
                       0.001    0.00137     627
                       0.0001   0.00065     1108
    −1000   implicit   0.1      0.000179    6
                       0.01     0.000179    6
                       0.001    0.00198     8
                       0.0001   0.0000176   105
            explicit   0.1      40.1        5721
                       0.01     40.1        5790
                       0.001    11.4        5885
                       0.0001   0.0638      5902

  Here also, the considerations related to the case λ = −1 are analogous to those we made for the nonstiff problem (2.49).

  Concerning the other values of the stiffness coefficient, we notice that the difference between the two methods is much more evident than in the previous approach. In particular, note the results obtained in terms of maximum global error in the case λ = −1000. It is evident that the implicit method works very well, whereas the explicit one provides an unacceptable approximate solution.

  In any case, the table shows that, in order to terminate, the implicit method always requires a number of steps decidedly lower than that required in the explicit case. Using the implicit method, the accuracy of the approximation is also better, and it is already very high when the per step tolerance is not too restrictive.

Conclusions

Comparing the previous two tables, we see that, as seems reasonable, the DoPri strategy requires many fewer steps than the RKF one (because we proceed with the method of higher order).

Keeping this in mind and recalling the previous remarks, it is evident that, in any case, the more efficient approach is the DoPri one and that, in particular for stiff problems, it is more convenient to use the implicit method.

The greater computational effort per step is in fact justified by the excellent results obtained in terms of error and number of steps in the implicit case.

On the other hand, if the problem we deal with is nonstiff, it is preferable to apply the DoPri strategy with the explicit method. Indeed, as we have seen, the two methods behave in a similar way and, hence, it is reasonable to opt for the method which is less expensive from a computational point of view.


2.4 Iterative methods

This Section begins by reviewing two basic iterative methods for solving linear systems. Given an N × N real matrix A and a real N-vector b, the problem considered is: find x ∈ RN such that

    Ax = b.                                                              (2.56)

Equation (2.56) is a linear system, A is the coefficient matrix, b is the right-hand side vector, and x is the vector of unknowns.

Consider the following decomposition of the coefficient matrix:

    A = D − E − F,                                                       (2.57)

in which D is the diagonal of A, −E its strict lower part, and −F its strict upper part (so that, pictorially, the diagonal of A is D, the triangle below it is −E and the triangle above it is −F).   (2.58)

It is always assumed that the diagonal entries of A are all nonzero. In what follows, {x^(k)} is the sequence of iterates, xi^(k) is the i-th component of the iterate x^(k), bi is the i-th component of the right-hand side b, and aij is the generic element of the matrix A.

The Jacobi iteration determines the i-th component of the next approximation so as to annihilate the i-th component of the residual vector b − Ax^(k+1), i.e.,

    (b − Ax^(k+1))i = 0.                                                 (2.59)

This yields

    xi^(k+1) = (1/aii) (bi − Σ(j=1..N, j≠i) aij xj^(k)),   for i = 1, . . . , N.   (2.60)

This is a component-wise form of the Jacobi iteration. The above notation can be used to rewrite the Jacobi iteration (2.60) in vector form as

    x^(k+1) = D^(−1)(E + F) x^(k) + D^(−1) b.                            (2.61)
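A minimal Octave sketch of the vector form (2.61) (the interface and the simple stopping rule are our own choices; A is assumed to have nonzero diagonal entries):

    function x = jacobi(A, b, x, maxit, tol)
      D = diag(diag(A));
      R = A - D;                       % R = -(E + F)
      for k = 1:maxit
        xnew = D \ (b - R * x);        % x^(k+1) = D^(-1)((E + F) x^(k) + b)
        if norm(xnew - x) <= tol
          x = xnew;  return;
        end
        x = xnew;
      end
    end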

Similarly, the Gauss-Seidel iteration corrects the i-th component of the current approximate solution, in the order i = 1, . . . , N, again to annihilate the i-th component of the residual. However, this time the approximate solution is updated immediately after the new component is determined. Therefore, we have the component-wise form

    xi^(k+1) = (1/aii) (bi − Σ(j=1..i−1) aij xj^(k+1) − Σ(j=i+1..N) aij xj^(k)),   (2.62)

for i = 1, . . . , N, which immediately leads to the vector form

    x^(k+1) = (D − E)^(−1) F x^(k) + (D − E)^(−1) b.                     (2.63)
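The corresponding Gauss-Seidel sketch differs only in the splitting: the lower triangular part D − E of A is applied by forward substitution (again, interface and stopping rule are illustrative):

    function x = gauss_seidel(A, b, x, maxit, tol)
      L = tril(A);                     % L = D - E
      U = A - L;                       % U = -F
      for k = 1:maxit
        xnew = L \ (b - U * x);        % x^(k+1) = (D - E)^(-1)(F x^(k) + b)
        if norm(xnew - x) <= tol
          x = xnew;  return;
        end
        x = xnew;
      end
    end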

2.4.1 Iteration matrices and convergence

The Jacobi and Gauss-Seidel iterations are of the form

    x^(k+1) = G x^(k) + c,                                               (2.64)

in which

    G_JA = I − D^(−1) A,   G_GS = I − (D − E)^(−1) A                     (2.65)

are the iteration matrices for the Jacobi and Gauss-Seidel iterations, respectively. Moreover, given the matrix splitting

    A = M − N,                                                           (2.66)

a linear fixed-point iteration can be defined by the recurrence

    x^(k+1) = M^(−1) N x^(k) + M^(−1) b,                                 (2.67)

which has the form (2.64) with

    G = M^(−1) N = I − M^(−1) A   and   c = M^(−1) b.                    (2.68)

For example, we have M = D, N = E + F for the Jacobi iteration, and M = D − E, N = F for the Gauss-Seidel iteration.

If iteration (2.64) converges, its limit x satisfies

    x = G x + c.                                                         (2.69)

In the two cases considered, due to (2.67), it is easy to see that x then satisfies

    x = M^(−1) N x + M^(−1) b,   i.e.,   M x = N x + b.                  (2.70)

Therefore, x is a solution of the original linear system (2.56).

    28

  • With regard to convergence, in [Saad, 2000] the following theorem isproved.

    Theorem 1 Let G be a square matrix such that ρ(G) < 1. Then I − G isnonsingular and the iteration (2.64) converges for any c and initial iteratex(0).

    Besides knowing that the sequence (2.64) converges, it is desirable to know“how fast” it converges. Let e(k) be the error at step k, i.e., e(k) = x(k) − x(where x is the limit of the sequence {x(k)}).

    The convergence factor ρ of a sequence is the limit

    ρ = limk→∞

    ( ||e(k)||||e(0)||

    )1/k

    , (2.71)

    while the convergence rate τ is the natural logarithm of the inverse of theconvergence factor, i.e.,

    τ = − ln ρ. (2.72)The above definition depends on the initial vector x0, so it may be called

    specific convergence factor. A general convergence factor of a sequencecan be defined as the limit

    ρ̄ = limk→∞

    (

    maxx0∈RN

    ||e(k)||||e(0)||

    )1/k

    . (2.73)

    Since e(k) satisfiese(k) = Gke(0), (2.74)

    it follows that

    ρ̄ = limk→∞

    (

    maxx0∈RN

    ||Gke(0)||||e(0)||

    )1/k

    = limk→∞

    (||Gk||

    )1/k. (2.75)

    As an application of the Jordan canonical form of a matrix (see [Saad, 2000]),for any norm we have

    limk→∞

    (||Gk||

    )1/k= ρ(G). (2.76)

    Thus, the general convergence factor is equal to the spectral radius of theiteration matrix G.

    29

  • 30

  • Chapter 3

    Newton’s method

    Newton’s method, or Newton-Raphson’s method (NR), is one of the classicaliterative methods for approximating the solution of nonlinear equations. Herewe use the standard notation

    F (x) = 0 (3.1)

    for systems of N equations in N unknowns.

    In this Chapter the vector x∗ denotes a solution, x a potential solutionand {xj}j≥0 a sequence of iterates. We denote the i-th component of a vectorx by ξi and the i-th component of xj by ξ

    (j)i . The notation ∂g/∂ξi indicates

    the partial derivative of a function g with respect to ξi. As is standard,e = x−x∗ denotes the error. So, for example, ej = xj−x∗ is the error in thej-th iterate. If F : RN → RN and the components Fi of F are differentiableat x ∈ RN for i = 1, . . . , N , we define the Jacobian matrix J(x) = F ′(x)by

    J(x)ik =∂Fi∂ξk

    (x). (3.2)

    Let || · || denote a norm on RN . Iterative methods can be classified by theirrate of convergence.

    Let {xj} ⊂ RN and x∗ ∈ RN . Then

    • xj → x∗ q-quadratically if xj → x∗ and there is K > 0 such that

    ||xj+1 − x∗|| ≤ K||xj − x∗||2. (3.3)

    31

  • • xj → x∗ q-superlinearly with q-order α > 1 if xj → x∗ and thereis K > 0 such that

    ||xj+1 − x∗|| ≤ K||xj − x∗||α. (3.4)

    • xj → x∗ q-superlinearly if

    limj→∞

    ||xj+1 − x∗||||xj − x∗||

    = 0. (3.5)

    • xj → x∗ q-linearly with q-factor σ ∈ (0, 1) if

    ||xj+1 − x∗|| ≤ σ||xj − x∗|| (3.6)

    for j sufficiently large.

    In the following, where it is not stated differently, || · || denotes the Euclideannorm on RN :

    ||x|| =(

    N∑

    i=1

    |ξi|2)1/2

    . (3.7)

    3.1 Newton’s sequence

    The sequence of iterates generated by Newton’s method is obtained by

    F ′(xj)(xj+1 − xj) = −F (xj). (3.8)

    Let j be fixed. To understand the meaning of such a choice for the iterationprocess, consider the Taylor series expansion of F (x) with respect to xj . Wecan write

    F (x) = F (xj) + F′(xj)(x− xj) +O(||x− xj ||2). (3.9)

    The idea behind (3.8) is that we replace F (x) with a linear functionMj(x)given by the Taylor series expansion of F , with respect to xj , truncated atthe first order, i.e.,

    Mj(x) = F (xj) + F′(xj)(x− xj). (3.10)

    Then, we let the root of Mj be the next iteration.

    32

  • The computation of a Newton iteration requires:

    1. the evaluation of F (xj) and a test for termination;

    2. the approximate solution of the equation

    F ′(xj)sj = −F (xj) (3.11)

    for the Newton step sj;

    3. the definition of xj+1 = xj + sj.

    The computation of the Newton step (item 2) requires most of the workand the variations in Newton’s method discussed in this thesis differ mostsignificantly in how the step sj is approximated. The computation of sj mayrequire evaluation and factorization of the Jacobian matrix or the solutionof the equation (3.11) by an iterative method.

    3.1.1 Local convergence theory

    Let us consider the following standard assumptions:

    1. equation (3.1) has a solution x∗ ∈ RN ;

    2. F ′ is Lipschitz continuous in a neighborhood of x∗;

    3. F ′(x∗) is nonsingular.

    Recall that Lipschitz continuity in a neighborhood of x∗ means that thereis L > 0 such that

    ||F ′(x)− F ′(y)|| ≤ L||x− y|| (3.12)for all x, y sufficiently close to x∗.

    A well-known convergenge theorem is as follows:

    Theorem 2 Let the standard assumptions hold. If x0 is sufficiently near x∗,

    then the Newton sequence exists (i.e., F ′(xj) is nonsingular for all j ≥ 0), itconverges to x∗ and there is c > 0 such that

    ||ej+1|| ≤ c||ej||2 (3.13)

    for sufficiently large j.

    See [Ortega & Rheinboldt, 1970] for proof.

    33

  • The convergence described by (3.13), in which the error in the solutionis roughly squared with each iteration, is called q-quadratic (as alreadystated). Squaring the error roughly means that the number of significantdigits in the result doubles with each iteration.

    The assumption that the initial iterate be “sufficiently near” the solutionx∗ may seem artificial, but there are many situations in which the initialiterate is very near the root. For example, consider the case of the implicitintegration of ordinary differential equations, in which the initial iterate isderived from the solution at the previous time step.

    3.1.2 Termination of the iteration

    From the general theory about fixed-point iteration (see [Kelley, 1995]) it ispossible to obtain the following result:

    Proposition 1 Assume the standard assumptions hold. Then, for x suffi-ciently near x∗,

    ||e||4||e0||κ(F ′(x∗))

    ≤ ||F (x)||||F (x0)||≤ 4||e||κ(F

    ′(x∗))

    ||e0||, (3.14)

    where κ(F ′(x∗)) = ||F ′(x∗)|| ||F ′(x∗)−1|| is the condition number of F ′(x∗)related to the norm || · ||.

    From Proposition 1, we conclude that if F ′(x∗) is well conditioned, the sizeof the relative nonlinear residual ||F (x)||/||F (x0)|| is a good indicator ofthe size of the error. Therefore, we may terminate the iteration when therelative nonlinear residual is small. However, if there is error in the evaluationof F or the initial iterate is near a root, a termination decision based on therelative residual may be made too late in the iteration.

    A better choice for the termination criterion is to use

    ||F (x)|| ≤ τr||F (x0)||+ τa, (3.15)

    which allows to have more control on the relative and absolute size of thenonlinear residuals. The relative error tolerance τr and the absolute errortolerance τa are input to the algorithm used to solve (3.1).

    Another way to decide when to stop is to look at the Newton step

    sj = −(F ′(xj))−1F (xj) = xj+1 − xj, (3.16)

    34

  • and terminate the iteration when ||sj|| is sufficiently small. This criterion isbased on Theorem 2, which implies that

    ||ej || = ||sj||+O(||ej||2). (3.17)

    Hence, near the solution sj and ej are essentially the same size.

    3.2 Approximations to the Jacobian

    As already seen, in order to compute the Newton iterate xj+1 from a currentpoint xj one must first evaluate F (xj) and decide whether to terminate theiteration. If one decides to continue, the Jacobian F ′(xj) must be computedand factorized. The step is computed as the solution of F ′(xj)s = −F (xj)and the iterate is updated xj+1 = xj + s.

    Factorization of F ′ in the dense case costs O(N3) floating-point opera-tions. Evaluation of F ′ by finite differences should be expected to cost Ntimes the cost of an evaluation of F because each column in F ′ requires anevaluation of F to form the difference approximation. Hence the cost of aNewton step may be roughly estimated as N +1 evaluations of F and O(N3)floating-point operations. In many cases, F ′ can be computed more effi-ciently, accurately, and directly than with differences and the analysis abovefor the cost of a Newton step is very pessimistic.

    An approach to reduce the cost of computing and factorizing F ′(xj) is toapproximate the Jacobian F ′(xj) in a way that not only avoids computationof the derivative, but also saves linear algebra work and matrix storage.

    The price for such an approximation is that the nonlinear iteration con-verges more slowly, i.e., more iterations are needed to solve the problemwith the same precision. However, the overall cost of solving (3.11) is usu-ally significantly lower, because the computation of the Newton step is lessexpensive.

    A way to approximate the Jacobian is to compute F ′(x0) and use itas an approximation to F ′(xj) throughout the iteration process. This ideadescribes the modified Newton method (MNR), where the sequence ofiterates {xj} is obtained from the relation

    F ′(x0)(xj+1 − xj) = F (xj). (3.18)

    More generally, if we consider the method given by the equation

    F ′(w0)(xj+1 − xj) = F (xj), (3.19)

    35

  • in which w0 may also be different from x0, we have the simplified Newtonmethod (SNR).

    Another way to approach this, is to compute an approximation to theJacobian along with an approximation to x∗ and update the approximateJacobian as the iteration progresses (see Broyden’s method in Chap-ter 6). Instead of approximating the Jacobian, one could solve the equationof the Newton step approximately. This is the idea behind another class ofmethods, namely the inexact Newton methods, which will be treated inChapter 5.

    The above approaches are all reviewed by Kelley in [Kelley, 2003].

    3.3 Errors in the function and the derivative

    Suppose that F and F ′ are computed inaccurately so that F + ε and F ′ + ∆are used instead of F and F ′ in the iteration. If ∆ is sufficiently small, theresulting iteration can return a result that is an O(ε) accurate approximationto x∗. If, for example, ε is entirely due to floating-point roundoff, there isno reason to expect that ||F (xj)|| will ever be smaller than ε in general. Werefer to this phase of the iteration in which the nonlinear residual is no longerbeing reduced as stagnation of the iteration.

    Theorem 3 Let the standard assumptions hold. Then there are K̄ > 0 andδ > 0 such that if xj is sufficiently near x

    ∗ and ||∆(xj)|| < δ, then

    xj+1 = xj − (F ′(xj) + ∆(xj))−1(F (xj) + ε(xj)) (3.20)

    is defined (i.e., F ′(xj) + ∆(xj) is nonsingular) and satisfies

    ||ej+1|| ≤ K̄(||ej||2 + ||∆(xj)|| ||ej||+ ||ε(xj)||). (3.21)

    See [Kelley, 1995] for proof.

    3.3.1 The modified Newton method

    Recall that the modified Newton method is given by

    xj+1 = xj − F ′(x0)−1F (xj). (3.22)

    Following notations of Theorem 3 we have

    ε(xj) = 0, ∆(xj) = F′(x0)− F ′(xj). (3.23)

    36

  • Hence, if xj and x0 are sufficiently near x∗,

    ||∆(xj)|| ≤ L||x0 − xj || ≤ L(||e0||+ ||ej||). (3.24)

    Applying Theorem 3, we obtain the following result:

    Proposition 2 Let the standard assumptions hold. Then there is Kc > 0such that if x0 is sufficiently near x

    ∗, the modified Newton iterates convergeq-linearly to x∗ and

    ||ej+1|| ≤ Kc||e0|| ||ej||. (3.25)

    3.4 Newton iteration error

    Consider the problem (2.14). Let

    x =

    ξ1...ξm

    ∈ Dm and F (x) =

    F1(x)...

    Fm(x)

    ∈ Rsm (3.26)

    with

    Fi(x) = ξi − yn − hm∑

    k=1

    aik f(tn + ck h, ξk). (3.27)

    Using these notations, computing a solution of (2.14) with the generalRunge-Kutta method is equivalent to finding a solution of the nonlinearproblem

    F (x) = 0. (3.28)

    Consider now a problem such as (2.14). If we want to approximate itssolution by means of a fixed Runge-Kutta method, we must solve, at eachtime step, an equation like (3.28). Assume that one of the three variantsof Newton’s method ( (3.8), (3.18) or (3.19)) is used, for all steps1. In thefollowing, an overview is provided of the results found in literature concerningthe error made by stopping Newton’s iteration, at each time step, after jiterations.

    1In practice, the most utilized among these three methods is (3.19), because it allowsto exploit in different time steps the same factorization of the Jacobian matrix. Indeed,in this iteration procedure it is also possible to choose w0 6= x0.

    37

  • 3.4.1 Liniger’s pioneering work

    One of the first works on this topic is due to Liniger [Liniger, 1971]. Heprovided some particular error estimates for the problem (2.14) in which fis nonlinear, assuming that an implicit linear multistep formula is used toapproximate the solution. With some slight modifications, the same type ofestimates can be obtained when an implicit Runge-Kutta method is used inthe integration steps.

    Consider the problem (2.14) and the general Runge-Kutta method de-scribed by (2.16). Let x = (ξ1, . . . , ξm)

    T and define a function ϕ : Dm → Rsm,ϕ(x) = (ϕ1(x), . . . , ϕm(x))

    T , where

    ϕi(x) = yn + h

    m∑

    k=1

    aikf(tn + ck h, ξk) for i = 1, . . . , m. (3.29)

    At each time step we must solve the system

    x = ϕ(x). (3.30)

    We assume the function f possesses continuous partial derivatives up to thesecond order. Following Liniger’s argument, we analyse the error made bystopping the NR iteration, or a variant thereof (MNR o SNR), after a fixednumber j of iterations.

    We show that, starting from an initial iterate x0 which differs from theexact solution x of (3.30) by O(hq), with a given integer q ≥ 1, the errormade is O(hR(j)) where

    R(j) = (q + 1)2j − 1 for the NR method, (3.31)R(j) = q + j(q + 1) for the MNR method. (3.32)

    Furthermore, assuming that w0 differs from the exact solution x by O(hr),

    with a given integer r such that 1 ≤ r ≤ q, we have

    R(j) = q + j(r + 1) for the SNR method. (3.33)

    The Newton iteration for solving equation (3.30) becomes

    xj+1 = ϕ(xj) + ϕ′(xj)(xj+1 − xj). (3.34)

    The Jacobian matrix of ϕ has the following explicit form

    ϕ′ = h diag(J(ti, ξi))(A⊗ Is). (3.35)

    38

  • In (3.35) ti = tn+ci h, J(t, ξ) is the Jacobian matrix of f with respect to ξ on(t, ξ), Is is the s-dimensional identity matrix, diag(Ui) is the block diagonalmatrix with diagonal blocks Ui for 1 ≤ i ≤ s, and ⊗ denotes the Kroneckerproduct of matrices2.

    If by ej = xj − x we denote the deviation of the j-th approximation fromthe exact solution of (3.30), we obtain

    xj+1 − xj = ej+1 − ej (3.37)

    and, by Taylor’s series expansion of the function ϕ,

    ϕ(xj) = ϕ(x) + ϕ′(x)ej +

    12ϕ′′(x)e2j +O(e

    3j),

    ϕ′(xj) = ϕ′(x) + ϕ′′(x)ej +O(e

    2j).

    (3.38)

    Explicitly evaluating ϕ′′ we get a formula which is the analogue of (3.35)for ϕ′. Always following Liniger’s argument, both quantities are consideredO(h).

    Using relation (3.38) we have

    ϕ(xj)− ϕ′(xj)ej = ϕ(x)−1

    2ϕ′′(x)e2j +O(e

    3j)

    and, thanks to (3.34) and (3.30), the following relation holds:

    ej+1 = ϕ(xj) + ϕ′(xj)ej+1 − ϕ′(xj)ej −xj + ej

    ︸ ︷︷ ︸

    −x=−ϕ(x)

    .

    As a consequence, we can write

    {I − [ϕ′(x) + ϕ′′(x)ej +O(e2j)]}ej+1 = −1

    2ϕ′′(x)e2j +O(e

    3j). (3.39)

    By induction, we show that (3.31) holds.

    If j = 0 then R(j) = q and, therefore, exploiting the hypothesis on x0,the basis step of induction is satisfied.

    2Let B be a p× q matrix and let C be a l × n matrix, then B ⊗ C is a pl × nq blockmatrix which has the following form:

    B ⊗ C =

    b11C . . . b1qC...

    ...bp1C . . . bpqC

    (3.36)

    39

  • Assume now that (3.31) is valid for a fixed j > 0. Note that, approximat-ing ϕ′′ and ϕ′ as O(h), the coefficient matrix of ej+1 is I + O(h). It followsthat, for sufficiently small h, the inverse of this coefficient matrix has theform I +O(h) as well.

    Then by relation (3.39) we obtain that ej+1 = O(hg), where

    g = 2R(j) + 1 = (q + 1)2(j+1) − 1 = R(j + 1),

    i.e., (3.31) is valid for j + 1. In conclusion, (3.31) is valid for j ≥ 0.

    Relations (3.32) and (3.33) can be proved with similar arguments.

    3.4.2 Analogous estimates by Sugiura and Torii

    Estimates, analogous to those in [Liniger, 1971], were derived also by Sugiuraand Torii in [Sugiura & Torii, 1991]. Following notations used up to now,their work can be rewritten as follows.

    Let x and F be as in (3.26) and consider Newton’s method describedby (3.31) with the function F under consideration.In this case, the Jacobian matrix F ′(x) can be written in explicit form as

    F ′(x) = Ism − h diag(J(ti, ξi))(A⊗ Is) (3.40)

    where, as above, ti = tn+ci h, J(t, ξ) is the Jacobian matrix of f with respectto ξ on (t, ξ), Is is the s-dimensional identity matrix, diag(Ui) is the diagonalblock matrix with blocks Ui for 1 ≤ i ≤ s, and ⊗ denotes the Kroneckerproduct of matrices.

    Sugiura and Torii, in their paper, consider the infinity norm3.

    The following theorem, due to Kantorovich, is essential in their analysisof the convergence of Newton’s method.

    3Recall that the infinity norm || · ||∞ of a vector v ∈ Rn is defined as

    ||v||∞ = maxi=1,...,n

    |vi|,

    where vi is the i-th component of the vector.

    40

  • Theorem 4 Let x0 ∈ Rsm and B(r) := {x| ||x − x0|| ≤ r} be a closed ball.Suppose that

    Γ = [F ′(x0)]−1exists; (3.41)

    ||ΓF (x0)|| ≤ η; (3.42)||ΓF ′′(x)|| ≤ K, for x ∈ B(r). (3.43)

    Then, provided

    α = Kη <1

    2and

    1−√

    1− 2αα

    η = r0 ≤ r < r1 =1 +√

    1− 2αα

    η, (3.44)

    equation F (x) = 0 has the unique solution x∗ ∈ B(r) and the Newton processconverges towards it. Furthermore,

    ||x∗ − xj || ≤1

    2j(2α)2

    j η

    α, for j ≥ 0. (3.45)

    In the case considered,

    [F ′(x0)]−1 = (Ism − h diag(J(ti, ξi))(A⊗ Is))−1 h→0−→ Ism (3.46)

    ||ΓF ′′(x)|| ≤ ||Γ|| max||u||=1

    ||h[diag(H(ti, ξi))(A⊗ Is)u](A⊗ Is)||

    ≤ h||Γ|| ||A||2 max1≤i≤s

    ||(H(ti, ξi))|| = O(h) (3.47)

    where H(ti, ξi) is the second derivative of f on (ti, ξi).

    From Theorem 4 and the above remarks, Sugiura and Torii obtain thefollowing result.

    Theorem 5 SupposeF (x0) = O(h

    q) (3.48)

    for some q ≥ 1. Then, for sufficiently small h, the problem F (x) = 0 hasthe unique solution x∗ and

    xj = x∗ +O(hR(j)) with R(j) = 2j(q + 1)− 1. (3.49)

    Proof From relations (3.46) and (3.47) we obtain that (3.41) and (3.43) holdfor sufficiently small h. Furthermore, the asymptotic property (3.46) andhypothesis (3.48) imply that ||ΓF (x0)|| = O(hq) and we can take η = O(hq).Hence, from (3.47), we have α = O(hq+1) and r0 = O(h

    q).

    41

  • As a consequence, conditions (3.42) and (3.44) of Theorem 4 are satisfiedfor sufficiently small h. Therefore F (x) = 0 has the unique solution x∗ and,from (3.45), we get (3.49). �

    Remark For sufficiently small h, Taylor’s series expansion yields

    x0 = x∗ +O(hq) =⇒ F (x0) = O(hq). (3.50)

    We can conclude that it is possible to apply Theorem 5 under the assump-tion that x0 = x

    ∗ +O(hq).

    3.5 Spijker’s criticism on the stiff case

    In 1994, van Dorsselaer and Spijker published a paper on the numerical solu-tion of stiff initial value problems for nonlinear ODE’s with implicit methods[Dorsselaer & Spijker, 1994]. In this paper they emphasised that, for stiffproblems, estimates obtainded by Liniger, such as those obtained by Sugiuraand Torii, are not reliable. Indeed, stiff problems may be characterized byvery large magnitude in the Jacobian matrix of the function h · f(t, ξ) withrespect to ξ. Therefore, replacing the first- and second-order derivatives ofh · f(t, ξ) (with respect to ξ) with O(h) leads to estimates which are relevantonly to the nonstiff case. For stiff problems it is questionable whether theO-constant in (3.49) is still of moderate size or not.

    In the light of such considerations, van Dorsselaer and Spijker make aspecific analysis of the error for stiff problems (only autonomous systemsare considered), particularly focusing on linear multistep methods for theapproximate solution of the problem. The possibility of extending the resultsobtained to the case of Runge-Kutta methods is briefly illustrated in the finalpart of the paper.

    The interaction between Runge-Kutta methods and Newton’s iteration islater resumed by Spijker in [Spijker, 1994]. In this work he generalizes theresults in the case of a generic (not necessarily autonomous) stiff initial valueproblem.

    42

  • 3.5.1 Spijker’s paper

    Spijker considers a Cauchy’s problem like (2.14), where f is nonlinear, theinterval considered is [0, T ] and the problem is assumed to be stiff4. In hispaper, Spijker uses the Euclidean norm (usually denoted by || · ||2) and therelated induced matrix norm.

    Let A be the matrix with elements aij of the particular Runge-Kuttamethod considered. Assume that an m ×m diagonal matrix D exists suchthat both

    D and DA+ ATD are positive definite. (3.51)

    Concerning the function f , Spijker makes the following assumptions:

    J(t, η) = J(t, ξ)[I + e(t, ξ, η)] ∀t ∈ [0, T ] and ξ, η ∈ Dwith ||e(t, ξ, η)||2 ≤ λ(t)||ξ − η||2 and λ(t) of moderate size.(3.52)

    ||Ṽ (τ1)− V (τ1)||2 ≤ ||Ṽ (τ0)− V (τ0)||2 whenever V, Ṽ are any twosolutions of (2.14) on [τ0, τ1] ⊂ [0, T ]. (3.53)

    Note that, if the original problem is dissipative, hypothesis (3.53) is cer-tainly satisfied.

    Let δ be the distance of the set {y(t)|t ∈ [0, T ]} from the boundary of D (ifD = Rs we put δ =∞) and define

    ||y′||sup = supt∈[0,T ]

    ||y′(t)||2;

    ||λ||sup = supt∈[0,T ]

    ||λ(t)||2.

    Let tn ∈ [0, T ), yn ∈ D be given and consider h such that 0 < h ≤ T − tn.Spijker assumes that the initial iterate x0 of Newton’s method (or a vari-

    ant thereof) satisfies||x0 − x∗||2 = O(hq) (3.54)

    with O-constant of moderate size and q ≥ 1.Recall that Sugiura and Torii, assuming (3.54), for the NR process obtain

    the estimation

    ||xj − x∗||2 = O(hR(j)) with R(j) = 2j(q + 1)− 1 for j ≥ 1, (3.55)4For the general concept of stiffness we refer to [Dekker & Verwer,1984]. See

    also [Spijker, 1996] for some useful remarks.

    43

  • which is reliable only in the nonstiff case, as we have already seen in Sec-tion 3.5.

    From the material provided in [Dorsselaer & Spijker, 1994], the next tworesults easily follow.

    Theorem 6 Assume (3.53). Then ∃α > 0 such that F (x) = 0 has a uniquesolution x∗ ∈ Dm, whenever yn and h satisfy

    ||yn − y(tn)||2 ≤ α δ and 0 < h ||y′||sup ≤ α δ. (3.56)

    Theorem 7 Let K > 0 and q > 0 be given. Assume (3.52) and (3.53).Then ∃α > 0 such that, if yn, h, x0, K and q satisfy

    ||yn − y(tn)||2 ≤ α δ, 0 < h ||y′||sup ≤ α δ (3.57)

    and

    ||x0 − x∗||2 ≤ Khq ≤1

    2min{δ, α||λ||sup

    }, (3.58)

    then F (x) = 0 has a unique solution x∗ ∈ Dm. Furthermore, there are uniquexj ∈ Dm, generated by the Newton process with initial iterate x0 ∈ Dm, suchthat

    ||xj − x∗||2 ≤( ||λ||supKhq

    α

    )2j−1

    Khq. (3.59)

    Specifically, Theorem 7 shows how, under the assumptions (3.52) and(3.53), relation (3.54) with O-constant K of moderate size implies

    ||xj − x∗||2 = O(hQ(j)) where Q(j) = 2jq for j ≥ 1. (3.60)

    In (3.60) we have an O-constant

    Kj =

    ( ||λ||supKα

    )2j−1

    K (3.61)

    which is not altered by stiffness in the case in which the original prob-lem (2.14) is stiff. Kj is of moderate size when j is small.

    Orders R(j) and Q(j), analogous to (3.55) and (3.60), can be obtainedfor iterative processes MNR and SNR. Let w0 be an approximation of x

    ∗,which can also be different from x0, such that

    ||w0 − x∗||2 = O(hr), (3.62)

    44

  • with O-constant of moderate size and 0 < r ≤ q. Hence, following Sugiuraand Torii’s argument, i.e., ignoring stiffness, we obtain the estimation

    ||xj − x∗||2 = O(hR(j)) with R(j) = (r + 1)j + q for j ≥ 1. (3.63)

    On the other hand, taking stiffness into account and under assumptions (3.52)and (3.53), we have

    ||xj − x∗||2 = O(hQ(j)) with Q(j) = rj + q for j ≥ 1, (3.64)

    where the O-constant is not affected by stiffness.

    3.5.2 Local and global stopping error

    Consider a fixed Runge-Kutta method. In what follows, assume that:

    • vn is the approximation of y(tn) generated by the theoretical Runge-Kutta method starting with v0 = y0;

    • yn is the approximation of y(tn) obtained, starting with y0, when,at each step of the Runge-Kutta method, the solution to the equa-tion (2.17) is approximated by performing j iterations of Newton’smethod (or a variant thereof).

    In his paper, Spijker determines the order of the global stopping errorDn,j defined by

    Dn,j = vn − yn, (3.65)i.e., the error made by stopping Newton’s iterative process after j iterations.We assume that

    yn+1 = yn +m∑

    k=1

    bkfk, (3.66)

    with vectors fk ∈ Rs obtained by the linear system

    hm∑

    k=1

    aikfk = ξi,j − yn for i = 1, . . . , m. (3.67)

    Here ξ1,j,. . . , ξm,j belong to D and are equal to the subvectors of the j-thiterate xj ∈ Dm, computed during the (n+ 1)-th time step, i.e.,

    ξ1,j...

    ξm,j

    = xj . (3.68)

    45

  • Note that, as a consequence of hypothesis (3.51), matrix A is invertible(see Section 8.2). Therefore unique fk exist satisfying (3.67).

    In the following, we confine ourselves to the situation in which the sim-plified Newton method given by (3.19) is used. Assume that

    w0 = x0 ∈ Dm with subvectors ξi,0 = yn ∈ D for i = 1, . . . , m. (3.69)

    In order to obtain an estimate of the errors Dn,j it is useful to define localstopping errors dn,j as

    dn,j = ṽn+1 − ỹn+1, (3.70)

    where

    • ṽn+1 denotes the approximation of y(tn+1) defined by the Runge-Kuttaformulae (2.16) and (2.17) where yn is replaced by y(tn);

    • ỹn+1 is the approximation of y(tn+1) generated by formulas (3.66),(3.67) and (3.69) in which yn is replaced by y(tn) and j iterationsof (3.19) are performed.

    From relations

    ṽn+1 = y(tn) + h∑

    k

    bkf∗k , h

    k aikf∗k = −y(tn) + ξ∗i , f ∗i = f(tn + cih, ξ∗i ),

    ỹn+1 = y(tn) + h∑

    k

    bkfk, h∑

    k aikfk = −y(tn) + ξi,j,

    we get

    dn,j = h∑

    k

    bk(f∗k − fk), h

    k

    aik(f∗k − fk) = (ξ∗i − ξi,j). (3.71)

    Denoting by (āki) the inverse matrix of A, it follows that

    dn,j =∑

    k

    i

    bkāki(ξ∗i − ξi,j). (3.72)

    Consequently

    ||dn,j||2 ≤ β ||x∗ − xj ||, (3.73)where the constant β depends only on the coefficients bk and aik of theconsidered Runge-Kutta method.

    46

  • The right-hand member of (3.73) can be estimated by using either (3.63)or (3.64), if values for q and r are available.In order to determine these values, denote by Fi(x) the left-hand memberof (2.17) in which yn is replaced by y(tn). Moreover, consider F (x) ∈ Rsm,with subvectors F1(x), . . . , Fm(x) ∈ Rs, and z0 ∈ Dm, with subvectorsy(tn + c1h), . . . , y(tn + cmh) ∈ D.

    Clearly ||F (z0)||2 ≤ γ0 ||y′||suph with a constant γ0 which depends onlyon the matrix A. Results provided in [Dorsselaer & Spijker, 1994] and thislast inequality yields to

    ||x∗ − z0||2 ≤ γ1 ||y′||suph. (3.74)

    Here γ1 depends only on the matrix A and x∗ is the exact solution of

    F (x) = 0.

    Applying (3.69), replacing yn with y(tn), we have

    ||x∗ − w0||2 = ||x∗ − x0||2 ≤ γ ||y′||suph, (3.75)

    where γ = (γ1 +√m) depends only on A. Obviously (3.54) and (3.62) hold

    withq = r = 1 (3.76)

    and O-constant K = γ ||y′||sup which is not affected by stiffness.In view of (3.73) and (3.76) we can apply (3.63) and (3.64) so as to obtain

    ||dn,j||2 = O(hR(j)) with R(j) = 2j + 1 (3.77)

    if stiffness is neglected. Taking stiffness into account, we get

    ||dn,j||2 = O(hQ(j)) with Q(j) = j + 1 (3.78)

    and O-constant not affected by stiffness.

    Spijker then relates the local stopping errors to the global ones. He asso-ciates the theoretical Runge-Kutta process to a function Ψn, and the Runge-Kutta process carried out with j simplified iterations to a function Φn, so asto have

    vn+1 = Ψn(vn, h), ṽn+1 = Ψn(y(tn), h),

    yn+1 = Φn(yn, h), ỹn+1 = Φn(y(tn), h).

    Sincedn,j = Ψn(y(tn), h)− Φn(y(tn), h), (3.79)

    47

  • we have the following relation for the global stopping error:

    Dn+1,j = [Ψn(vn, h)−Ψn(y(tn), h)]+ [Φn(y(tn), h)−Φn(yn, h)]+dn,j. (3.80)

    Let Jn = J(tn, y(tn)). Denote by φ the stability function of the Runge-Kuttamethod, i.e.,

    φ(ζ) = 1 + ζbT (I − ζA)−1u, bT = (b1, . . . , bm), u = (1, . . . , 1)T . (3.81)

    If the Jacobian matrix J(t, ξ) were constant for tn ≤ t ≤ tn+1 and ξ ∈ D, wewould have

    Ψn(vn, h)−Ψn(y(tn), h) = φ(hJn)[vn − y(tn)]Φn(y(tn), h)− Φn(yn, h) = φ(hJn)[y(tn)− yn]

    and therefore (3.80) would reduce to Dn+1,j = φ(hJn)Dn,j + dn,j. Conse-quently, one may expect the vectors D̄n,j, defined by

    D̄n+1,j = φ(hJn)D̄n,j + dn,j for n = 0, . . . , N − 1D̄0,j = 0,

    (3.82)

    to be useful approximations to Dn,j.

    From (3.53) it follows that the logarithmic norm µ(hJn), induced by theEuclidean norm in Rs, is nonpositive5. As a consequence of a Von Neumanntheorem (see [Dekker & Verwer,1984]), we have that, for a