Development of a Massively Parallelized
Micromagnetic Simulator on Graphics Processors
A thesis submitted for the degree of Master of Science
by
Aamir Ahmed Khan
Institute of Nanoelectronics
Technische Universität München
October 14, 2010
Thesis Advisors
Prof. Gyorgy Csaba, University of Notre Dame
Prof. Wolfgang Porod, University of Notre Dame
Prof. Paolo Lugli, Technische Universität München
Dedicated to
my beloved parents
without their love and support, nothing could have been possible.
Acknowledgements
I want to deeply thank my advisor, Prof. Gyorgy Csaba, who offered me the opportunity to do this thesis under his supervision. He was very kind to me throughout the thesis and gave me his valuable advice and mentoring. Most of the ideas in this thesis are due to his guidance and suggestions. I also want to thank Prof. Wolfgang Porod, who appointed me as a visiting researcher in his group at the University of Notre Dame and provided financial and logistical support for this work. I am also highly obliged to Prof. Paolo Lugli, who served as my official advisor at Technische Universität München and took keen interest in my research.
I also cannot forget to mention Heidi Deethardt, Edit Varga, Mirakmal Niyazmatov
and Anand Kumar for helping me throughout my stay at Notre Dame.
Abstract
This thesis describes the development of a high-performance micromagnetic simulator running on massively parallel Graphics Processing Units (GPUs). We show how all components of the simulator can be implemented from scratch. Special attention is given to the calculation of the magnetic field, which traditionally takes O(N²) time and is also the most computationally intensive part of the simulator; we show its GPU implementation as well. We have used two different algorithms for the field calculation: (1) the traditional direct summation and (2) the fast multipole method (FMM). Direct summation is better suited for small problems, while large problems are best handled by the FMM due to its O(N log N) complexity. When compared to a single-threaded CPU implementation, performance benchmarks show a maximum speed-up in excess of 30x while running on the GPU.
Contents

1 Introduction
1.1 Background
1.2 Motivation
1.3 Overview of the Work

2 Fundamentals of Micromagnetic Simulations
2.1 Magnetization
2.2 Landau-Lifshitz-Gilbert Equation
2.3 Effective Field
2.3.1 Exchange Field
2.3.2 External Field
2.3.3 Demagnetization Field
2.4 Components of a Micromagnetic Simulator

3 Magnetostatic Field Calculation
3.1 Field from Magnetization
3.2 Field from Magnetic Scalar Potential
3.2.1 Magnetic Charges
3.2.2 Magnetic Scalar Potential
3.2.3 Magnetostatic Field
3.3 Complexity of Field Calculation

4 Fast Multipole Method
4.1 Mathematical Description
4.1.1 Harmonic functions
4.1.2 Multipole Series for a Single Charge
4.1.3 Multipole Series for a Charge Distribution
4.1.4 Approximation of Multipole Series
4.1.5 Spherical Harmonics
4.2 Algorithmic Description
4.2.1 Hierarchical Subdivisions
4.2.2 Clustering of Charges
4.2.3 Far-field Potential
4.2.4 Recursion
4.2.5 Near-field Potential
4.3 Pseudocode
4.3.1 Recursive
4.3.2 Iterative
4.4 Complexity Analysis
4.5 Implementation
4.5.1 Data Structures
4.5.2 Custom Data Types
4.5.3 Mathematical Functions
4.5.4 Extension of 2D Algorithm to 3D
4.6 Accuracy
4.7 Performance

5 Landau-Lifshitz-Gilbert Equation Solver
5.1 Euler's Solver
5.2 4th Order Runge-Kutta Solver
5.2.1 Overview
5.2.2 Mathematical Description
5.3 Adaptive Time-stepping
5.3.1 Overview of Algorithm
5.3.2 Mathematical Description

6 Simulation Examples
6.1 Ordering of Nanowires
6.2 Phase Transition of Nanomagnets

7 Parallelization on GPU
7.1 Parallelization of the LLG Solver
7.2 Parallelization of the Direct Summation Method
7.3 Parallelization of the FMM
7.4 Performance Benchmarks of the Parallelized Code

8 Conclusion and Future Work
List of Figures

1.1 A nanomagnetic logic majority gate.
1.2 NVIDIA Tesla GPU.
2.1 Two different magnetization distributions.
2.2 Landau-Lifshitz-Gilbert Equation.
2.3 Flow chart of a micromagnetic simulation.
3.1 Summation of magnetic flux density from individual magnetizations.
3.2 Summation of magnetostatic field from magnetization of a bar magnet.
3.3 Magnetostatic field of a bar magnet through magnetic scalar potential.
3.4 Magnetostatic field of a U-shaped magnet through magnetic scalar potential.
4.1 Single charge and single potential target.
4.2 Multipole coefficients for charge distribution.
4.3 Truncation error of multipole expansion.
4.4 Subdivision of the 2D computational domain in the FMM algorithm.
4.5 FMM tree structure.
4.6 FMM hierarchical indexing based on tree structure.
4.7 Extraction of spatial coordinates with hierarchical indexing.
4.8 2D FMM for 3D structures.
4.9 Accuracy of FMM compared with direct summation of potential.
4.10 Running time for FMM versus direct summation.
5.1 Euler's method for solving an IVP.
5.2 4th order Runge-Kutta method.
5.3 Time-evolution of torque and time-step.
5.4 Time-evolution of energy (Edemag + Eexch).
6.1 Ordering of nanowires.
6.2 Phase transition between single-domain and vortex state.
7.1 Speed-up of LLG solver on GPU.
7.2 Parallel reduction algorithm.
7.3 Speed-up of direct summation method on GPU.
7.4 Speed-up of FMM on GPU.
7.5 Running times of direct summation method and FMM on GPU.
Chapter 1
Introduction
1.1 Background
Nanomagnetism is a flourishing research field: it has generated interest in both the scientific and the engineering communities [1]. Nanomagnets show strong size-dependent behaviors and ordering phenomena, which are interesting for basic physics. Introducing nanomagnets into silicon chips (Fig. 1.1) may revolutionize microelectronics [2] [3].
Figure 1.1: A nanomagnetic logic majority gate (magnetic force microscopy image of majority gate operation) [2].
Ferromagnetic behavior of large magnets is usually described by hysteresis, which is the signature of the average behavior of a large number of magnetic domains. If magnets are made smaller and their size becomes comparable to the natural domain size (typically this happens when the magnets are around a few hundred nanometers in size), then the domain patterns must be calculated exactly in order to understand the magnetic behavior. This can be done by applying micromagnetic theory.
The central task of micromagnetics is to calculate the dynamics and the steady states of the M(r, t) magnetization distribution. The dynamics itself is described by the Landau-Lifshitz equation, and this equation contains all magnetic fields generated by the sample, the external field, and magnetic fields that account for quantum-mechanical phenomena. Micromagnetic theory ignores quantum mechanics, but it has proven to work very well for samples larger than a few nanometers in size.
Micromagnetic calculations are very time consuming. A magnetic sample can be several micrometers in size, but to capture the details of the domain patterns, it has to be discretized into very small (few-nm) mesh elements. The Landau-Lifshitz equation is a strongly nonlinear, time-dependent partial differential equation (PDE), and to solve it, one has to recalculate the distribution of all magnetic fields at each iteration step. On a desktop PC one can run simulations of few-micron-size magnets. Simulating structures a few tens of microns in size is possible, but it takes weeks and is practically unusable for any engineering purpose. Using supercomputers does not accelerate the calculations significantly, since the calculation is difficult to decompose into independent domains for the solution.
1.2 Motivation
General-purpose processing on Graphics Processing Units (GPGPU) is a recent development in computing. Graphics processors (GPUs) have been around for a long time, but were used almost exclusively to speed up graphics rendering in demanding computer games. The kinds of instructions they perform are quite limited, but a large number of processors work on the same task (i.e. rendering a particular part of the image). In many senses, rendering is very similar to the basic, repetitive computations that are required in most scientific computing. Graphics card manufacturers (among them NVIDIA, probably the first) realized the potential of their hardware for scientific purposes and developed tools to ease general purpose programming on their hardware [4].
GPU-accelerated computing is a flourishing area of scientific computing. Part of its success is the excellent price/performance ratio of GPU chips: for around $2000, one can buy 512 computing cores, which is an orders-of-magnitude better deal than multi-core CPUs. The cores can communicate with each other through extremely high-speed internal channels, which makes it possible to parallelize applications with a lot of communication between computational domains.

Figure 1.2: NVIDIA Tesla GPU.
The above arguments make clear the motivation to develop a GPU-accelerated micromagnetic simulator. This thesis describes our effort to do that.
1.3 Overview of the Work
We decided to build our simulator from scratch, since it was difficult to understand and parallelize a code that was not developed in-house. The next two chapters of this thesis
describe the micromagnetic background required to build this simulator.
The most time consuming part of a micromagnetic calculation is to determine the magnetic field (or magnetic scalar potential) that a sample generates. Most of this thesis is devoted to that problem. First we show a direct, brute-force approach for the calculation, and then a very complicated, state-of-the-art method, the fast multipole method (FMM), in chapter 4.
The dynamics of a micromagnetic sample is governed by the Landau-Lifshitz equations. We describe how to build a solver for a system of millions of these differential equations in chapter 5.
We then show sample problems that demonstrate our code working correctly (chapter
6). We parallelize the code for both the direct method and the FMM and benchmark the
speed-ups (chapter 7). We demonstrate that for simple problems, the direct method is quite efficient, since it is easy to parallelize and carries almost no overhead. For larger problems (mesh elements > 10⁵), the FMM-based method overtakes the direct calculation.
Only very recently has one other group demonstrated results on GPU-accelerated solvers based on similar principles [5]. We believe that our work is one of the first attempts to
realize a high-performance, GPU-based complete micromagnetic simulator. Although the
code is still unoptimized, the potential of a scalable, massively parallelized code is already
apparent. We hope it will become a design tool for future nanomagnetic computing devices.
Chapter 2
Fundamentals of Micromagnetic
Simulations
2.1 Magnetization
We discretize the 3-dimensional magnetic material into a computational mesh of tiny mesh elements with dimensions ∆x, ∆y, ∆z. Each mesh element at position r is represented by a magnetic moment called the magnetization M(r). The moments originate from the electron spins of the respective atoms in the material, and their collective behaviour is what makes the material ferromagnetic, diamagnetic or paramagnetic. All of these moments have the same magnitude, termed the saturation magnetization (Ms) of the material, and differ only in their orientation (Fig. 2.1). In micromagnetics, we are interested in the steady-state as well as the transient behavior of the individual moments in response to different magnetic fields.
Figure 2.1: Two different magnetization distributions.
2.2 Landau-Lifshitz-Gilbert Equation
The equation that governs the dynamics of the magnetization distribution in a magnetic material is due to Lev Landau, Evgeny Lifshitz and T. L. Gilbert, and is called the Landau-Lifshitz-Gilbert (LLG) equation.
∂M(r, t)/∂t = −γ M(r, t) × Heff(r, t) − (αγ/Ms) M(r, t) × ( M(r, t) × Heff(r, t) )    (2.1)
Here γ is the electron gyromagnetic ratio, α is the damping constant and Heff is the
effective magnetic field acting on the magnetization. The effect of the LLG equation is to align or anti-align the magnetization vector with the effective field by going through a precessional motion, as shown in Fig. 2.2.
Figure 2.2: Landau-Lifshitz-Gilbert Equation.
2.3 Effective Field
The goal of a micromagnetic simulation is to compute the time-evolution of an M(r, t) magnetization distribution by solving the Landau-Lifshitz equation. The change in the magnetization is due to the M × Heff torque generated by the effective magnetic field, which is described as
Heff = Hexch + Hdemag + Hext + Htherm + ... (2.2)
Any other physical phenomenon can be incorporated into the simulation by adding the respective magnetic field component to the effective field. Calculation of this effective field is the most essential component of any micromagnetic simulation. In the following sections, the most important and frequently used components of the effective field are described.
2.3.1 Exchange Field
The exchange field represents the quantum-mechanical forces among the magnetization vectors in a quasi-classical way. It is responsible for aligning the magnetization vectors with each other and is defined as [1]
Hexch(r) = (2Aexch/(µ0Ms²)) ∇²M(r),    (2.3)
where µ0 is the magnetic permeability of free space, Aexch is a material parameter called the exchange stiffness, and ∇² is the vector Laplacian operator. In the case of a cubic discretization of the computational mesh (∆x = ∆y = ∆z), it can also be calculated as the sum of the neighboring magnetizations.¹
Hexch(r) = (2Aexch/(µ0Ms²)) (1/∆x²) Σ_{ri ∈ neighbors} M(ri)    (2.4)
In both equations, however, the calculation of the exchange interaction remains local, O(N), since it only depends upon the magnetizations in the immediate neighborhood.
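The neighbor sum of Eq. (2.4) is easy to sketch on the CPU; the following NumPy fragment is only an illustration (the thesis code runs on the GPU), and the array layout and the permalloy-like parameter values are our own assumptions.

```python
import numpy as np

MU0 = 4e-7 * np.pi  # magnetic permeability of free space

def exchange_field(M, A_exch, Ms, dx):
    """Eq. (2.4) on a cubic mesh: prefactor times the sum of the six
    nearest-neighbor magnetizations. M has shape (nx, ny, nz, 3);
    cells beyond the array are treated as empty (M = 0)."""
    Mp = np.pad(M, ((1, 1), (1, 1), (1, 1), (0, 0)))  # zero boundary layer
    s = (Mp[2:, 1:-1, 1:-1] + Mp[:-2, 1:-1, 1:-1]     # +-x neighbors
         + Mp[1:-1, 2:, 1:-1] + Mp[1:-1, :-2, 1:-1]   # +-y neighbors
         + Mp[1:-1, 1:-1, 2:] + Mp[1:-1, 1:-1, :-2])  # +-z neighbors
    return 2.0 * A_exch / (MU0 * Ms**2 * dx**2) * s

# uniformly magnetized 3x3x3 block: the center cell sees six equal neighbors
Ms, A, dx = 8e5, 13e-12, 5e-9   # hypothetical permalloy-like values, 5 nm mesh
M = np.zeros((3, 3, 3, 3))
M[..., 0] = Ms
H_exch = exchange_field(M, A, Ms, dx)
```

Since the sum is purely local, the cost is O(N), in line with the discussion above.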
2.3.2 External Field
This component Hext represents any external magnetic field to which the magnetization distribution is exposed. It is just an additive component to the effective field and is not a function of any neighboring magnetization vectors; thus it also scales with O(N).

¹ The exchange field is not uniquely defined: for example, one can add any vector parallel to M without changing the torque in Eq. (2.1).
2.3.3 Demagnetization Field
The demagnetization field, also called the magnetostatic field, describes the long-range magnetic field created by the M(r) distribution. The evaluation of the demagnetization field is the most time consuming part of the calculation, since it represents all the pairwise interactions between magnetization vectors and requires O(N²) steps. As already discussed, all the other steps are local and scale with O(N). The calculation of the demagnetization field is so involved that it is treated in a separate chapter (Ch. 3).
2.4 Components of a Micromagnetic Simulator
There are a number of publications describing the construction of a micromagnetic simulator [1]. Fig. 2.3 summarizes the main steps of the simulation flow.
We start with the magnetization state M(r, t) and use it to calculate the demagnetization and exchange fields, which are then added to any external field and other components to get the effective field Heff. The effective field is then used to calculate the M × Heff torque on each magnetization vector and the new magnetization state M(r, t + ∆t) according to the LLG equation. These steps are repeated for each time-step for as long as required, or until there is no longer sufficient dynamic activity in the magnetization state.
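The loop just described can be sketched as follows; this is a schematic version only, with the field routines passed in as callables, a plain Euler update standing in for the 4th-order Runge-Kutta step of Fig. 2.3, and hypothetical parameter values.

```python
import numpy as np

def llg_rhs(M, Heff, gamma, alpha, Ms):
    """Right-hand side of the LLG equation, Eq. (2.1)."""
    MxH = np.cross(M, Heff)
    return -gamma * MxH - (alpha * gamma / Ms) * np.cross(M, MxH)

def simulate(M, H_demag, H_exch, Hext, dt, steps,
             gamma=2.21e5, alpha=0.5, Ms=8e5):
    """One pass of the flow in Fig. 2.3: effective field, torque,
    time-step, repeat. H_demag and H_exch map a magnetization state
    to the corresponding field array."""
    for _ in range(steps):
        Heff = H_demag(M) + H_exch(M) + Hext               # Eq. (2.2)
        M = M + dt * llg_rhs(M, Heff, gamma, alpha, Ms)    # Euler step
        M = M * (Ms / np.linalg.norm(M, axis=-1, keepdims=True))  # |M| = Ms
    return M

# single moment relaxing toward a +z external field (demag/exchange off)
zero = lambda M: np.zeros_like(M)
M0 = np.array([[8e5, 0.0, 0.0]])
M_final = simulate(M0, zero, zero, np.array([0.0, 0.0, 1e5]),
                   dt=1e-12, steps=20000)
```

With damping on, the moment precesses around the field while spiraling into alignment with it, which is the behavior sketched in Fig. 2.2.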
Figure 2.3: Flow chart of a micromagnetic simulation.
Chapter 3
Magnetostatic Field Calculation
Calculation of the effective field is the most essential component of a micromagnetic simulation, and the component of the effective field (Eq. (2.2)) which is the most time consuming to calculate is the magnetostatic field, also called the demagnetizing field. This is because this field describes the long-range magnetostatic field created by the magnetization distribution, and it has to be calculated by considering all pairwise interactions among all the mesh points in the computational domain.
Our goal is to calculate magnetostatic field, Hdemag(r), from the given magnetization
distribution, M(r). The magnetization is given on a 3-dimensional computational mesh
for each discretized mesh element of dimensions ∆x,∆y,∆z.
In this chapter, we describe two schemes for the calculation of the magnetostatic field.
3.1 Field from Magnetization
The field is calculated in two steps. First we calculate the magnetic flux density (B) from
the individual magnetization vectors by summing them up over the entire volume.
B(r) = (µ0/4π) Σ_{ri ∈ volume} [ 3{M(ri) · (r − ri)}(r − ri)/|r − ri|⁵ − M(ri)/|r − ri|³ ] ∆x∆y∆z    (3.1)
From the flux density, the magnetostatic field can be calculated using the straightforward relation.
B = µ0(Hdemag + M) (3.2)
Hdemag = B/µ0 − M    (3.3)
Fig. 3.1 and Fig. 3.2 show two examples of calculating the field using this approach.
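The summation of Eq. (3.1) and the conversion of Eq. (3.3) can be sketched directly in NumPy; the single-cell source in the demo and the parameter values are illustrative assumptions, and the thesis evaluates this sum on the GPU.

```python
import numpy as np

MU0 = 4e-7 * np.pi

def flux_density(r, positions, M, dV):
    """Eq. (3.1): B at observation point r, summed from magnetization
    vectors M[i] at positions[i], each filling a cell of volume dV."""
    d = r - positions                          # separation vectors (N, 3)
    dist = np.linalg.norm(d, axis=1)
    keep = dist > 0                            # drop the singular self-term
    d, dist, Msrc = d[keep], dist[keep], M[keep]
    mdotd = np.sum(Msrc * d, axis=1)
    dip = 3 * mdotd[:, None] * d / dist[:, None]**5 - Msrc / dist[:, None]**3
    return MU0 / (4 * np.pi) * dip.sum(axis=0) * dV

def demag_field(r, positions, M, dV, M_at_r=np.zeros(3)):
    """Eq. (3.3): Hdemag = B/mu0 - M, with M_at_r the magnetization of
    the observation cell (zero outside the magnet)."""
    return flux_density(r, positions, M, dV) / MU0 - M_at_r

# one magnetized cell at the origin, observed 50 nm away along the moment
dV = (5e-9)**3
pos = np.array([[0.0, 0.0, 0.0]])
Mv = np.array([[8e5, 0.0, 0.0]])
B = flux_density(np.array([50e-9, 0.0, 0.0]), pos, Mv, dV)
H = demag_field(np.array([50e-9, 0.0, 0.0]), pos, Mv, dV)
```

On the dipole axis this reduces to the textbook B = (µ0/4π)·2m/d³ with m = Ms·dV, which makes a convenient sanity check.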
3.2 Field from Magnetic Scalar Potential
There is another method for the calculation of the magnetostatic field, which requires the calculation of the magnetic scalar potential first. This method is longer but has certain advantages, which will be discussed in section 3.3. The method is outlined in Fig. 3.3 and 3.4 and is discussed below.
3.2.1 Magnetic Charges
The calculation of the magnetostatic field starts by calculating the magnetic volume charge density over the entire mesh using the divergence operation [6].
ρ(r) = −∇ ·M(r) (3.4)
= −( ∂Mx/∂x + ∂My/∂y + ∂Mz/∂z )    (3.5)
The finite difference counterpart of this operation is given as
ρ(r) = − [Mx(x + ∆x, y, z) − Mx(x − ∆x, y, z)] / (2∆x)
       − [My(x, y + ∆y, z) − My(x, y − ∆y, z)] / (2∆y)
       − [Mz(x, y, z + ∆z) − Mz(x, y, z − ∆z)] / (2∆z)    (3.6)
For magnetic field calculations, one generally has to take into account not only volume
charges but also surface charges, which appear at the boundaries of magnetic bodies and
where Eq. (3.6) cannot be interpreted due to the discontinuity of the magnetization. To
keep the algorithm simple, we do not explicitly consider surface charges, but instead we
interpret Eq. (3.6) over the entire volume. This 'smears' an equivalent volume charge onto the cells adjacent to the surface. There is some error in the magnetic field at the boundaries of magnetized regions, but this error becomes negligible on the neighboring mesh points.

Figure 3.1: Summation of magnetic flux density from individual magnetizations.

Figure 3.2: Summation of the magnetostatic field of a bar magnet from individual magnetizations. (a) Magnetization distribution M. (b) Flux density B due to a single magnetization vector. (c) Magnetostatic field Hdemag.
3.2.2 Magnetic Scalar Potential
From the magnetic charges, we evaluate the magnetic scalar potential at each mesh point
using the following expression, analogous to the Coulombic and Newtonian 1/r potential.

Φ(r) = (1/4π) Σ_{ri ∈ volume} ρ(ri)/|r − ri| · ∆x∆y∆z    (3.7)
3.2.3 Magnetostatic Field
Once we have the potential, we can calculate the field as its negative gradient.
Hdemag(r) = −∇Φ(r) (3.8)
= [ −∂Φ/∂x, −∂Φ/∂y, −∂Φ/∂z ]    (3.9)
The finite difference counterpart of this operation is given as
Hdemag,x(r) = − [Φ(x + ∆x, y, z) − Φ(x − ∆x, y, z)] / (2∆x)
Hdemag,y(r) = − [Φ(x, y + ∆y, z) − Φ(x, y − ∆y, z)] / (2∆y)    (3.10)
Hdemag,z(r) = − [Φ(x, y, z + ∆z) − Φ(x, y, z − ∆z)] / (2∆z)
The finite difference calculation of divergence (Eq. (3.6)) and gradient (Eq. (3.10))
requires that there be at least one padding layer around the magnetization distribution
(i.e. cells where M = 0). This does not pose any serious problem however.
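The whole route of Eqs. (3.4)-(3.10) fits in a short sketch. Note that np.gradient reproduces the central differences of Eqs. (3.6) and (3.10) in the interior but falls back to one-sided differences at the outer mesh faces, a minor deviation from the text; the 7×7×7 test magnet is our own assumption.

```python
import numpy as np

def demag_field_via_potential(M, dx):
    """Hdemag from magnetic charges: divergence (Eq. 3.6), an O(N^2)
    1/r summation for the scalar potential (Eq. 3.7), then a negative
    gradient (Eq. 3.10). M has shape (nx, ny, nz, 3) with at least one
    padding layer of empty (M = 0) cells; cubic mesh of spacing dx."""
    rho = -(np.gradient(M[..., 0], dx, axis=0)
            + np.gradient(M[..., 1], dx, axis=1)
            + np.gradient(M[..., 2], dx, axis=2))          # Eq. (3.6)
    cells = np.indices(rho.shape).reshape(3, -1).T * dx    # cell centres
    q = rho.ravel() * dx**3                                # cell charges
    phi = np.empty(len(q))
    for k in range(len(q)):                                # Eq. (3.7)
        d = np.linalg.norm(cells - cells[k], axis=1)
        d[k] = np.inf                                      # skip self-term
        phi[k] = np.sum(q / d) / (4 * np.pi)
    phi = phi.reshape(rho.shape)
    return np.stack([-np.gradient(phi, dx, axis=a) for a in range(3)],
                    axis=-1)                               # Eq. (3.10)

# 3x3x3 block magnetized along +x inside a 7x7x7 mesh of 5 nm cells
Ms, dx = 8e5, 5e-9
M = np.zeros((7, 7, 7, 3))
M[2:5, 2:5, 2:5, 0] = Ms
H = demag_field_via_potential(M, dx)
```

At the center of the block the computed field points against M, as a demagnetizing field must, and its magnitude is of the order of the Ms/3 expected for a cube.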
Figure 3.3: Magnetostatic field of a bar magnet through the magnetic scalar potential. (a) Magnetization distribution M. (b) Magnetic volume charge density ρ. (c) Magnetic scalar potential Φ. (d) Magnetostatic field Hdemag.
Figure 3.4: Magnetostatic field of a U-shaped magnet through the magnetic scalar potential. (a) Magnetization distribution M. (b) Magnetic volume charge density ρ. (c) Magnetic scalar potential Φ. (d) Magnetostatic field Hdemag.
3.3 Complexity of Field Calculation
Clearly, the most time consuming step of the field calculation is the evaluation of Eq. (3.1) for the magnetization method, or of Eq. (3.7) for the potential method. It represents all the pairwise interactions between magnetization vectors and requires O(N²) steps. Apart from it, all the other calculations are local and scale with O(N).
A complexity of O(N²) is prohibitively large for big problems. We definitely need a fast method for this step to keep the simulation running time within reasonable limits for any appreciable problem size. The advantage of calculating the field through the longer route of the scalar potential is that potential theory is a well-established branch of mathematical physics. Several fast methods of calculating pairwise interactions have been devised, such as the Fast Fourier Transform (FFT) and the Fast Multipole Method (FMM). These fast methods reduce the complexity from O(N²) to O(N log N).
We preferred the FMM over the FFT because it can further reduce the complexity to O(N). Also, unlike the FFT, the FMM does not force us to use periodic boundary conditions. Periodic boundary conditions are generally undesirable because they necessitate the use of large padding areas around the physical structure to be simulated. Although the FMM is an approximate method of potential calculation, this does not hurt us: we are doing numerical simulations anyway and do not require more accuracy than what is afforded by floating point standards. The FMM is described in detail in chapter 4.
Chapter 4
Fast Multipole Method
The fast multipole method (FMM) [7] is a powerful mathematical technique to efficiently calculate pairwise interactions in physical systems, matrix-vector multiplications and summations [8]. It can do so in O(N log N) or, even better, O(N) time instead of the traditional O(N²), while sacrificing some accuracy. It is regarded as among the top-10 algorithms of the 20th century [9] and has already been exploited for micromagnetic simulations [10] [5].
4.1 Mathematical Description
In the context of potential theory, the FMM makes it possible to efficiently perform the summation of Eq. (3.7). The underlying idea is that, instead of evaluating all pairwise interactions, which would require O(N²) steps, the charges can be lumped together, and the far field of this cluster at any far-lying point in space can be approximated with some acceptable accuracy, regardless of the distribution of the charges within the cluster.
4.1.1 Harmonic functions
The FMM is applicable to a particular class of functions called harmonic functions. These are the solutions of the Laplace equation

∇²Φ = ∂²Φ/∂x² + ∂²Φ/∂y² + ∂²Φ/∂z² = 0    (4.1)
For Coulombic and Newtonian potentials, the corresponding harmonic functions that satisfy Eq. (4.1) are of the form

Φ = q/√(x² + y² + z²)    (4.2)
4.1.2 Multipole Series for a Single Charge
Figure 4.1: Single charge and single potential target.
To derive a series expansion for Eq. (4.2), we have to resort to spherical coordinates.
Consider a charge q0 at point r0(r0, θ0, φ0). The potential from this charge at any point
r(r, θ, φ) is given by
Φ(r) = q0/|r − r0|    (4.3)

Φ(r) = q0/√(r² + r0² − 2rr0 cos γ)    (4.4)
We are only interested in a series expansion in the region |r| > |r0|, so we can factor out r in the denominator:

Φ(r) = q0 / [ r √(1 + (r0/r)² − 2(r0/r) cos γ) ],    (4.5)
where γ is the angle between r and r0. The reciprocal square root in the above equation expands as an infinite series in r0/r with the Legendre functions, Pl(cos γ), as its coefficients [11].

Φ(r) = (q0/r) Σ_{l=0}^{∞} Pl(cos γ) (r0/r)^l    (4.6)

Φ(r) = q0 Σ_{l=0}^{∞} Pl(cos γ) r0^l / r^(l+1)    (4.7)
The above equation contains γ which depends upon both r and r0. Our goal is to factor
out these dependencies. Towards that end, we need a result from classical mathematics,
called the addition theorem for spherical harmonics [11].
Pl(cos γ) = Σ_{m=−l}^{l} Yl^m(θ, φ) Yl^m*(θ0, φ0),    (4.8)
where Yl^m(θ, φ) are the spherical harmonics described in section 4.1.5. Eq. (4.7) then takes the form of a multipole series.
Φ(r) = q0 Σ_{l=0}^{∞} Σ_{m=−l}^{l} Yl^m(θ, φ) Yl^m*(θ0, φ0) r0^l / r^(l+1)    (4.9)

Φ(r) = Σ_{l=0}^{∞} Σ_{m=−l}^{l} (Ml^m / r^(l+1)) Yl^m(θ, φ),    (4.10)

where Ml^m are the multipole coefficients defined as

Ml^m = q0 r0^l Yl^m*(θ0, φ0)    (4.11)
Eq. (4.10), the multipole series, now has separate factors depending upon r(r, θ, φ) and r0(r0, θ0, φ0), which is the mathematical basis for the fast multipole method.
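The series of Eq. (4.7) is easy to verify numerically against the closed form of Eq. (4.4); the numbers in the demo (q0 = 1, r0 = 0.5, r = 2, cos γ = 0.3) are arbitrary choices satisfying r > r0.

```python
import math

def legendre(l, x):
    """P_l(x) by the recurrence (n+1) P_{n+1} = (2n+1) x P_n - n P_{n-1}."""
    if l == 0:
        return 1.0
    p_prev, p = 1.0, x
    for n in range(1, l):
        p_prev, p = p, ((2 * n + 1) * x * p - n * p_prev) / (n + 1)
    return p

def truncated_series(q0, r0, r, cos_gamma, P):
    """Eq. (4.7) truncated at l = P."""
    return q0 * sum(legendre(l, cos_gamma) * r0**l / r**(l + 1)
                    for l in range(P + 1))

exact = 1.0 / math.sqrt(2.0**2 + 0.5**2 - 2 * 2.0 * 0.5 * 0.3)  # Eq. (4.4)
approx = truncated_series(1.0, 0.5, 2.0, 0.3, P=10)
```

With r0/r = 0.25, ten terms already agree with the closed form to better than one part in a million; the truncation error is quantified in section 4.1.4.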
4.1.3 Multipole Series for a Charge Distribution
The multipole coefficients can further be extended to represent a distribution of charges instead of just a single charge. Suppose we have K charges distributed inside a sphere of radius ρ, as shown in Fig. 4.2. Then Eq. (4.10) still represents the potential at any point in the region |r| > ρ, but with the general multipole coefficients
Ml^m = Σ_{i=1}^{K} qi ri^l Yl^m*(θi, φi)    (4.12)
Figure 4.2: Multipole coefficients for charge distribution.
4.1.4 Approximation of Multipole Series
For the multipole coefficients described by Eq. (4.12) and the multipole series of Eq. (4.10), we can truncate the series at l = P, for an error bound of

| Σ_{l=0}^{∞} Σ_{m=−l}^{l} (Ml^m/r^(l+1)) Yl^m(θ, φ) − Σ_{l=0}^{P} Σ_{m=−l}^{l} (Ml^m/r^(l+1)) Yl^m(θ, φ) | = ε ≤ ( Σ_{i=1}^{K} |qi| / (r − ρ) ) (ρ/r)^(P+1)    (4.13)
It can be clearly seen that as we move to points farther away from the charge distribution
(r > ρ), the error falls rapidly. The rate of error decay with respect to P is given by [7]

dε/dP ≈ (√3)^(−P)    (4.14)
Thus the error can be made arbitrarily small at the expense of more computations,
but usually P = 3 gives sufficient accuracy for micromagnetic calculations [10] as shown
in Fig. 4.3.
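The bound of Eq. (4.13) can be exercised numerically. The sketch below sums the truncated expansion of Eq. (4.7) over a random cluster (by the addition theorem this equals the truncated multipole series); the cluster size, random seed and observation point are arbitrary assumptions.

```python
import numpy as np
from numpy.polynomial.legendre import legval

rng = np.random.default_rng(1)
K, rho = 50, 1.0                               # K charges inside radius rho
dirs = rng.normal(size=(K, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
pos = dirs * rho * rng.random(K)[:, None] ** (1 / 3)
q = rng.uniform(-1.0, 1.0, K)

r_obs = np.array([3.0, 1.0, 0.5])              # observation point, |r| > rho
r = np.linalg.norm(r_obs)
exact = np.sum(q / np.linalg.norm(r_obs - pos, axis=1))

ri = np.linalg.norm(pos, axis=1)
cos_g = (pos @ r_obs) / (ri * r)               # cos(gamma_i) per charge

def phi_truncated(P):
    """Per-charge Eq. (4.7), truncated at l = P."""
    return sum(np.sum(q * legval(cos_g, [0] * l + [1]) * ri**l) / r**(l + 1)
               for l in range(P + 1))

errors = [abs(phi_truncated(P) - exact) for P in range(8)]
bounds = [np.sum(np.abs(q)) / (r - rho) * (rho / r)**(P + 1) for P in range(8)]
```

Every measured error sits below the analytic bound, and the bound itself shrinks by a factor of ρ/r with each additional order P.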
4.1.5 Spherical Harmonics
Spherical harmonics are functions of two variables, θ and φ, on the surface of a sphere, and are defined in terms of the associated Legendre functions [11].
Yl^m(θ, φ) = √[(l − |m|)!/(l + |m|)!] Pl^|m|(cos θ) e^(jmφ)    (4.15)
Figure 4.3: Truncation error of multipole expansion. P = 3 gives sufficient accuracy.

The associated Legendre functions, Pl^m(x), in turn are defined in terms of the Legendre functions, Pl(x), by the Rodrigues' formula [11].
Pl^m(x) = (−1)^m (1 − x²)^(m/2) (d^m/dx^m) Pl(x)    (4.16)

Pl(x) = (1/(2^l l!)) (d^l/dx^l) (x² − 1)^l    (4.17)
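Eqs. (4.10), (4.12) and (4.15) together give a complete, if naive, far-field evaluation. In the sketch below the associated Legendre functions are built by the standard recurrence rather than the derivative form of Eq. (4.16) (the (−1)^m factor is included), and the three-charge configuration is a made-up example.

```python
import cmath
import math

def assoc_legendre(l, m, x):
    """P_l^m(x) for m >= 0, including the (-1)^m factor of Eq. (4.16),
    via P_m^m = (-1)^m (2m-1)!! (1-x^2)^(m/2) and upward recurrence in l."""
    pmm = (-1.0)**m * math.prod(range(1, 2 * m, 2)) * (1 - x * x)**(m / 2)
    if l == m:
        return pmm
    pm1 = x * (2 * m + 1) * pmm                # P_{m+1}^m
    for n in range(m + 2, l + 1):
        pmm, pm1 = pm1, ((2 * n - 1) * x * pm1 - (n + m - 1) * pmm) / (n - m)
    return pm1

def Y(l, m, theta, phi):
    """Spherical harmonic in the normalization of Eq. (4.15)."""
    am = abs(m)
    norm = math.sqrt(math.factorial(l - am) / math.factorial(l + am))
    return norm * assoc_legendre(l, am, math.cos(theta)) * cmath.exp(1j * m * phi)

def spherical(v):
    r = math.sqrt(v[0]**2 + v[1]**2 + v[2]**2)
    return r, math.acos(v[2] / r), math.atan2(v[1], v[0])

def multipole_coeffs(q, pos, P):
    """M_l^m of Eq. (4.12) for charges q[i] at cartesian points pos[i]."""
    sph = [spherical(p) for p in pos]
    return {(l, m): sum(qi * ri**l * Y(l, m, ti, pi).conjugate()
                        for qi, (ri, ti, pi) in zip(q, sph))
            for l in range(P + 1) for m in range(-l, l + 1)}

def far_potential(coeffs, r_obs, P):
    """Eq. (4.10): the potential at a far point from the coefficients alone."""
    r, theta, phi = spherical(r_obs)
    return sum(coeffs[l, m] / r**(l + 1) * Y(l, m, theta, phi)
               for l in range(P + 1) for m in range(-l, l + 1)).real

q = [1.0, -0.7, 0.4]                                  # hypothetical charges
pos = [(0.3, 0.1, -0.2), (-0.2, 0.25, 0.1), (0.1, -0.3, 0.2)]
r_obs = (1.5, 0.5, 1.0)                               # well outside the cluster
exact = sum(qi / math.dist(r_obs, p) for qi, p in zip(q, pos))
approx = far_potential(multipole_coeffs(q, pos, P=8), r_obs, P=8)
```

The monopole coefficient M_0^0 is simply the total charge, and with ρ/r ≈ 0.2 the P = 8 expansion reproduces the direct sum to better than 10⁻⁵.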
4.2 Algorithmic Description
In this section, the FMM algorithm in two dimensions is described using the equations of section 4.1. We are given a 2-D computational domain with N mesh points, where each point can possibly contain a charge. Our goal is to calculate the potential at each mesh point in the domain.
4.2.1 Hierarchical Subdivisions
We recursively subdivide the computational domain into 4 equal parts until we reach the refinement level of individual mesh points, obtaining a hierarchically subdivided computational domain. In formal terms, we start with the whole domain; let us call it refinement level 0.
Then we subdivide it into 4 equal subdomains to get to level 1. We keep subdividing the existing subdivisions at each level until we reach level log4 N, after which no further subdivisions are possible, since we have reached the refinement level of individual mesh points. Note that at refinement level L, there are 4^L FMM cells, each containing N/4^L mesh points.
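The bookkeeping of this subdivision is simple arithmetic; a minimal sketch (assuming N is a power of 4):

```python
import math

def refinement_levels(N):
    """Per-level data for a 2D mesh of N points: at level L there are
    4**L FMM cells of N / 4**L mesh points each (Sec. 4.2.1)."""
    depth = round(math.log(N, 4))
    return [(L, 4**L, N // 4**L) for L in range(depth + 1)]

levels = refinement_levels(64)   # an 8 x 8 mesh of 64 points
```

For 64 mesh points the hierarchy runs from one cell of 64 points at level 0 down to 64 single-point cells at level 3.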
4.2.2 Clustering of Charges
We index the mesh points inside an FMM cell with the variable i and let the charge at mesh point i be qi. We also define ri(ri, θi, φi) as the vector from the center of the FMM cell to mesh point i. We form clusters of charges inside each FMM cell and represent them by multipole coefficients. The multipole coefficients (Ml^m) for a given FMM cell are calculated from the charge distribution inside the cell using Eq. (4.12).
4.2.3 Far-field Potential
Let us call the FMM cell containing the cluster of charges under consideration the source cell, and define r(r, θ, φ) as a vector from the center of the source cell to any point in space. The multipole coefficients, representing the cluster of charges in the source cell, can now be used to evaluate the potential contribution from that cluster at any point r in the region that excludes the extent of the source cell. From Fig. 4.4, it can be seen that this region includes all the FMM cells in the computational domain except those immediately neighboring the source cell. We define a subset of this region, called the interaction region, as the children of the neighbors of the source cell's parent, excluding the source cell's immediate neighbors. See Fig. 4.4 for these topological relations. For any source cell at any refinement level in the computational domain, there can be at most 27 interaction cells in such an interaction region.
We limit the evaluation of potential contributions to these interaction cells only, because
FMM cells exterior to the interaction region have already been taken into account at the
previous, coarser refinement levels. Eq. (4.10) can now be used to evaluate the potential
contributions to all the mesh points of these interaction cells.
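The interaction list described above can also be derived directly from integer cell coordinates; the sketch below (hypothetical names, given only for illustration; the thesis implementation instead precomputes these relations in a tree, see Sec. 4.5.1) enumerates the children of the parent's neighbors and removes the cell's own near neighbors:

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

struct Cell { int x, y, level; };   // integer coordinates at a given level

// True if cells a and b (same level) are identical or touch each other.
static bool isNearNeighbor(const Cell& a, const Cell& b) {
    return std::abs(a.x - b.x) <= 1 && std::abs(a.y - b.y) <= 1;
}

// Interaction list of c (level >= 1): children of the neighbors of c's
// parent, minus c itself and its near neighbors. At most 27 cells in 2D.
std::vector<Cell> interactionList(const Cell& c) {
    std::vector<Cell> list;
    int px = c.x / 2, py = c.y / 2;            // parent coordinates
    int parentsPerSide = (1 << c.level) / 2;
    for (int ny = py - 1; ny <= py + 1; ++ny)  // the parent's neighbors
        for (int nx = px - 1; nx <= px + 1; ++nx) {
            if (nx < 0 || ny < 0 ||
                nx >= parentsPerSide || ny >= parentsPerSide) continue;
            for (int cy = 0; cy < 2; ++cy)     // their four children each
                for (int cx = 0; cx < 2; ++cx) {
                    Cell cand{2 * nx + cx, 2 * ny + cy, c.level};
                    if (!isNearNeighbor(cand, c)) list.push_back(cand);
                }
        }
    return list;
}
```

An interior cell gets the full 27 interaction cells; cells at the domain boundary get fewer, which is why "at most 27" appears in the text.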
4.2.4 Recursion
Potential contributions to the immediate neighbors of the source cell cannot be evaluated
using Eq. (4.10), because parts of these neighboring cells lie within the extent of the source cell,
Figure 4.4: Subdivision of the 2D computational domain in the FMM algorithm. (a) Potential evaluation at level L from one cell. (b) Potential evaluation at level L + 1 from one of the child cells of the active cells in the previous step.
where the multipole expansion is not valid. Contributions to these cells are evaluated at the
next finer level of refinement, and this is where recursion enters the FMM algorithm.
4.2.5 Near-field Potential
At the last refinement level (log_4(N)), there are no finer refinement levels to take care of
the source cell's neighbors. Potential contributions to them now have to be evaluated
directly according to Eq. (4.3); this is the only stage of the FMM algorithm that uses
direct pairwise evaluations. But since each source cell has at most 8 neighbors, only
8N such evaluations are required at the finest level, and this part thus scales as O(N).
4.3 Pseudocode
4.3.1 Recursive
The recursive pseudocode, which follows the description of the FMM given above, is given below.
function FMM(source_cell)
    CALCULATE_MULTIPOLE_COEFFICIENTS(source_cell)
    for interaction_cell = 1:27
        EVALUATE_POTENTIAL_BY_MULTIPOLE_SERIES(interaction_cell)
    end
    if source_cell.level == log_4(N)
        for neighboring_cell = 1:8
            EVALUATE_POTENTIAL_DIRECTLY(neighboring_cell)
        end
    else
        SUBDIVIDE_IN_4(source_cell)
        for c = 1:4
            FMM(source_cell.child[c]) /* Recursion */
        end
    end
end
4.3.2 Iterative
The iterative pseudocode, which aids the complexity analysis, is given below.
for level = 0:log_4(N)
    for source_cell = 1:4^level
        CALCULATE_MULTIPOLE_COEFFICIENTS(source_cell)
        for interaction_cell = 1:27
            EVALUATE_POTENTIAL_BY_MULTIPOLE_SERIES(interaction_cell)
        end
    end
end
for level = log_4(N)
    for source_cell = 1:4^level
        for neighboring_cell = 1:8
            EVALUATE_POTENTIAL_DIRECTLY(neighboring_cell)
        end
    end
end
4.4 Complexity Analysis
Refer to the iterative pseudocode of section 4.3.2. We determine the complexity first
for each source cell, then for each level, and finally for the whole computational domain.
Consider a source cell at refinement level L containing N/4^L mesh points. For a multipole
series truncated at l = P, we need (P + 1)^2 multipole coefficients, so the number of
operations required to calculate all the multipole coefficients is

(P + 1)^2 N / 4^L    (4.18)
Similarly, the number of operations required to evaluate the multipole expansions for the
N/4^L mesh points in each of the 27 interaction cells is

27 (P + 1)^2 N / 4^L    (4.19)
There are 4^L cells at refinement level L, so the number of operations used during level
L is

Σ_{s=1}^{4^L} [ (P + 1)^2 N / 4^L + 27 (P + 1)^2 N / 4^L ]    (4.20)

= 4^L · [ 28 (P + 1)^2 N / 4^L ]    (4.21)

= 28 (P + 1)^2 N    (4.22)
Finally, there are (log_4(N) + 1) refinement levels in the FMM algorithm, hence the number
of operations used during all the levels is

Σ_{L=0}^{log_4(N)} 28 (P + 1)^2 N    (4.23)

= 28 (P + 1)^2 N (log_4(N) + 1) ≈ 28 (P + 1)^2 N log_4(N)    (4.24)
An additional 8N operations are required to calculate the direct pairwise potential
contributions at the finest level. Thus the total instruction count for the FMM algorithm
is

28 (P + 1)^2 N log_4(N) + 8N    (4.25)

Hence the asymptotic running time (complexity) of the FMM algorithm is

O( 28 (P + 1)^2 N log_4(N) + 8N )    (4.26)

= O(N log N)    (4.27)
The reason for this reduced complexity, compared to the O(N^2) of direct summation, is
that interactions between far-lying mesh points are evaluated using multipole expansions
over large subdomains. Interactions between closer-lying regions are also accounted for
with multipole expansions, but at finer levels of the hierarchy, using smaller subdomains
for the expansion. Only the potential contributions from immediate neighbors at the
finest level are calculated by direct pairwise evaluation.
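The two operation counts can be compared directly (a minimal sketch using the constants derived above, with P = 3; this is only the instruction-count model, not a timing measurement):

```cpp
#include <cassert>
#include <cmath>

// Operation-count models from the analysis above, for truncation order P.
double fmmOps(double N, int P = 3) {
    double log4N = std::log2(N) / 2.0;                 // log_4(N)
    return 28.0 * (P + 1) * (P + 1) * N * log4N + 8.0 * N;   // Eq. (4.25)
}

double directOps(double N) { return N * N; }           // direct summation
```

The modeled crossover lands near N ≈ 4 · 10^3; the measured crossover in Fig. 4.10 is somewhat larger (about 2 · 10^4), since the per-operation costs of the two methods differ.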
4.5 Implementation
The FMM algorithm requires a lot of book-keeping and is much harder to implement than
direct pairwise summation or Fast Fourier Transform (FFT).
4.5.1 Data Structures
Since we use the FMM to calculate the potential, it is invoked at least once (up to four
times for 4th-order Runge-Kutta, see chapter 5) during each time iteration of a dynamic
simulation. To save running time across thousands of FMM invocations when finding
neighbors, building interaction lists, and performing other topological operations, we store
all the hierarchical subdivisions, along with their topological relations, in a tree structure
in memory. Since none of the topological relations change over time, this scheme saves
running time by reducing computation at the expense of using more memory.
Figure 4.5: FMM tree structure.
Fig. 4.5 shows the first few levels of the tree structure. Each node represents a subdivision
(FMM cell) of the computational domain, and the children of a node represent its further
subdivisions at the next finer level. The tree structure alone offers no obvious way to
find neighbors or to build the interaction list. This is accomplished with a hierarchical
indexing scheme [8] that numbers each FMM cell as shown in Fig. 4.6. The indexing scheme
keeps track of the spatial coordinates (x, y) of each cell, as shown in Fig. 4.7, and hence
allows the topological relations among subdivisions to be extracted.
Figure 4.6: FMM hierarchical indexing based on tree structure.
[Figure 4.7 example: the level-4 cell index 0312)_4 = 00 11 01 10)_2 bit-interleaves into the coordinates 0110)_2 and 0101)_2, i.e. (x, y) = (6, 5).]
Figure 4.7: Extraction of spatial coordinates with hierarchical indexing.
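The coordinate extraction of Fig. 4.7 amounts to de-interleaving the bits of the base-4 cell index (a minimal sketch of this Morton-style indexing; the function name is illustrative, and the bit convention is chosen to match the 0 1 / 2 3 quadrant numbering of Fig. 4.6):

```cpp
#include <cassert>

// Split a hierarchical (base-4) cell index into (x, y) coordinates by
// de-interleaving its bits: even bit positions -> x, odd positions -> y.
// Each quadrant digit d contributes bit (d & 1) to x and bit (d >> 1) to y.
void indexToXY(unsigned index, int level, unsigned* x, unsigned* y) {
    *x = *y = 0;
    for (int b = 0; b < level; ++b) {
        *x |= ((index >> (2 * b)) & 1u) << b;       // even bits -> x
        *y |= ((index >> (2 * b + 1)) & 1u) << b;   // odd bits  -> y
    }
}
```

For the example of Fig. 4.7, the index 0312)_4 (= 54 in decimal) at level 4 de-interleaves into (x, y) = (6, 5).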
4.5.2 Custom Data Types
Since the FMM algorithm makes heavy use of 3D vectors and complex numbers, we
implemented custom data types (C++ classes) to efficiently represent these quantities and
to keep the code readable. Examples of complex quantities are the multipole coefficients
and spherical harmonics; examples of 3D vectors are the space vectors representing mesh
points in the computational domain.
4.5.3 Mathematical Functions
The associated Legendre functions, P_l^{|m|}(x), are implemented as a lookup table in l and m
but computed in x. This is because x is a continuous variable in the interval [−1, 1], while
l and m take only discrete values from the small finite set {l × m}, where

l = {0, 1, 2, . . . , P}    (4.28)

m = {0, 1, 2, . . . , l}    (4.29)
For the usual case of P = 3,

|{l × m}| = Σ_{l=0}^{P} Σ_{m=0}^{l} 1    (4.30)

= (P + 1)(P + 2) / 2    (4.31)

= 10    (4.32)
The factorial function is also implemented as a lookup table, because the largest number
for which a factorial is ever needed is max(l + |m|) = 2P, which for P = 3 is just 6.
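A minimal sketch of such lookup tables is shown below; the recurrences for P_l^m are the standard ones (including the Condon-Shortley phase) and are not necessarily those used in the thesis code:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Factorial lookup: only values up to (2P)! = 6! are ever needed for P = 3.
static const double kFactorial[7] = {1, 1, 2, 6, 24, 120, 720};

// Associated Legendre functions P_l^m(x) for all 0 <= m <= l <= P at a
// given x, built with the standard recurrences and returned as a flat
// table indexed by (l, m).
std::vector<double> legendreTable(int P, double x) {
    std::vector<double> p((P + 1) * (P + 1), 0.0);
    auto at = [&](int l, int m) -> double& { return p[l * (P + 1) + m]; };
    at(0, 0) = 1.0;
    // Diagonal: P_m^m = -(2m-1) sqrt(1-x^2) P_{m-1}^{m-1}
    for (int m = 1; m <= P; ++m)
        at(m, m) = -(2 * m - 1) * std::sqrt(1 - x * x) * at(m - 1, m - 1);
    // First off-diagonal: P_{m+1}^m = (2m+1) x P_m^m
    for (int m = 0; m < P; ++m)
        at(m + 1, m) = (2 * m + 1) * x * at(m, m);
    // Upward recurrence in l for the remaining entries
    for (int m = 0; m <= P; ++m)
        for (int l = m + 2; l <= P; ++l)
            at(l, m) = ((2 * l - 1) * x * at(l - 1, m) -
                        (l + m - 1) * at(l - 2, m)) / (l - m);
    return p;
}
```

Spot checks against the closed forms P_1^0(x) = x and P_2^0(x) = (3x^2 − 1)/2 confirm the recurrences.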
4.5.4 Extension of 2D Algorithm to 3D
Figure 4.8: 2D FMM for 3D structures.
As nanomagnet logic devices (the first target of this simulator) are planar structures,
we implemented the FMM algorithm in 2D and extended it to 3D in a brute-force manner.
The structure is divided into a set of planar layers, and the hierarchical subdivisions are
applied separately to each layer in turn. The potential contributions from the source cell
of the current layer to the interaction cells of all the layers are evaluated and summed
up over the 3D volume, as shown in Fig. 4.8.
28
This method scales as O(N_layer^2 · N_plane log N_plane), where N_layer is the number of
layers and N_plane is the number of mesh elements per layer. For typical structures of
concern, such as nanomagnetic logic structures, N_layer ≈ 10 and N_plane ≈ 10^6, so this
appears to be a practical solution.
4.6 Accuracy
As shown in Fig. 4.9, the accuracy of the FMM algorithm is on the order of 10^-4; even the
maximum error is only 10^-3 for P = 3, which serves our purpose well. As can be seen in
Fig. 4.9(d), the error attains its extrema at the corners of the hierarchical FMM cells.
This is due to the fact that the multipole expansion is more accurate at larger distances
from the charge cluster than at nearer points.
4.7 Performance
The FMM algorithm carries a significant overhead. It is time-consuming to calculate the
(P + 1)^2 multipole coefficients for each subdivision and then to evaluate potential
contributions through the series expansion, and careful indexing and book-keeping are
required so that all interactions are considered exactly once and at the appropriate level
of the hierarchy. Thus the FMM is justified only for relatively large problems, where its
O(N log N) running time pays off while that of the direct calculation, O(N^2), becomes
prohibitively large (see Fig. 4.10).
[Plot panels of Fig. 4.9; the relative-error panel reports RMS = 2.1e-04.]
Figure 4.9: Accuracy of FMM compared with direct summation of potential. (a) Randomcharge distribution. (b) Potential from direct summation. (c) Potential from FMM. (d)Relative error between the two potentials.
Figure 4.10: Running time for FMM versus direct summation. FMM runs faster forproblems larger than 2 · 104 mesh points due to its O(N logN) complexity.
Chapter 5
Landau-Lifshitz-Gilbert Equation
Solver
As mentioned in section 2.2, the LLG equation governs the time-evolution of the
magnetization vectors in a magnetic material. The equation is repeated here for convenience.
∂M(r, t)/∂t = −γ M(r, t) × H_eff(r, t) − (αγ/M_s) M(r, t) × ( M(r, t) × H_eff(r, t) )    (5.1)
where all the quantities in the LLG equation have already been described in sections 2.1
and 2.2.

The LLG equation is a system of non-linear partial differential equations. But since all the
space derivatives are embedded in the Laplacian, divergence and gradient operations,
which are used only during the separate calculation of the effective field (H_eff), we treat
the LLG equation as a system of ordinary differential equations (ODEs) in time and solve
it as any other initial value problem (IVP).
5.1 Euler’s Solver
For an IVP, the value of the dependent variable (M) at the next time instant can be
calculated from its value and derivative at the current time using the straightforward
point-slope formula:

M(r, t + Δt) = M(r, t) + (∂M(r, t)/∂t) · Δt    (5.2)
Eq. (5.2) is called Euler's method for solving an IVP and is illustrated in Fig. 5.1. It is
applied at each time-step to obtain the value at the next time instant, which is why it is
also sometimes termed a time-marching method.
Figure 5.1: Euler’s method for solving an IVP.
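One Euler step can be written generically for any ODE dy/dt = f(y, t) (a scalar sketch for brevity; in the simulator the state is the full magnetization field M(r, t)):

```cpp
#include <cassert>
#include <cmath>

// One forward-Euler step, Eq. (5.2): y(t + dt) = y(t) + f(y, t) * dt.
template <typename F>
double eulerStep(F f, double y, double t, double dt) {
    return y + f(y, t) * dt;
}
```

Applied to exponential decay dy/dt = −y, the method is only first-order accurate: the error after integrating to t = 1 shrinks roughly linearly with the step size.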
5.2 4th Order Runge-Kutta Solver
5.2.1 Overview
We chose the 4th-order Runge-Kutta method for solving the LLG equation because of its
much better numerical stability. The basic idea is that instead of using just one slope to
estimate the value at the next time instant, we use four slopes and take their weighted
average, making the method fourth-order accurate. Although it requires 4 updates of the
effective field for a single time-step, it still behaves better and is more stable than
Euler's method with a 4-times-smaller time-step.
5.2.2 Mathematical Description
Let us denote the right-hand side of Eq. (5.1) by

∂M(r, t)/∂t = f( M(r, t), t )    (5.3)
Then, according to the Runge-Kutta method, the value of M at the next time instant
can be calculated from its value at the current time and an effective slope k:

M(r, t + Δt) = M(r, t) + k · Δt    (5.4)

k = k1/6 + k2/3 + k3/3 + k4/6    (5.5)
where k1, k2, k3 and k4 are the slopes at different points along the interval [t, t + Δt],
as shown in Fig. 5.2 and defined below.
k1 = f( M(r, t), t )    (5.6)

k2 = f( M(r, t) + k1 Δt/2, t + Δt/2 )    (5.7)

k3 = f( M(r, t) + k2 Δt/2, t + Δt/2 )    (5.8)

k4 = f( M(r, t) + k3 Δt, t + Δt )    (5.9)
Figure 5.2: 4th order Runge-Kutta method. The slope k is the weighted average ofk1,k2,k3 and k4.
The above steps of the Runge-Kutta method are applied at each time-step to evolve the
magnetization, M(r, t), over time.
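The four-slope update of Eqs. (5.4) to (5.9) takes a compact form in code (again a scalar sketch; in the simulator the state is the vector field M(r, t) and each evaluation of f recomputes the effective field):

```cpp
#include <cassert>
#include <cmath>

// One classical 4th-order Runge-Kutta step for dy/dt = f(y, t),
// following Eqs. (5.4)-(5.9).
template <typename F>
double rk4Step(F f, double y, double t, double dt) {
    double k1 = f(y, t);
    double k2 = f(y + k1 * dt / 2, t + dt / 2);
    double k3 = f(y + k2 * dt / 2, t + dt / 2);
    double k4 = f(y + k3 * dt, t + dt);
    return y + (k1 / 6 + k2 / 3 + k3 / 3 + k4 / 6) * dt;
}
```

On the same decay problem as above, ten coarse steps of size 0.1 already reproduce e^{-1} to better than 10^-5, illustrating the fourth-order accuracy.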
5.3 Adaptive Time-stepping
The choice of the time-step (Δt) in IVP solvers has a significant impact on their accuracy,
stability and running time. The simplest approach is to always use a fixed, conservative
time-step. Though very stable, this approach wastes much of the running time of the
simulation by not making large strides during periods of low dynamic activity. Usually,
one is only interested in capturing fine details during periods of high dynamic activity,
while periods of low dynamic activity should be captured in less detail to save running
time as well as storage.
5.3.1 Overview of Algorithm
We use a simple method to choose an optimal time-step on the fly: we monitor the
maximum torque on the magnetization vectors in the material and scale the time-step up
or down accordingly. If the maximum torque is growing, which indicates that the system
is approaching a period of high dynamic activity, we reduce the time-step to capture finer
details of the time-evolution of the magnetization. If, on the other hand, the maximum
torque is shrinking, which indicates that the system is approaching a period of low dynamic
activity, we increase the time-step to make a few large strides and pass quickly through
the less interesting parts of the time-evolution.
5.3.2 Mathematical Description
At each time-step, we sample a quantity from the system that is representative of the
torque on the magnetization vectors in the material. Let this quantity be called τ:

τ(r, t) = M(r, t) × H_eff(r, t) / M_s^2    (5.10)
where M_s is the saturation magnetization of the material. We then take the maximum
over all torques in the material,

τ_max(t) = max_r τ(r, t),    (5.11)
and use this quantity to scale the time-step:

Δt(t) = k / τ_max(t)    (5.12)

where k is a parameter in the range of 10^-14 to 10^-12. The inverse relation between Δt
and τ_max(t) is also depicted in Fig. 5.3.
Figure 5.3: Time-evolution of torque and time-step.
When Eq. (5.12) is used to choose the time-step, spurious oscillations are sometimes
introduced into the system when it is close to settling down. To prevent these artifacts,
we also monitor the magnetostatic and exchange energies at each time-step:

E_demag(t) = −∫_volume M(r, t) · H_demag(r, t) dr    (5.13)

E_exch(t) = −∫_volume M(r, t) · H_exch(r, t) dr    (5.14)
Since the sum of these energies must always decrease over time (Fig. 5.4), we reject any
time-step that would violate this, and recompute it with a forcibly reduced, conservative
time-step. This is similar to the criterion used in the well-established OOMMF code [12].
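The control logic can be summarized as follows (a schematic sketch with scalar stand-ins for the torque and energy computations; the helper names are illustrative, and k is the tunable parameter of Eq. (5.12)):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Time-step from Eq. (5.12): dt = k / tau_max, where tau_max is the
// largest normalized torque |M x Heff| / Ms^2 over all mesh points.
double adaptiveTimeStep(const std::vector<double>& torques, double k) {
    double tauMax = *std::max_element(torques.begin(), torques.end());
    return k / tauMax;
}

// Energy-based rejection: accept a step only if the total magnetic
// energy (Edemag + Eexch) did not increase; otherwise the step is
// redone with a forcibly reduced, conservative time-step.
bool acceptStep(double energyBefore, double energyAfter) {
    return energyAfter <= energyBefore;
}
```

High torque (fast dynamics) yields a small Δt, low torque a large one, and any step that would raise the total energy is rejected and retried.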
[Plot of Fig. 5.4: energy (eV) versus time over roughly 1 ns.]
Figure 5.4: Time-evolution of energy (Edemag + Eexch).
Chapter 6
Simulation Examples
In this chapter, we discuss two simulation examples and compare the results with the
literature and with well-established codes to demonstrate that our code works correctly.
6.1 Ordering of Nanowires
The simplest test structure to demonstrate is the antiferromagnetic ordering of
nanomagnet wires. Fig. 6.1 shows the two wires of different lengths that we investigated.
We exposed them to an external ramping field to aid switching in the correct order and
let them settle for a few nanoseconds.
Our simulation results showed that the shorter wire ordered perfectly, while the longer one
had 2 imperfections out of 6 field-coupling points. These results were in complete
agreement with OOMMF simulations and earlier experimental results.
6.2 Phase Transition of Nanomagnets
The multi-domain to single-domain transition is one of the most characteristic phenomena
of nanomagnetic behavior. For small magnets, the exchange fields, which are responsible
for aligning the magnetizations with each other, are stronger and drive the magnet into a
homogeneously magnetized (single-domain) state. For larger magnets, the demagnetization
field takes over from the exchange field and promotes the formation of a vortex state. The
transition between the two states takes place at a well-defined magnet size. A value in
close agreement with established codes and experimental data can prove the accurate
calculation of
Figure 6.1: Ordering of nanowires. (a) Perfectly ordered short wire. (b) Imperfectly ordered long wire. (c) Applied external field to aid in the switching of magnets.
magnetostatic and exchange fields as well as the LLG solver. This makes it a very
comprehensive test of any micromagnetic simulator, as it exercises almost all of the
physics involved.
We investigated a 5 nm thick, square-shaped permalloy nanomagnet of different sizes. For
each magnet size, we performed two simulations: (1) starting from a single-domain
magnetization state and calculating the magnetic energy after letting it settle, and
(2) starting from a vortex magnetization state and doing the same.
It is well established that, when starting from a random state or in the presence of
thermal noise, a magnet will always converge to the minimum-energy state. Thus the phase
transition occurs when the energy difference between the two relaxed states is zero.
ΔE = E_1domain − E_vortex:
    ΔE < 0 : single-domain state
    ΔE = 0 : phase transition
    ΔE > 0 : vortex state
This energy difference is shown in Fig. 6.2 as a function of the magnet size. The figure
also shows that the phase transition occurs at 185 nm, in good agreement with the results
given by well-established codes such as OOMMF [12]. This benchmark proves that all the
components of our simulator code work correctly.
Figure 6.2: Phase transition between the single-domain and vortex states of a square-shaped permalloy nanomagnet. The energy difference (ΔE = E_1domain − E_vortex) between the relaxed single-domain and vortex states is plotted as a function of magnet size and shows that the transition occurs at around 185×185 nm.
Chapter 7
Parallelization on GPU
In this chapter, we discuss the parallelization of the computationally intensive parts of
the simulator on the GPU and show performance benchmarks for the parallelized code. To
put the benchmark figures in perspective: we used a workstation with an Intel Xeon X5650
CPU (6 cores) with 24 GB of main memory and an NVIDIA Tesla C2050 GPU (448 cores)
with 3 GB of its own DRAM. Although the CPU has 6 cores, for benchmarking purposes
we only compared the performance of the parallelized GPU code against that of the
single-threaded CPU code.
7.1 Parallelization of the LLG Solver
The LLG solver, a 4th-order Runge-Kutta method (cf. Sec. 5.2), was easily parallelized
by domain decomposition to achieve a speed-up of 100x (cf. Fig. 7.1). In addition, all the
other local operations, like the exchange field (Eq. (2.4)) and the gradient and divergence
operations, can likewise be parallelized. However, for any non-trivially sized problem
(number of mesh elements > 100), the potential calculation (Eq. (3.7)) is by far the most
time-consuming part (more than 90% of the running time), so we focused on the GPU
implementation of this part only. We used OpenMP on the CPU to parallelize all the local
operations, including the LLG solver.
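Local, point-wise operations parallelize trivially because each mesh point is independent; a minimal sketch of the OpenMP pattern (illustrative only, not the simulator's actual kernels; the pragma degrades to a no-op when OpenMP is disabled):

```cpp
#include <vector>

// Point-wise update of a field: every mesh point is independent, so the
// loop can be split across CPU threads with a single OpenMP pragma.
void scaleField(std::vector<double>& field, double factor) {
    #pragma omp parallel for
    for (long i = 0; i < (long)field.size(); ++i)
        field[i] *= factor;
}
```

The same domain-decomposition idea, one worker per mesh point, carries over directly to the GPU version of the LLG solver.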
7.2 Parallelization of the Direct Summation Method
To implement Eq. (3.7), we need two nested loops, one over sources (r_i) and one over
observers (r), and their order is interchangeable. The NVIDIA GPU architecture likewise
offers two levels of parallelism
Figure 7.1: Speed-up of LLG solver on GPU.
– multiprocessor (block) level and thread level, so this problem maps straightforwardly to
the GPU.
At the observer (r) level, we have to sum the potential contributions from N sources,

Φ(r) = Σ_{r_i ∈ volume} q(r_i) / |r − r_i|    (7.1)

and to execute, for each source (r_i), a read-modify-write instruction of the form

Φ(r) = Φ(r) + q(r_i) / |r − r_i| ;    r_i ∈ volume    (7.2)
This read-modify-write instruction requires data sharing and synchronization among the
parallel workers. We map sources (r_i) to threads and observers (r) to blocks, because
there is no efficient way to share data among blocks, while sharing among threads is
highly efficient. To sum up the potential contributions from N sources (threads) at a
single observer, the parallel reduction algorithm [13] is used, which exploits the
architectural features of the GPU efficiently (Fig. 7.2).
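The reduction pattern can be illustrated on the CPU (a serial sketch of the stride-halving scheme; on the GPU each index i of the inner loop is a thread within a block, and the real CUDA kernel additionally uses shared memory and barrier synchronization between passes):

```cpp
#include <cmath>
#include <vector>

// Tree (stride-halving) reduction: in each pass, element i accumulates
// element i + stride. After log2(n) passes the sum sits in element 0.
double treeReduce(std::vector<double> vals) {
    // Pad to the next power of two so the strides divide evenly.
    vals.resize(1u << (unsigned)std::ceil(std::log2(vals.size())), 0.0);
    for (size_t stride = vals.size() / 2; stride > 0; stride /= 2)
        for (size_t i = 0; i < stride; ++i)   // runs in parallel on the GPU
            vals[i] += vals[i + stride];
    return vals[0];
}
```

For one observer, the per-source contributions q_i / |r − r_i| of Eq. (7.1) are reduced this way in O(log N) parallel passes instead of N sequential additions.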
Figure 7.2: Parallel reduction algorithm.
7.3 Parallelization of the FMM
The complexity analysis of Sec. 4.4 shows that the running time of the potential evaluation
through the multipole expansion (Eq. (4.10)) is at least 27 N_layers times greater than that
of the coefficient calculation (Eq. (4.12)). So, as a first step, we parallelized only this
part of the FMM algorithm on the GPU. As a work-sharing scheme, it is natural to
distribute the 27 interaction cells over GPU blocks, since they do not have to share data
with each other. The threads within a GPU block are then mapped to the individual mesh
points inside the corresponding interaction cell, where each thread evaluates the multipole
expansion to contribute to the potential value of its mesh point.
The downside of this naive approach to FMM parallelization is that the GPU computation
is launched in one of the inner nested loops of the FMM algorithm, i.e. it has to be
restarted for each cell in the FMM hierarchy. Therefore, data has to be transferred between
the CPU and GPU memories several times during a single FMM run, which becomes a
severe bottleneck. Work is in progress to eliminate this bottleneck by implementing the
entire FMM algorithm on the GPU.
7.4 Performance Benchmarks of the Parallelized Code
First of all, we verified that the parallel (GPU) and single-threaded (CPU) codes give the
same results to within floating-point accuracy.
The direct potential calculation is straightforward to parallelize, as discussed in Sec. 7.2.
Speed-up factors (compared to the single-threaded CPU version) are shown in Fig. 7.3.
As the other components of the simulator run on the CPU, the GPU computation has to
be launched in each iteration step of the time loop. Nevertheless, speed-ups of up to 30x
are obtained for large problems, where the GPU launch time becomes negligible compared
to the computation time.
Figure 7.3: Speed-up of direct summation method on GPU. Speed-up factors of about 30xare obtained for large problems.
Fig. 7.4 shows a modest speed-up (3x to 4x) for the FMM-based field solver. This is due
to the fact that, in the current implementation, only a fraction of the FMM algorithm
runs on the GPU, and the GPU computation is launched several times even for a single
potential calculation.
For part of the potential calculation, at the coarser levels of the FMM hierarchy, a very
large number of potential values have to be evaluated using the same multipole-expansion
coefficients, so the GPU is launched fewer times. These parts of the FMM algorithm show
a 10x speed-up on the GPU, which suggests that implementing the entire FMM algorithm
on the GPU would accelerate the calculation at least that much in the future.
Yet even in this unoptimized version, the parallelized FMM shows its strength for very
large problems (N > 2 · 10^5), where it clearly outperforms the direct summation method
(Fig. 7.5) thanks to its better O(N log N) complexity. For smaller problems, however, the
well-parallelized direct summation method runs faster.
Figure 7.4: Speed-up of FMM on GPU. Speed-up factors of about 3x – 4x are achievedbut are as high as 10x for the coarsest part of the FMM algorithm.
Figure 7.5: Comparison of running times of direct summation method and FMM on GPU.FMM runs faster for problems larger than 2 · 105 mesh points due to its O(N logN)complexity.
Chapter 8
Conclusion and Future Work
The goal of this work was to build a three-dimensional, parallelized, GPU-accelerated
micromagnetic simulator geared towards the simulation of nanomagnetic computing
devices. We started the work from scratch and implemented all the algorithms without
reusing program modules from any other source. The most involved part of the work is
the magnetostatic potential (field) calculation, for which we successfully implemented two
algorithms:
1. Direct summation
2. Fast Multipole Method (FMM)
The direct potential calculation is straightforward to parallelize by domain decomposition,
and without much optimization we already achieved a speed-up of 30x compared to a
single-threaded CPU version. The FMM, however, is far more complex; so far only a
fraction of the algorithm has been ported to the GPU, which resulted in a moderate
speed-up of 4x.
We also tested our simulator against standard micromagnetic problems and found it to
work very accurately, in agreement with well-established codes such as OOMMF [12].
Our work

1. resulted in a fully customizable, high-performance micromagnetic simulator that will
be the basis for future simulation work of research groups working on nanomagnetic logic
devices, and

2. demonstrated the possibility of achieving significant speed-ups by performing the most
computationally intensive parts of the calculations on the GPU.
In the next phase of the work, we will focus on porting the entire field-calculation routine
to the GPU and on optimizing the FMM algorithm in order to avoid the bottleneck of too
many time-consuming memory transfers between the CPU and the GPU. We expect that
speed-ups of up to 50x will be achievable for the FMM-based calculation.
Bibliography
[1] Jacques E. Miltat and Michael J. Donahue. Numerical micromagnetics: Finite differ-
ence methods. In Handbook of Magnetism and Advanced Magnetic Materials, volume
2: Micromagnetism. John Wiley & Sons, 2007.
[2] A. Imre, G. Csaba, L. Ji, A. Orlov, G.H. Bernstein, and W. Porod. Majority logic
gate for magnetic quantum-dot cellular automata. Science, 311 no. 5758:205 –208,
2006.
[3] Gyorgy Csaba. Computing with Field-Coupled Nanomagnets. PhD thesis, University
of Notre Dame, 2003.
[4] J.D. Owens, M. Houston, D. Luebke, S. Green, J.E. Stone, and J.C. Phillips. GPU
computing. Proceedings of the IEEE, 96(5):879–899, May 2008.
[5] Shaojing Li, B. Livshitz, and V. Lomakin. Graphics processing unit accelerated O(N)
micromagnetic solver. IEEE Transactions on Magnetics, 46(6):2373–2375, June 2010.
[6] Richard L. Coren. Basic Engineering Electromagnetics – An Applied Approach.
Prentice-Hall, 1989. Chapter 7.
[7] Leslie Greengard. The Rapid Evaluation of Potential Fields in Particle Systems. MIT
Press, Cambridge, MA, 1998.
[8] N.A. Gumerov, R. Duraiswami, and E.A. Borovikov. Data structures, optimal choice
of parameters, and complexity results for generalized multilevel fast multipole methods
in d dimensions. Computer Science Technical Report 4458, University of Maryland,
College Park, 2003.
[9] The Best of the 20th Century: Editors Name Top 10 Algorithms. SIAM News, 33(4).
[10] Gregory Brown, M. A. Novotny, and Per Arne Rikvold. Langevin simulation
of thermally activated magnetization reversal in nanoscale pillars. Phys. Rev. B,
64(13):134422, Sep 2001.
[11] John David Jackson. Classical Electrodynamics. John Wiley & Sons, 3rd edition,
1999. Chapter 2 & 3.
[12] The Object Oriented MicroMagnetic Framework (OOMMF). http://math.nist.gov/oommf.
A widely used, public-domain micromagnetic program.
[13] Mark Harris. Optimizing parallel reduction in CUDA. In NVIDIA CUDA SDK –
Data-Parallel Algorithms. 2007.