
Advances in Engineering Software 59 (2013) 38–52


A distributed-memory parallel technique for two-dimensional mesh generation for arbitrary domains

0965-9978/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.advengsoft.2013.03.005

⇑ Corresponding author. Address: Department of Computing (DC), Federal University of Ceará (UFC), Campus do Pici, Humberto Monte Av., Fortaleza, CE 60.455-760, Brazil. Tel.: +55 85 3366 9847; fax: +55 85 3666 9837.

E-mail address: [email protected] (M.O. Freitas).

Markos O. Freitas a,⇑, Paul A. Wawrzynek b, Joaquim B. Cavalcante-Neto a, Creto A. Vidal a, Luiz F. Martha c, Anthony R. Ingraffea b

a Computer Graphics, Virtual Reality and Animation Group (CRAb), Department of Computing (DC), Federal University of Ceará (UFC), Brazil b Cornell Fracture Group (CFG), Department of Civil and Environmental Engineering, Cornell University, United States c Computer Graphics Technology Group (Tecgraf), Department of Civil Engineering, Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Brazil

Article info

Article history: Received 13 June 2012; Received in revised form 21 March 2013; Accepted 24 March 2013

Keywords: Parallel mesh generation; Advancing front technique; Parallelism; Triangulation; Recursive decomposition; Quadtree

Abstract

This work describes a technique for generating two-dimensional triangular meshes using distributed memory parallel computers, based on a master/slaves model. This technique uses a coarse quadtree to decompose the domain and a serial advancing front technique to generate the mesh in each subdomain concurrently. In order to advance the front to a neighboring subdomain, each subdomain suffers a shift to a Cartesian direction, and the same advancing front approach is performed on the shifted subdomain. This shift-and-remesh procedure is repeatedly applied until no more mesh can be generated, shifting the subdomains to different directions each turn. A finer quadtree is also employed in this work to help estimate the processing load associated with each subdomain. This load estimation technique produces results that accurately represent the number of elements to be generated in each subdomain, leading to proper runtime prediction and to a well-balanced algorithm. The meshes generated with the parallel technique have the same quality as those generated serially, within acceptable limits. Although the presented approach is two-dimensional, the idea can be easily extended to three dimensions.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

This work presents a parallel technique for generating two-dimensional triangular meshes by the advancing front method. The technique was designed to meet four requirements: to respect the input front, discretized in segments, i.e., no boundary refinement can be employed; to produce well-shaped elements, avoiding elements with poor aspect ratios; to provide good transitions among refined and coarse regions of the mesh; and to generate meshes efficiently in terms of time. The algorithm is based on a serial 2D and 3D advancing front strategy developed by the authors [3,9,10].

The first requirement is very important in many problems, such as those encountered in simulations in which the domain contains regions with different materials and/or holes. In these problems, it is often desirable that the mesh conform to an existing boundary discretization of those regions. This requirement can also be very important in problems where remeshing can be used, such as in crack growth simulations, since it allows remeshing to occur locally in a region near the crack tip. Our parallel approach can also be extended to crack growth problems.

Regarding the second requirement, although the proposed technique does not guarantee bounds on element aspect ratios, care is taken at each step to generate elements with the best possible shapes. In Section 4, it is shown that the technique is successful in meeting this requirement.

Concerning the third requirement, in many applications, the size difference between elements in a refined region and those in a coarse region is larger than two orders of magnitude. Thus, providing good transition capabilities is an important requirement in practical problems.

To achieve the fourth requirement, the proposed technique uses a distributed-memory computer architecture with a very high load-balancing capability. This kind of architecture has become widely available with the decreasing cost of cluster computers. The parallel technique presented in this work uses the master/slave parallelism paradigm and is easily expandable to three dimensions.

The remainder of this work is divided into four sections. The following section describes the related work. Section 3 describes the parallel technique developed, where a hypothetical two-dimensional model (a disk) is used to illustrate the process. Some examples are shown in Section 4, where the important measures of load estimation, speed-up and mesh quality are assessed. Finally, some conclusions are drawn in Section 5.

2. Related work

Parallel mesh generation techniques can be classified according to their domain decomposition approach. Chrisochoides [6], for instance, classifies them into discrete domain decomposition (DDD) and continuous domain decomposition (CDD). The techniques in the first approach start by serially generating a coarse mesh for an input boundary. Then, that coarse mesh is partitioned using a mesh or graph partitioning technique [14,15,27]. The external boundary and the internal interfaces are refined, and each partition can be meshed independently from the others. Some works in the literature are based on a DDD approach [7,11,12,16,20,24,26,28,30]. Although some of these works are in three dimensions, they are mentioned here because the same ideas can be employed in two-dimensional techniques.

In the CDD approach, except for partitioning a coarse mesh, any method can be used to decompose the domain. Thus, methods such as quadtree/octree [4,8,21], axis/planes [13,18,19,31], blocks [5,22,29], spatial sorting [1] or data structure partitioning [17] are among the possible choices.

The algorithm described in this work uses the CDD approach and generates interface meshes between subdomains in an a posteriori fashion, for the following reasons. First, it is not desirable to create a coarser mesh as a compulsory preceding step to a finer mesh generation, since a boundary refinement is not allowed due to the first requirement presented in Section 1; second, it is desirable to partition the geometry present in the input, and a graph or mesh partitioning technique is not usually designed for this kind of task; third, as observed in Ref. [6], an a priori generation of interface mesh between two subdomains could lead to artifacts, such as poorly-shaped elements or rough transitions between regions with small elements and regions with large elements, invalidating the second and third requirements.

In Refs. [1,5,4,17,22], a 2D Delaunay mesh is generated at the outset, or it is given as input, and refined in parallel. The refinement is performed by parallel insertion of new points in different locations of the mesh, ensuring that they do not conflict, i.e., the elements refined by the insertion of a new point will not be affected by the simultaneous insertion of other points. To ensure this, Chernikov and Chrisochoides [5] partition the given mesh using equally sized boxes and, in Ref. [4], the same authors propose a partition by quadtree. Okusanya and Peraire [22] also employ block partitioning, but use inter-process communication to avoid invalid parallel insertions. Kohout et al. [17] use data-structure partitioning and Batista et al. [1] use a spatial sorting algorithm to distribute insertion points among concurrently working threads. Both works employ shared-memory mechanisms to avoid invalid concurrent generation of new triangles. Wu et al. [29] perform a parallelization of a divide-and-conquer 2D Delaunay triangulation to account for huge data sets.

Lämmer and Burghardt [18] split the domain, given as a 2D boundary description, with a line drawn along the region's principal axis (maximum moment of inertia). This procedure is repeated recursively until the number of subdomains equals the number of processors. Then, boundary curves and interior lines are refined in segments for meshing purposes. Finally, subdomain triangular Delaunay or quadrangular meshes are generated in parallel. In Refs. [13,19,31], a plane that splits the input 3D boundary is defined. Larwood et al. [19] and Ivanov et al. [13] split the domain with an inner surface mesh on that plane. Zagaris et al. [31], however, instead of generating faces on the plane, apply an advancing front technique (AFT) only to the boundary faces that are intersected by the plane, generating a layer of tetrahedral elements that decouples the domain. In the three works, the same procedure is performed recursively, generating disconnected fronts, which are advanced in parallel in different processors.

In Refs. [8] and [21], the input domain is decomposed by an octree, and an advancing front technique is used in the subdomains. De Cougny and Shephard [8] generate the inner mesh through template meshing in a fine octree. An AFT creates the elements connecting the input boundary and the template-generated mesh, as long as they do not intercept the planes defined by the octree cells. These two tetrahedral meshing steps are performed in parallel. A three-pass inter-process communication step is used to construct a mesh connecting the generated meshes: first, between two adjacent cells (separated by a plane); then, among the cells sharing the same edge (defined by two intersecting planes); and finally, among the cells sharing the same vertex (defined by more than two intersecting planes).

In Ref. [21], an AFT is the only technique used to mesh the entire domain. After building a coarse octree, the cells on the input boundary are meshed concurrently, as long as the tetrahedra do not cross the cell. The cells suffer an empirically defined diagonal shift to reduce the front size, eliminating almost all the faces between two adjacent cells. Then, an even coarser octree is generated, and the same procedure is performed, until no more mesh can be generated.

Although there are some similarities, the work of Löhner [21] and the present work are different in several aspects. First, every subdomain in his work suffers the same fixed diagonal shift while, in the present work, a Cartesian shift is applied and varies according to the size of the subdomain. Second, the shift proposed in Löhner's work has the purpose of diminishing the number of faces in the front, and the factor that enables the AFT to continue is the creation of a coarser octree. In the present work, the subdomain shift has a greater importance in the AFT procedure, since it is responsible for advancing the front to a neighboring subdomain. Third, after performing the shifts, the work of Löhner builds a new coarser octree to decompose the current domain, while the present work uses only one such tree throughout the procedure.

The main difference between both works lies in the load estimation technique. In Löhner's work, the load is estimated by a regular partitioning of a subdomain. Since this load estimation represents a uniform mesh, the accuracy is not guaranteed because it does not account for non-uniformity of the generated mesh and, therefore, could lead to imbalances. The present work solves this problem using a fine quadtree for load estimation, because it can represent regions with coarse and fine mesh adequately. Furthermore, in the present work, load estimation guides the generation of the subdomains, instead of estimating the load after the subdomain generation.

3. Description of the parallel technique

The present technique receives as input a list of segments defining the boundary (which is the initial advancing front) of one or more objects, which may have holes. This boundary defines a domain that is decomposed by a quadtree, which is built taking into account the amount of work necessary for generating a mesh in the leaf-cells. The subdomains corresponding to the leaves of the quadtree are distributed among the slave processors, which are responsible for generating subdomain meshes.

After generating meshes in the initial subdomains, the front is updated and the decomposition quadtree cells are shifted to a Cartesian direction. The shifted subdomains are again distributed among the slave processors, which continue the meshing process, updating the front. This procedure is repeated, shifting the decomposition quadtree cells to other directions, until it is no longer possible to advance the front. The idea of shifting was used before in Ref. [21], but it is used differently here, as pointed out previously.

After the distributed mesh generation is performed by the slave processors, the master processor finalizes the mesh, generating a mesh in the remaining unmeshed regions and improving the quality of the mesh in regions between subdomains. Fig. 1 shows an overview of the parallel technique.
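To make the master-side control flow of Fig. 1 concrete, the following C++ sketch strings the phases together; every type and function name here is an illustrative stub, not the authors' implementation.

// Hypothetical master-side driver suggested by Fig. 1; all names are
// illustrative stubs, not the authors' actual implementation.
#include <vector>

struct Front {};        // current advancing front (list of boundary segments)
struct Subdomain {};    // a leaf cell of the decomposition quadtree plus its front

// Stubbed helpers standing in for Sections 3.1-3.6 of this work.
std::vector<Subdomain> decompose(const Front&) { return {}; }       // Secs. 3.1-3.2
void distributeAndMesh(std::vector<Subdomain>&, Front&) {}          // Secs. 3.3-3.4
bool shift(std::vector<Subdomain>&) { return false; }               // Sec. 3.5
void finalizeMesh(Front&) {}                                        // Sec. 3.6

void runMaster(Front front) {
    std::vector<Subdomain> subdomains = decompose(front);  // build both quadtrees
    distributeAndMesh(subdomains, front);                  // initial meshing pass
    // Shift-and-remesh cycle (+X, +Y, -X, -Y, repeated) until no progress remains.
    while (shift(subdomains))
        distributeAndMesh(subdomains, front);
    finalizeMesh(front);                                   // fill cavities, improve layers
}

int main() { runMaster(Front{}); }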

3.1. Load estimation

In high performance computing (HPC), load is a measure of the amount of work to be performed by a processor or by a group of processors. In mesh generation problems, the load is related to the number of elements that will be generated in each subdomain. Therefore, in the present work, the following issues should be taken into account in load estimation (Figs. 2 and 3):

• Load estimation depends on the discretization level of the mesh, which is usually specified by the user or by another software, through input parameters. Load is greater in subdomains with higher discretization levels (Fig. 2).
• In regions with the same discretization level, load estimation depends on the subdomain size. Load is greater in larger subdomains (Fig. 3, left).
• In subdomains of equal sizes with varying discretization levels (element size transition), load estimation depends on the number of generated elements in each subdomain (Fig. 3, right).

The mesh on the right of Fig. 2 implies a greater load than the mesh on the left of the same figure. The same discretization of the boundary was used in both cases. In Fig. 3, the load associated with the regions surrounded by solid lines is greater than the load associated with the regions surrounded by dashed lines.

In this work, in order to obtain good load estimation, an auxiliary quadtree is carefully constructed in such a way that it reflects the element size distribution within the object's domain. This implies that an overall idea about the element size distribution of the desired mesh has to be known at the outset. Fortunately, this information can be inferred from two important characteristics of the present meshing algorithm in relation to the element size distribution [3,9,10].

Fig. 1. Overview of the parallel technique (decomposition, load balancing, serial algorithm on the slaves, shifting until finished, and finalizing).

The first characteristic is that the largest elements in the interior and on the boundary of the mesh should have the same order of magnitude. This means that the maximum size for mesh elements is given using an interpolation of the boundary mesh size, which possibly is not the preferred case in some applications, but is not uncommon in practical problems, such as the ones studied by the authors. The second characteristic is that the meshing algorithm enforces good transitions between coarse and refined regions of the mesh.

Since these characteristics are related to the element sizes, they must be represented in the load-estimation quadtree. After the initial load-estimation quadtree is built, it undergoes two refinements: the first one to prevent an internal cell from being larger than the largest boundary cell, and the second one to enforce a maximum tree level difference of one between adjacent cells (this is known in the literature as 2:1 refinement). More details on how this quadtree is built can be found in Ref. [9] (the building of its 3D counterpart, an octree, can be found in Refs. [3,10]). After a load-estimation quadtree is built for the model, the total load is computed as the number of leaf-cells that are inside or on the boundary of the model domain.

A simplification of the load-estimation quadtree may be adopted: although the characteristic of having a maximum cell size for an internal cell is necessary, the number of leaf-cells of a completely internal cell C can be computed without having to refine the quadtree. This number is calculated as 4^(d−d(C)), where d is the desired depth of the tree and d(C) is the depth of C. This simplification can save significant runtime and memory, especially when the problem involves the generation of very large meshes. The 2:1 refinement has to be fully applied, since it can change the number of leaf-cells of C.
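As a minimal illustration of this shortcut, assuming only the counting rule stated above (the function name and types are hypothetical), the contribution of a completely internal cell can be computed directly:

// Sketch of the internal-cell load shortcut: a fully internal cell at depth
// d(C) would, after uniform refinement to depth d, contain 4^(d - d(C))
// leaf-cells, so that count can be used directly without refining the tree.
#include <cstdint>
#include <iostream>

std::uint64_t internalCellLoad(int treeDepth, int cellDepth) {
    std::uint64_t load = 1;
    for (int i = cellDepth; i < treeDepth; ++i)
        load *= 4;                       // each level multiplies the leaf count by 4
    return load;                         // == 4^(treeDepth - cellDepth)
}

int main() {
    // e.g., an internal cell at depth 3 in a quadtree refined to depth 7
    std::cout << internalCellLoad(7, 3) << "\n";  // prints 256
}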

In Fig. 4, a load-estimation quadtree with 2:1 refinement is illustrated. The cells considered for load estimation are the ones inside the domain (darker cells) and the ones on the boundary (filled cells). A load-estimation quadtree is constructed at the beginning of the meshing procedure, and only the associated load estimation has to be performed after every shift of the decomposition quadtree (see Section 3.5).

3.2. Domain decomposition

The domain decomposition for distribution of the mesh generation work among the processes also uses a quadtree data structure, which is guided by the load-estimation quadtree described in Section 3.1. This data structure is referred to in this work as the decomposition quadtree and is built in such a way that the load associated with each of its leaves is smaller than a predefined maximum load. This load threshold is a function of the total load and the number of available slave processes.

To determine this function, the worst case from the point of view of the load-estimation quadtree is considered. This case is the one that leads to the heaviest possible load-estimation quadtree and, thus, observes the following issues:

• All its leaf-cells are classified as either inside or on the boundary of the domain, because leaf-cells outside the domain are not counted to estimate the load.
• The quadtree is full, because a quadtree with leaf-cells of different depths has fewer leaf-cells than a quadtree leveled by its largest depth.

This load-estimation quadtree is the one that represents a square-shaped domain, upon which a uniform mesh would be generated.

If L represents the total load, ideally N subdomains would have a load of L/N each. This would lead to a regular subdivision of the decomposition quadtree into a √N × √N grid. It is desirable that the number of subdomains covering the input boundary be, at least, equal to the number of slave processes P. This number of subdomains is, in the square-shaped case, 4(√N − 1), calculated as the number of subdomains per side (4 sides) minus the subdomains counted twice (4 corners). Setting P = 4(√N − 1) leads to a load per subdomain of, at most, L/(P/4 + 1)².

Fig. 2. Examples of different mesh discretization levels: less refined mesh (left) and more refined mesh (right).

Fig. 3. Subdomains with same discretization levels and different sizes (left) and subdomains with equal sizes and varying discretization levels (right).

Fig. 4. Load-estimation quadtree.

Fig. 5. Example of a decomposition quadtree.

The decomposition quadtree starts by setting its root to the bounding square of the domain, and by setting its load to the total estimated load. If this cell's load is greater than the load threshold, the cell is subdivided into four equal-sized cells. The procedure is repeated recursively for the cell's children until the load of a cell does not exceed the established load threshold (Fig. 5). Each leaf of the decomposition quadtree that intersects the model's boundary or that is inside the model is considered a subdomain.
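A hedged sketch of this threshold-driven subdivision is given below; the cell type, the load-estimation callback and all names are assumptions for illustration, with the threshold L/(P/4 + 1)² taken from the square-domain argument above.

// Sketch of the decomposition-quadtree subdivision driven by a load threshold.
// Cell, estimateLoad and the dummy estimator are illustrative, not the authors' code.
#include <array>
#include <functional>
#include <memory>

struct Box { double xmin, ymin, xmax, ymax; };

struct Cell {
    Box box;
    std::array<std::unique_ptr<Cell>, 4> children;  // empty -> leaf (a subdomain)
};

// Maximum load per subdomain for total load L and P slave processes:
// L / (P/4 + 1)^2, as derived for the worst-case square domain.
double loadThreshold(double totalLoad, int slaveProcesses) {
    const double n = slaveProcesses / 4.0 + 1.0;
    return totalLoad / (n * n);
}

// Recursively split a cell while its estimated load exceeds the threshold.
// estimateLoad counts the load-estimation quadtree leaves inside the box.
void subdivide(Cell& cell, double threshold,
               const std::function<double(const Box&)>& estimateLoad) {
    if (estimateLoad(cell.box) <= threshold) return;          // becomes a subdomain
    const double xm = 0.5 * (cell.box.xmin + cell.box.xmax);
    const double ym = 0.5 * (cell.box.ymin + cell.box.ymax);
    const Box quadrants[4] = { {cell.box.xmin, cell.box.ymin, xm, ym},
                               {xm, cell.box.ymin, cell.box.xmax, ym},
                               {cell.box.xmin, ym, xm, cell.box.ymax},
                               {xm, ym, cell.box.xmax, cell.box.ymax} };
    for (int i = 0; i < 4; ++i) {
        cell.children[i] = std::make_unique<Cell>(Cell{quadrants[i], {}});
        subdivide(*cell.children[i], threshold, estimateLoad);
    }
}

int main() {
    Cell root{{0.0, 0.0, 1.0, 1.0}, {}};
    // Dummy estimator: uniform load proportional to the cell area.
    auto estimate = [](const Box& b) {
        return 1.0e4 * (b.xmax - b.xmin) * (b.ymax - b.ymin);
    };
    subdivide(root, loadThreshold(1.0e4, 8), estimate);  // 8 slave processes
}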


Fig. 6. Advancing front modifications: segments crossing the bounding box (left) and search region crossing the bounding box (right).

Fig. 7. Front and bounding box of a subdomain: received front (left) and new front (right).

Fig. 8. Mesh of a subdomain: unimproved mesh (left) and improved mesh (right).

Fig. 9. New cell classification of the load-estimation quadtree.

Fig. 10. Original front, updated front, and modified load-estimation quadtree.


The load associated with a given cell of the decomposition quadtree is estimated as the number of leaf-cells of the load-estimation quadtree that are not outside the domain and are inside that cell of the decomposition quadtree.

The segments of the input front are distributed among the existing subdomains. Thus, if the two vertices of a segment are inside a subdomain, the segment is said to belong exclusively to that subdomain. However, if the segment crosses two or more subdomains, or touches the boundaries of some subdomains, that segment is said to belong to all of the crossed and touched subdomains.
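The assignment rule can be sketched as follows, assuming axis-aligned subdomain boxes and a deliberately conservative crossing test (all names are illustrative, not the authors' code).

// Sketch of distributing front segments among subdomains: a segment with both
// vertices strictly inside one box belongs only to that subdomain; otherwise it
// is assigned to every subdomain box it crosses or touches.
#include <algorithm>
#include <iostream>
#include <vector>

struct Point { double x, y; };
struct Segment { Point a, b; };
struct Box { double xmin, ymin, xmax, ymax; };

bool strictlyInside(const Point& p, const Box& b) {
    return p.x > b.xmin && p.x < b.xmax && p.y > b.ymin && p.y < b.ymax;
}

// Conservative test: true if the segment's bounding box overlaps the cell box.
// (A production code would use an exact segment/box intersection test.)
bool crossesOrTouches(const Segment& s, const Box& b) {
    const double sxmin = std::min(s.a.x, s.b.x), sxmax = std::max(s.a.x, s.b.x);
    const double symin = std::min(s.a.y, s.b.y), symax = std::max(s.a.y, s.b.y);
    return sxmax >= b.xmin && sxmin <= b.xmax && symax >= b.ymin && symin <= b.ymax;
}

// Returns the indices of the subdomains that receive the segment.
std::vector<int> assign(const Segment& s, const std::vector<Box>& subdomains) {
    for (int i = 0; i < static_cast<int>(subdomains.size()); ++i)
        if (strictlyInside(s.a, subdomains[i]) && strictlyInside(s.b, subdomains[i]))
            return {i};                                // exclusive ownership
    std::vector<int> owners;
    for (int i = 0; i < static_cast<int>(subdomains.size()); ++i)
        if (crossesOrTouches(s, subdomains[i]))
            owners.push_back(i);                       // shared among all crossed boxes
    return owners;
}

int main() {
    std::vector<Box> subdomains = { {0, 0, 1, 1}, {1, 0, 2, 1} };
    Segment s{{0.5, 0.5}, {1.5, 0.5}};                   // crosses the shared edge at x = 1
    std::cout << assign(s, subdomains).size() << "\n";   // prints 2 (shared segment)
}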

The use of a quadtree for domain decomposition and of a different quadtree for load estimation is important for the following reasons. First, these quadtrees are completely independent, i.e., it is possible to use any load-estimation method together with a domain decomposition quadtree, or to use a load-estimation quadtree in combination with any domain decomposition technique. In spite of this independence, in Section 4, it is shown that using both quadtrees together leads to a domain decomposition in which the loads for meshing the subdomains are well estimated. Second, since the load-estimation quadtree is usually very refined, it would not be a good decomposition quadtree. If it were used to decompose the domain, the load associated with each subdomain would be too low, and very small speed-up would be observed, if any. Third, in the shifting procedure (Section 3.5), the load in each subdomain must be re-estimated and, therefore, the load-estimation quadtree must be modified; however, this modification does not make sense in the context of domain decomposition. Thus, the decomposition quadtree is kept topologically the same throughout the present work.

3.3. Load balancing

In this work, the subdomains are sent to the slave processes on demand. Initially, the master process retains all subdomains. A slave process requests a subdomain from the master process, and, as soon as its request is granted, the slave process starts its mesh generation work, while the master process awaits another request. When a slave process finishes working on a subdomain, it sends the master another request. Once all the subdomains are sent to the slave processes, the master awaits the results.

It is important to mention that only subdomains that have front segments are sent to slave processes. Remaining subdomains are kept in the master process, to avoid unnecessary communication, and will be considered when the front reaches them during the shifting procedure (Section 3.5).
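A minimal sketch of this on-demand scheme with MPI point-to-point calls is shown below; the tags, the use of a plain integer as a subdomain identifier, and the stop marker are simplifying assumptions, not the paper's actual protocol.

// Sketch of on-demand load balancing with MPI: slaves request work, the master
// hands out one subdomain id per request and finally sends a stop marker.
#include <mpi.h>
#include <vector>

const int TAG_REQUEST = 1, TAG_WORK = 2;
const int STOP = -1;

void master(int numSlaves, std::vector<int> subdomainIds) {
    int pending = numSlaves;
    while (pending > 0) {
        int dummy;
        MPI_Status status;
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                 MPI_COMM_WORLD, &status);                       // a slave asks for work
        int work = STOP;
        if (!subdomainIds.empty()) { work = subdomainIds.back(); subdomainIds.pop_back(); }
        else                       { --pending; }                // no work left: stop that slave
        MPI_Send(&work, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
    }
}

void slave() {
    for (;;) {
        int dummy = 0, work;
        MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);  // ask the master
        MPI_Recv(&work, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (work == STOP) break;
        // ... generate the mesh for subdomain `work` with the serial AFT ...
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) master(size - 1, {0, 1, 2, 3, 4, 5});  // six example subdomains
    else           slave();
    MPI_Finalize();
}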

3.4. Mesh generation by a slave process

Each slave process is responsible for mesh generation in a subdomain, which is defined by a part of the advancing front and by a bounding box. The local meshes (submeshes) in these subdomains can be generated independently from one another, using, in each process, the serial AFT developed in Refs. [3,9,10] by some of the coauthors of this work. Geometric tree data structures are used to speed up the search of candidate vertices for generating a new triangle and the search of possible intercepting edges, ensuring a fast execution of the mesh generation procedure.

Fig. 11. Mesh generated, scattered among the slave processes, and decomposition quadtree cells shifted to the +X direction.

Fig. 12. Remaining cavities (left) and final mesh (right).

Fig. 13. Input boundaries: Key model [25] (left), Cylinder model (center) and Plate model [23] (right).

In order to ensure that no triangle is generated outside the limits of a subdomain, the following conditions were added to the serial AFT:

1. A segment crossing the bounding box of the subdomain is not used to advance the front (Fig. 6, left).

2. A segment that is strictly inside the bounding box of the subdomain is not used to advance the front if any valid well-shaped triangle formed with it crosses the bounding box. In other words, consider the search region for the placement of a new vertex that, together with the given segment, forms a valid triangle. If that search region, a circle in the present work, crosses the bounding box, then the segment is not used to advance the front (Fig. 6, right).
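Assuming a circular search region as described above (its center and radius below are placeholders, not the serial AFT's actual rule), the two conditions can be sketched as follows.

// Sketch of the two extra conditions imposed on the serial AFT inside a
// subdomain: a segment is skipped if it crosses the bounding box, or if the
// circular search region for the new vertex crosses the bounding box.
#include <cmath>

struct Point { double x, y; };
struct Segment { Point a, b; };
struct Box { double xmin, ymin, xmax, ymax; };

bool insideBox(const Point& p, const Box& b) {
    return p.x >= b.xmin && p.x <= b.xmax && p.y >= b.ymin && p.y <= b.ymax;
}

// Condition 1 (simplified): a segment "crosses" the box if its endpoints are
// not both inside it.
bool segmentCrossesBox(const Segment& s, const Box& b) {
    return !insideBox(s.a, b) || !insideBox(s.b, b);
}

// Condition 2: the search circle (here: centered at the segment midpoint, with
// a radius proportional to the segment length) must lie entirely inside the box.
bool searchCircleCrossesBox(const Segment& s, const Box& b, double radiusFactor) {
    const Point mid{0.5 * (s.a.x + s.b.x), 0.5 * (s.a.y + s.b.y)};
    const double len = std::hypot(s.b.x - s.a.x, s.b.y - s.a.y);
    const double r = radiusFactor * len;
    return mid.x - r < b.xmin || mid.x + r > b.xmax ||
           mid.y - r < b.ymin || mid.y + r > b.ymax;
}

// A front segment may be used to advance the front only if neither check fires.
bool canAdvance(const Segment& s, const Box& b, double radiusFactor = 1.0) {
    return !segmentCrossesBox(s, b) && !searchCircleCrossesBox(s, b, radiusFactor);
}

int main() {
    const Box box{0, 0, 10, 10};
    const Segment nearBoundary{{0.5, 0.5}, {1.5, 0.5}};  // too close to the boundary
    const Segment wellInside{{4.5, 5.0}, {5.5, 5.0}};    // safely inside
    return canAdvance(wellInside, box) && !canAdvance(nearBoundary, box) ? 0 : 1;
}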


Fig. 14. Subdomains generated with 8 processes: Key (left), Cylinder (center) and Plate (right).

Fig. 15. Meshes generated by different processes.


Each slave process tries to advance the front as much as possible. For example, Fig. 7 shows the received front (left) of a slave processor and the updated front (right). In this process, a submesh is generated (Fig. 8, left), upon which a mesh improvement is applied (Fig. 8, right). This improvement is a combination of a Laplacian smoothing and a backtracking mesh optimization procedure, and more details can be found in Refs. [2,3,9,10]. Notice that the updated front cannot change when smoothing or any other optimization technique is performed, because there is not enough adjacency information for its segments or vertices. Only the internal vertices of the submesh can be optimized.

The received front, except for the segments that did not advance, together with the updated front form one or several polygons (Fig. 7, right). These polygons are used by the slave process in order to reclassify the cells of the load-estimation quadtree according to the updated domain, which is the region still not meshed. Thus, the cells that are strictly inside the polygons are considered to be outside the updated domain; and the cells that cross the received front are also considered to be outside the updated domain. The cells crossing the updated front are determined, and the remaining cells retain their classification. In Fig. 9, the darker cells are the ones inside the updated domain, the filled cells lie on the updated front, and the lighter cells lie outside the updated domain.

Fig. 16. Meshes generated by different processes and the master process (black region).

This classification will be used by the master process in the next step of the proposed distributed meshing algorithm, along with the updated front. The master process stores the updated front information and manages the overall classification of cells of the load-estimation quadtree. However, the mesh generated by each slave process is maintained in its own memory.

3.5. Shifting procedure

Once the master process receives the reclassification of the load-estimation quadtree and the updated front for all subdomains, it updates the overall front information, inserting newly appearing segments and removing the segments that were advanced. In addition, the load-estimation quadtree is refined to conform to the updated front. The part of the quadtree representing the region strictly outside the updated front may be pruned (Fig. 10).

After that, every leaf-cell of the decomposition quadtree is shifted by half its size to a Cartesian direction, for example, the positive X direction (+X).

This shifting procedure makes it possible to advance the front and to generate mesh in regions where that was not possible considering the non-shifted decomposition quadtree. The shift of a subdomain is performed in such a way that its new location is not too distant from its original position, and yet it will significantly advance the front and generate mesh in unmeshed regions.

Notice that two shifted decomposition cells must not overlap; otherwise, intercepting elements would be generated in different subdomains. Therefore, if a cell has two or more neighboring cells in a shifting direction, this cell is distorted to a rectangular shape so that this non-overlapping restriction is satisfied (dark-filled cells in Fig. 11). The edge of the subdomain shared by these neighboring cells suffers the same shift as the smallest neighboring cell, i.e., the one with the greatest depth in the decomposition quadtree structure.



Fig. 17. Runtime estimation of subdomains: Key (top), Cylinder (center) and Plate (bottom).


As can be seen in Fig. 11, there may be gaps between subdomains, due to the cell shift, that do not interfere with the execution of the procedure. There would be a small gain if some subdomains adjacent to a gap were also largely distorted to fill it, and this procedure can be expensive in some situations. Moreover, the primary intention of the present work is to avoid changes in the shape of the decomposition quadtree cells as much as possible. Therefore, the distortion is applied only to avoid overlapping of subdomains.

After shifting the decomposition quadtree cells, the master process divides the front segments among them, recalculates their load based on the updated load-estimation quadtree, and sends this information to the slave processes in order to advance the front and generate additional mesh (Sections 3.2, 3.3 and 3.4).

The shifting is performed sequentially in the following Cartesian directions: to the right (+X), upwards (+Y), to the left (−X), and downwards (−Y). After that, the decomposition quadtree returns to its original position, and the cycle is restarted if there is at least one cell still unmeshed. Therefore, when every cell has been meshed in every possible direction, the shifting procedure is finished.
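The cycle of Cartesian shifts can be sketched as follows, assuming each leaf cell is shifted by half its own size and that the meshing and termination tests are stubs (names are illustrative, not the authors' code).

// Sketch of the shift-and-remesh cycle: each decomposition leaf is shifted by
// half its size in +X, +Y, -X, -Y, then returned to its original position, and
// the cycle repeats while any cell remains unmeshed.
#include <vector>

struct Box { double xmin, ymin, xmax, ymax; };

void shiftBox(Box& b, double dx, double dy) {
    b.xmin += dx; b.xmax += dx; b.ymin += dy; b.ymax += dy;
}

// Stand-ins for the real work: advance the front inside the shifted cells and
// tell whether any cell is still unmeshed.
void distributeAndMesh(std::vector<Box>&) {}
bool anyCellUnmeshed(const std::vector<Box>&) { return false; }

void shiftingCycle(std::vector<Box> cells) {
    // Shift directions as fractions of each cell's own size: +X, +Y, -X, -Y.
    const double dirs[4][2] = { {+0.5, 0.0}, {0.0, +0.5}, {-0.5, 0.0}, {0.0, -0.5} };
    do {
        for (const auto& d : dirs) {
            std::vector<Box> shifted = cells;                // keep originals intact
            for (Box& b : shifted)
                shiftBox(b, d[0] * (b.xmax - b.xmin), d[1] * (b.ymax - b.ymin));
            // (Overlapping shifted cells would be distorted here, as in Fig. 11.)
            distributeAndMesh(shifted);                      // Sections 3.2-3.4 again
        }
    } while (anyCellUnmeshed(cells));                        // restart the cycle if needed
}

int main() { shiftingCycle({{0, 0, 1, 1}, {1, 0, 2, 1}}); }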

Notice that, if a decomposition cell is shifted to a direction where it has no neighboring cell, it will not generate any mesh. Moreover, if a decomposition cell shifted to a direction (+Y, for example) generates mesh, its neighboring cells in that direction do not need to shift to the opposite direction in the future (−Y, for example), because the regions between these cells have already been meshed. This is not always true, since, when a small cell is shifted towards a larger cell, the shift of the larger cell in the reverse direction may still generate mesh. These basic checks can save computation as well as communication time.

After some shifting cycles, as the front advances towards the interior of the domain, the number of subdomains might get smaller than the number of slave processes. However, if the decomposition quadtree were modified to keep up with the number of slave processes, at some point, the subdomains would be too small, so that the overhead for the inter-process communication would be too much for too little work performed by the slave processes. From this moment on, the use of parallel processing would not be beneficial. Thus, at this point, the modified decomposition quadtree would need to have fewer subdomains than the number of processes, which is equivalent to not modifying the decomposition quadtree, as pointed out in Section 3.2. This might leave some regions not meshed, which are treated in the mesh finalizing phase.

3.6. Mesh finalizing

After the shifting cycle, the parallel part of the mesh generation finishes. However, cavities without mesh may still exist in the interior of the domain (Fig. 12, left). When that is the case, the master process finalizes the mesh generation procedure, applying a standard AFT in order to fill the cavities with triangles.

Fig. 18. Estimation of number of elements and vertices for subdomains: Key (top), Cylinder (center) and Plate (bottom).

Next, the master process improves the parts of the mesh that were not improved by the slave processes, using the same smoothing/backtracking optimization procedures. Those parts consist of front segments passed by the master process along with certain layers of elements adjacent to these segments.

Layer 0 consists of the front segments themselves, and layer N comprises the elements present in layer N − 1 plus their adjacent elements. These layers are stored in the memory of the slave processes, and should be gathered by the master process. It was verified [12] that two layers of elements are enough for a good mesh. After this step, the mesh is completely generated (Fig. 12, right), distributed in the slave and master memories. Thus, if one wishes, the master process can join the mesh pieces properly.
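The layer construction can be sketched as a breadth-first expansion over an assumed element-adjacency map (layer 0 is represented here by the elements touching the passed front segments; two layers are gathered, as in the text).

// Sketch of the layer construction used before the final smoothing: starting
// from the elements touching the passed front segments, each pass adds the
// neighbors of the previous layer. The adjacency map is an assumed input.
#include <set>
#include <vector>

using ElementId = int;
using Adjacency = std::vector<std::vector<ElementId>>;  // element -> adjacent elements

std::set<ElementId> collectLayers(const std::set<ElementId>& layer0,
                                  const Adjacency& adjacentTo, int numLayers) {
    std::set<ElementId> collected = layer0;
    std::set<ElementId> frontier = layer0;
    for (int layer = 1; layer <= numLayers; ++layer) {
        std::set<ElementId> next;
        for (ElementId e : frontier)
            for (ElementId n : adjacentTo[e])
                if (collected.insert(n).second)   // newly reached element
                    next.insert(n);
        frontier = std::move(next);
    }
    return collected;                             // elements sent to the master
}

int main() {
    // Tiny example: a strip of four elements 0-1-2-3; start from element 0.
    Adjacency adj = { {1}, {0, 2}, {1, 3}, {2} };
    return static_cast<int>(collectLayers({0}, adj, 2).size());  // elements 0,1,2 -> 3
}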

4. Examples

The technique described in this work was implemented in C++, using MPI for interprocess communication. The computer used to run the tests was a cluster computer where 10 nodes were available for use by the authors. Each node had two six-core Intel® Xeon™ processors (2.66 GHz) and 24 GB RAM. To avoid memory contention, only one process per node was used in the test runs.

4.1. Meshes

Fig. 13 shows the input boundaries used for the parallel mesh generation procedure. Each segment of the boundaries was refined to make the use of high performance computing necessary. The mesh generation subdomains used with 8 processes are shown in Fig. 14. Each color indicates a different process, and the black color is reserved for the master process (see Fig. 16 for a highlight of a region of the mesh generated by the master process).

Skinny colored regions located between two subdomains, as seen in Fig. 14, are due to the shifting procedure. After two neighboring subdomains are processed in the slave processes, a narrow unmeshed region lies between them. Mesh in that region is generated during the shifting cycle. If the task of generating mesh for the shifted domain is assigned to one of the processes that generated mesh in the neighboring subdomains, the colors of the newly generated elements are the same as those of the subdomain meshed by the assigned process. Otherwise, the elements receive a different color.

Figs. 15 and 16 highlight some interesting areas of the Key model, in order to show how the meshes generated in parallel fit together neatly. Fig. 15 highlights a region between two subdomains. As can be seen, the shape and size functions of the generated triangles were respected, even when they were generated by different processes in different subdomains. Fig. 16 shows the same features in a region generated by the master process, in the finalizing step. The sizes of the generated meshes were approximately 390, 320 and 490 thousand elements, respectively.

Fig. 19. Error between estimated load and actual runtime for subdomains: Key (top), Cylinder (center) and Plate (bottom).

4.2. Load estimation

In order to show that the load estimation described in Section 3.1 is accurate, the runtime taken to generate the mesh in each subdomain was evaluated and compared to the estimated load for that subdomain. Fig. 17 compares the runtime of each step in the mesh generation procedure and the estimated runtime for some subdomains. Representing all subdomains would not fit in the chart and is not necessary for this comparison. A subdomain may appear repeated, but, each time, it corresponds to a different shifted position.

In Fig. 17, the steps search tree building, advancing front procedure and mesh improvement belong to the employed serial mesh generation procedure [3,9,10], and the tree classification step was described in Section 3.4. The overhead accounts for the extra time wasted doing unproductive work.

The polylines in Fig. 17 represent the estimated runtime, which was calculated as the load scaled by the factor t̄/l̄, where t̄ is the average runtime and l̄ is the average load estimated for the subdomains. Since only the proportion among the loads is necessary for load balancing, this fixed scaling factor can be applied for evaluation purposes. Fig. 18 shows the same comparisons, but considering the number of elements and the number of vertices instead of the runtime. As can be seen, the estimation is accurate.
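For illustration only, the scaling can be reproduced with assumed loads and runtimes (none of these numbers come from the paper).

// Sketch of the runtime estimate used in Fig. 17: each subdomain's estimated
// runtime is its load scaled by the ratio of average runtime to average load.
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Assumed example data: estimated loads and measured runtimes per subdomain.
    const std::vector<double> load    = {1200, 800, 1500, 950};
    const std::vector<double> runtime = {2.4, 1.7, 3.1, 2.0};   // seconds

    const double avgLoad = std::accumulate(load.begin(), load.end(), 0.0) / load.size();
    const double avgTime = std::accumulate(runtime.begin(), runtime.end(), 0.0) / runtime.size();
    const double scale = avgTime / avgLoad;                     // the factor t-bar / l-bar

    for (std::size_t i = 0; i < load.size(); ++i)
        std::cout << "subdomain " << i << ": estimated "
                  << load[i] * scale << " s, actual " << runtime[i] << " s\n";
}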

Fig. 19 shows the error between the estimated runtime and the actual runtime for each subdomain, in percentage. A positive error means that the actual runtime was less than the estimated runtime. A negative error means that the actual runtime was greater than the estimated runtime. The absolute error, therefore, measures how different the estimated runtime is from the actual runtime. The maximum absolute error shows the worst runtime estimation. The figure also shows the average and the standard deviation of the absolute error.

Table 1 shows the average and standard deviation values for the difference between the actual and estimated runtime (R̄ and σ(R)). This same analysis was performed for the number of vertices (V̄ and σ(V)) and the number of elements (Ē and σ(E)) generated in a subdomain. These values show that the runtime error is usually around 10.0%. For the number of vertices, the error is usually less than 6.0% and, for elements, the error is less than 5.5%. All these data show that the employed method, based on a quadtree, provides a good prediction of the load of a subdomain.

4.3. Runtime and speed-up

The charts in Fig. 20 show the runtime along with the speed-up reached by the implementation of the proposed technique. The speed-up is a metric that tells how much faster a parallel implementation is than its serial counterpart, and it is calculated as Ts/Tn, where Ts is the time taken for the serial implementation to finish and Tn is the time taken for the parallel implementation running with n processors to finish. Ideally, a parallel implementation would have a linear speed-up, meaning that n processors make it n times faster. In practice, this is difficult to achieve, due to inevitable serial portions present in a parallel algorithm.

Table 1. Load estimation error statistics.

Absolute error    Key (%)    Cylinder (%)    Plate (%)
Runtime
  R̄                5.11       6.37            5.34
  σ(R)             3.40       4.43            5.07
Vertices
  V̄                2.00       3.45            2.46
  σ(V)             1.46       2.80            2.12
Elements
  Ē                0.55       2.87            1.49
  σ(E)             0.43       2.47            1.52

As seen in Fig. 20, the implementation presented a reasonably good speed-up. Figs. 21 and 22 depict the absolute and relative runtime of each phase described in Section 3. During the mesh generation by slave processes phase, the master process remains idle, awaiting requests or results from the slave processes. All the other phases happen serially in the master process.

In Fig. 21, note that the runtime for the load-estimation quadtree phase remains constant regardless of the number of slave processes. However, the runtimes of the synchronization phases (domain decomposition, load balancing and shifting procedure) as well as the runtime of the mesh finalizing phase grow with the number of subdomains, which increases with the number of processes. Thus, those phases may have relatively long runtimes (Fig. 22) for a larger number of processes.

If one considers only the runtime for the meshing-related phases (mesh generation by slave processes and mesh finalizing), without taking into account the synchronization phases, the slave processes, as expected, are the ones that perform most of the job. That can be observed in the plots of absolute and percentage runtimes, for the meshing phases only, shown in Figs. 23 and 24. Those plots also show that the meshing phases, which are the most important ones, achieve reasonably good runtimes.

Notice that the percentage of the runtime of the mesh generation by slave processes phase in the Cylinder model is larger than in the other models. That happens due to the wider space inside the model available for the mesh generation. In the other, narrower models, the load-estimation quadtree might take more time (Fig. 22), since it is performed serially in the master process in the current implementation. Moreover, in the mesh finalizing phase, the improvement of the element layers is the part that spends most of the time, because it is performed serially in the master process as well. Therefore, these phases are candidates for optimization and parallelization efforts in the future. Nonetheless, notice in Fig. 20 that the runtimes have a fairly good decay, which is an advantage in practical situations.

Fig. 20. Runtime and speed-up: Key (top), Cylinder (center) and Plate (bottom).

4.4. Quality

To demonstrate that the meshes generated by the described technique have good quality, the metric used was α = 2Ri/Rc, where Ri and Rc are the radii of the inscribed and circumscribed circles, respectively. This metric α has value 1.0 for an equilateral triangle. The worse the quality of an element is, the closer the value of α gets to 0.0. Elements with α ≤ 0.1 are considered to have very poor quality, while elements with α ≥ 0.7 have good quality.
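Using the standard formulas Ri = A/s (area over semiperimeter) and Rc = abc/(4A) for a triangle with side lengths a, b, c and area A, the metric can be computed as in the sketch below (not taken from the authors' code).

// Sketch of the quality metric alpha = 2 * Ri / Rc for a triangle, using the
// standard inradius (area/semiperimeter) and circumradius (abc / 4*area) formulas.
#include <cmath>
#include <iostream>

struct Point { double x, y; };

double alpha(const Point& p0, const Point& p1, const Point& p2) {
    const double a = std::hypot(p1.x - p0.x, p1.y - p0.y);
    const double b = std::hypot(p2.x - p1.x, p2.y - p1.y);
    const double c = std::hypot(p0.x - p2.x, p0.y - p2.y);
    const double s = 0.5 * (a + b + c);                              // semiperimeter
    const double area = std::sqrt(s * (s - a) * (s - b) * (s - c));  // Heron's formula
    const double ri = area / s;                                      // inscribed radius
    const double rc = a * b * c / (4.0 * area);                      // circumscribed radius
    return 2.0 * ri / rc;                                            // 1.0 for equilateral
}

int main() {
    std::cout << alpha({0, 0}, {1, 0}, {0.5, std::sqrt(3.0) / 2}) << "\n";  // ~1.0
    std::cout << alpha({0, 0}, {1, 0}, {0.5, 0.05}) << "\n";                // sliver, close to 0
}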

The charts in Fig. 25 show the quality of the meshes generated by several executions of the algorithm. Due to the inherent non-determinism of parallel algorithms, different executions of a given program with the same input and the same parameters may generate different outputs. Therefore, it is necessary to analyze the quality of all the meshes generated.

As can be seen in Fig. 25, in all cases, the generated meshes present very good quality. Only a small percentage of the elements has poor quality. Fig. 26 indicates how much the meshes generated in parallel deviate from the mesh generated serially. For the Key, at most 0.10% of the elements had different α values, compared to the serial mesh. The meshes of the Cylinder and of the Plate presented differences of 0.32% and 0.10%, respectively.

This shows that the parallel technique described in this work generates a mesh of quality approximately equal to that of the serially generated mesh, which is a very good characteristic, considering the quality improvements performed serially [3,9,10]. Also notice that a greater number of subdomains usually leads to a greater difference, even though this difference was just around 0.3% of the elements of the mesh.

Fig. 21. Runtime for each phase: Key (left), Cylinder (center) and Plate (right).

Fig. 22. Percentage of runtime for each phase: Key (top), Cylinder (center) and Plate (bottom).

Fig. 23. Runtime for meshing phases: Key (left), Cylinder (center) and Plate (right).

Fig. 24. Percentage of runtime for meshing phases: Key (left), Cylinder (center) and Plate (right).

Fig. 25. Quality of the meshes: Key (top), Cylinder (center) and Plate (bottom).

5. Conclusions

This work presented an Advancing Front Technique (AFT) for generating meshes in parallel using a master/slaves model. The estimation of the load for distribution of the meshing work is performed by a quadtree that is built for the given input boundary, which is defined by a sequence of mesh refinement segments. The domain is decomposed into several square subdomains, also based on a quadtree, which takes into consideration the load and the number of slave processes in the parallel system. These subdomains are sent to the slaves, which generate the mesh employing a serial mesh generation algorithm for each subdomain.

After that, the resulting front is sent back to the master process, which will shift the decomposition quadtree cells to a Cartesian direction, repartition the updated front, and send the newly placed subdomains back to the slave processes, which, once again, will generate mesh serially. This shift-and-remesh procedure continues until no more mesh can be generated, shifting the decomposition quadtree cells to different directions each turn.

The present implementation achieved a fairly good speed-up, supported by the accurate load estimation used for efficient load balancing, and maintained the quality of the generated mesh with respect to the serially generated mesh.

Some points are still of interest to this research. The amount of serial work performed by the finalizing step could be reduced by employing shifts in other directions, to get rid of cavities such as those shown on the left of Fig. 12, and by applying the improvement of the layers of elements in parallel, taking more advantage of the Cartesian shifts.

Fig. 26. Difference of the number of elements in each quality range, in percentage: Key (top), Cylinder (center) and Plate (bottom).

It is also possible to remove the global synchronization present in the shifting of the decomposition quadtree. To shift a cell to a direction, it is necessary for a subdomain to know only its neighbors in that direction, along with their front segments. Therefore, a local shift can be employed only among the involved subdomains.

The idea presented in this work can also be naturally extended to three-dimensional mesh generation, and the authors are currently working on a 3D version of this technique.

Acknowledgements

The first author would like to thank the Brazilian agency CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) for the fellowship granted to him (Process 2823-12-8). The third author acknowledges the support from the Brazilian agencies CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), through the Research Productivity Grant 305596/2010-1, and CAPES, through the grant BEX 6881/12-2. All the authors would like to thank CENAPAD-UFC (Centro Nacional de Computação de Alto Desempenho - UFC) for granting access to its computational resources.

References

[1] Batista VHF, Millman DL, Pion S, Singler J. Parallel geometric algorithms for multi-core computers. Comput Geometry: Theory Appl 2010;43(8):663–77.

[2] Cavalcante-Neto JB, Martha LF, Wawrzynek PA, Ingraffea AR. A back-tracking procedure for optimization of simplex meshes. Commun Numer Methods Eng 2005;21(12):711–22.

[3] Cavalcante-Neto JB, Wawrzynek PA, de Carvalho MTM, Martha LF, Ingraffea AR. An algorithm for three-dimensional mesh generation for arbitrary regions with cracks. Eng Comput 2001;17(1):75–91.

[4] Chernikov AN, Chrisochoides NP. Parallel 2D graded guaranteed quality Delaunay mesh refinement. In: Proceedings of the 14th international meshing roundtable, Sandia National Laboratory, San Diego, United States; 2005.

[5] Chernikov AN, Chrisochoides NP. Parallel guaranteed quality Delaunay uniform mesh refinement. SIAM J Sci Comput 2006;28(5):1907–26.

[6] Chrisochoides NP. Parallel mesh generation. Numerical solution of partial differential equations on parallel computers, vol. 51. Springer-Verlag; 2005. p. 237–59.

[7] Chrisochoides NP, Nave D. Simultaneous mesh generation and partitioning for Delaunay meshes. Math Comput Simul 2000;54(4–5):321–39.

[8] De Cougny HL, Shephard MS. Parallel volume meshing using face removals and hierarchical repartitioning. Comput Methods Appl Mech Eng 1999;174(3–4):275–98.

[9] de Oliveira Miranda AC, Cavalcante-Neto JB, Martha LF. An algorithm for two-dimensional mesh generation for arbitrary regions with cracks. In: SIBGRAPI '99: proceedings of the XII Brazilian symposium on computer graphics and image processing. IEEE Computer Society; 1999. p. 29–38.

[10] de Oliveira Miranda AC, Martha LF, Wawrzynek PA, Ingraffea AR. Surface mesh regeneration considering curvatures. Eng Comput 2009;25(2):207–19.

[11] Hodgson DC, Jimack PK. Efficient parallel generation of partitioned, unstructured meshes. Adv Eng Softw 1996;27(1–2):59–70.


[12] Ito Y, Shih AM, Erukala AK, Soni BK, Chernikov AN, Chrisochoides NP, et al. Parallel unstructured mesh generation by an advancing front method. Math Comput Simul 2007;75(5–6):200–9.

[13] Ivanov EG, Andrä H, Kudryavtsev AN. Domain decomposition approach for automatic parallel generation of tetrahedral grids. Comput Methods Appl Math 2006;6(2):178–93.

[14] Karypis G, Kumar V. METIS: a software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices, version 4.0. University of Minnesota; September 1998.

[15] Karypis G, Schloegel K, Kumar V. ParMETIS: parallel graph partitioning and sparse matrix ordering library, version 3.2. University of Minnesota; April 2011.

[16] Khan AI, Topping BHV. Parallel adaptive mesh generation. Comput Syst Eng 1991;2(1):75–101.

[17] Kohout J, Kolingerová I, Zára J. Parallel Delaunay triangulation in E2 and E3 for computers with shared memory. Parallel Comput 2005;31(5):491–522.

[18] Lämmer L, Burghardt M. Parallel generation of triangular and quadrilateral meshes. Adv Eng Softw 2000;31(12):929–36.

[19] Larwood BG, Weatherill NP, Hassan O, Morgan K. Domain decomposition approach for parallel unstructured mesh generation. Int J Numer Methods Eng 2003;58:177–88.

[20] Linardakis L, Chrisochoides NP. Delaunay decoupling method for parallel guaranteed quality planar mesh refinement. SIAM J Sci Comput 2006;27(4):1394–423.

[21] Löhner R. A parallel advancing front grid generation scheme. Int J Numer Methods Eng 2001;51(6):663–78.

[22] Okusanya T, Peraire J. Parallel unstructured mesh generation. In: Proceedings of the 5th international conference on numerical grid generation in computational fluid dynamics and related fields. Mississippi State University; 1996. p. 719–29.

[23] Persson P-O. PDE-based gradient limiting for mesh size functions. In: Proceedings of the 13th international meshing roundtable. Williamsburg, United States: Sandia National Laboratory; 2004. p. 377–87.

[24] Rivara M-C, Calderon C, Fedorov A, Chrisochoides NP. Parallel decoupled terminal-edge bisection method for 3D mesh generation. Eng Comput 2006;22:111–9.

[25] Shewchuk JR. Delaunay refinement mesh generation. Ph.D. thesis, School of Computer Science, Carnegie Mellon University; 1997.

[26] Topping BHV, Cheng B. Parallel and distributed adaptive quadrilateral mesh generation. Comput Struct 1999;73(1–5):519–36.

[27] Topping BHV, Khan AI. Subdomain generation for non-convex parallel finite element domains. Adv Eng Softw 1996;25(2–3):253–66.

[28] Wilson JK, Topping BHV. Parallel adaptive tetrahedral mesh generation by the advancing front technique. Comput Struct 1998;68(1–3):57–78.

[29] Wu H, Guan X, Gong J. ParaStream: a parallel streaming Delaunay triangulation algorithm for LiDAR points on multicore architectures. Comput Geosci 2011;37(9):1355–63.

[30] Wu P, Houstis EN. Parallel adaptive mesh generation and decomposition. Eng Comput 1996;12(3–4):155–67.

[31] Zagaris G, Pirzadeh SZ, Chrisochoides NP. A framework for parallel unstructured grid generation for practical aerodynamic simulations. In: Proceedings of the 47th AIAA aerospace sciences meeting. AIAA – American Institute of Aeronautics and Astronautics; 2009.