Commit 96cd12d

few fixes + Frederic N. comments
phtournier committed Oct 14, 2024
1 parent 4e398e2 commit 96cd12d
Showing 3 changed files with 16 additions and 15 deletions.
software/freefempp/WP1/WP1.tex (8 changes: 4 additions & 4 deletions)
@@ -82,7 +82,7 @@ \subsection{Parallel Capabilities}

\textbf{Mesh partitioning} The next step of a FreeFEM simulation is to distribute the mesh among the MPI processes. This is done using either a user-defined partitioning or an automatic graph partitioner such as SCOTCH, METIS or ParMETIS; only ParMETIS has parallel capabilities. For large-scale problems, a large global mesh can be avoided by starting from a coarser description of the geometry with a coarser initial mesh, partitioning and distributing that coarse mesh, and finally splitting (element-wise) the local subdomain meshes in parallel to reach the desired level of refinement. Of course, this is not always applicable, depending on the complexity of the geometry or the nature of the input.
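
As a rough illustration of this coarse-then-split workflow, the following minimal FreeFEM sketch (assuming a build with the PETSc interface and the \textit{macro\_ddm.idp} helper macros; the cube geometry, resolutions and refinement factor are placeholders, and the exact splitting mechanism used in the production scripts may differ) distributes a coarse mesh and then refines the local subdomain meshes element-wise in parallel:
\begin{verbatim}
// Sketch: partition and distribute a coarse global mesh, then refine each
// local subdomain mesh element-wise (placeholder sizes and refinement factor).
load "msh3"
load "PETSc"
macro dimension()3// EOM
include "macro_ddm.idp"         // buildDmesh and related DDM helper macros

int nc = 10;                    // coarse resolution (placeholder)
mesh3 Th = cube(nc, nc, nc);    // coarse global mesh: cheap to build and partition
buildDmesh(Th)                  // partition (graph partitioner) and distribute
Th = trunc(Th, 1, split = 2);   // refine the local subdomain mesh in parallel
\end{verbatim}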

\textbf{Assembly} After the simulation mesh has been partitioned, it is distributed among the MPI processes as a decomposition into subdomains (either non-overlapping or overlapping), with each MPI process in charge of a local subdomain mesh. The neighbor-to-neighbor mapping is computed locally in each MPI process using redundant information from the partitioned global mesh. The linear system corresponding to the finite element discretization of the problem is then assembled in a distributed way, each MPI process assembling the local matrix of its subdomain independently. The distributed linear system can then be passed to a solver backend such as HPDDM, PETSc or the in-house ffddm (FreeFEM DD Methods) library, keeping track of the neighbor-to-neighbor mapping of the degrees of freedom.
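
A condensed sketch of this pattern (using a scalar Poisson problem as a stand-in for the actual physics, and assuming the PETSc interface together with \textit{macro\_ddm.idp}; solver options are illustrative):
\begin{verbatim}
// Sketch of distributed assembly: each MPI process assembles the matrix of
// its local subdomain mesh; the distributed Mat keeps the d.o.f. mapping.
load "msh3"
load "PETSc"
macro dimension()3// EOM
include "macro_ddm.idp"

mesh3 Th = cube(20, 20, 20);
buildDmesh(Th)                       // partition and distribute the mesh

fespace Vh(Th, P1);
varf vPb(u, v) = int3d(Th)(dx(u)*dx(v) + dy(u)*dy(v) + dz(u)*dz(v))
               + int3d(Th)(1.*v) + on(1, u = 0.);

Mat A;
createMat(Th, A, P1)                 // distributed matrix + neighbor mapping
matrix Loc = vPb(Vh, Vh);            // local assembly, independent per process
real[int] rhs = vPb(0, Vh);
A = Loc;                             // hand the local matrix to the backend

set(A, sparams = "-ksp_type gmres -ksp_rtol 1e-5");  // backend (PETSc) options
Vh u;
u[] = A^-1 * rhs;                    // distributed solve
\end{verbatim}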

\begin{itemize}
\item \textbf{Parallel Environment:} MPI
@@ -125,10 +125,10 @@ \subsection{Initial Performance Metrics}

\subsubsection{Benchmark \#1: 3D linear elasticity}

This benchmark consists of solving the 3D linear isotropic heterogeneous static elasticity equations, discretized with vector $P_1$, $P_2$ and $P_3$ elements, for a sequence of increasingly fine meshes. The computational domain is a hollow cylinder of inner radius 0.8, outer radius 1 and height 3. The gravitational acceleration is $g = 9.81$ and the density is $\rho = 7700$. The cylinder is composed of 10 alternating vertical layers of two materials with different mechanical properties, $(E_1, \nu_1) = (10^7, 0.45)$ and $(E_2, \nu_2) = (10^9, 0.35)$. An essential boundary condition of zero displacement is imposed on the bottom end of the cylinder, while a downward vertical force $f = 5 \times 10^5$ is applied on the top boundary through a natural boundary condition.\\
The mesh resolution is controlled by an input parameter ${n}_h$.\\
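
For reference, and assuming the standard isotropic linear elasticity convention (the script itself is the authoritative definition), the Lam\'e coefficients entering the variational formulation are obtained in each material layer from $(E_i, \nu_i)$ as
\[
\lambda_i = \frac{E_i \, \nu_i}{(1+\nu_i)(1-2\nu_i)}, \qquad
\mu_i = \frac{E_i}{2\,(1+\nu_i)}, \qquad i = 1, 2.
\]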

The benchmark is a FreeFEM script, publicly available in the distribution at~\url{https://github.com/FreeFem/FreeFem-sources/blob/develop/examples/hpddm/elasticity-3d-cylinder-PETSc.edp}. The script outputs the computing time of the different steps of the simulation (mesh generation, mesh partitioning, assembly, solution of the linear system) for a given mesh size parameter ${n}_h$ and polynomial order, both of which the user can specify as command-line parameters.\\
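
The flag names below are hypothetical (the script defines its own); they only sketch how such command-line parameters are typically read in a FreeFEM script and how a run could be launched:
\begin{verbatim}
// Hypothetical flag names; the benchmark script defines the actual ones.
include "getARGV.idp"
int nh      = getARGV("-nh", 10);      // mesh size parameter n_h
int feOrder = getARGV("-order", 1);    // polynomial order: 1, 2 or 3

// A run on 24 MPI processes could then look like:
//   ff-mpirun -np 24 elasticity-3d-cylinder-PETSc.edp -nh 20 -order 2
\end{verbatim}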

The mesh is constructed by first meshing the 2D base of the cylinder with the \textit{bamg} internal mesh generator and then extruding it with the \textit{buildlayers} command. The mesh is then partitioned using the automatic graph partitioner ParMETIS. For the solver, we use GMRES with a relative tolerance of $10^{-5}$, preconditioned by a two-level GenEO preconditioner built through the PETSc interface of HPDDM. The GenEO coarse space is built from the first 20 GenEO eigenvectors in each subdomain.
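
A condensed sketch of this construction follows; mesh densities and boundary labels are placeholders, and the PCHPDDM option names are assumed from PETSc's documentation rather than copied from the actual script:
\begin{verbatim}
// Sketch: 2D annular base meshed with the internal 2D mesher, extruded with
// buildlayers, then partitioned; solver options as described in the text.
load "msh3"
load "PETSc"
macro dimension()3// EOM
include "macro_ddm.idp"

border bOut(t=0., 2.*pi){x = 1.0*cos(t); y = 1.0*sin(t); label = 1;}
border bIn (t=0., 2.*pi){x = 0.8*cos(t); y = 0.8*sin(t); label = 2;}
mesh Base = buildmesh(bOut(200) + bIn(-160));          // hollow 2D base
mesh3 Th  = buildlayers(Base, 60, zbound = [0., 3.]);  // extrude to height 3

buildDmesh(Th)   // partition (e.g. with ParMETIS) and distribute

// GMRES with rtol 1e-5, two-level GenEO through PCHPDDM, 20 GenEO
// eigenvectors per subdomain (option names assumed, to be checked).
string solverOptions = "-ksp_type gmres -ksp_rtol 1e-5 -pc_type hpddm "
       + "-pc_hpddm_levels_1_sub_pc_type lu -pc_hpddm_levels_1_eps_nev 20";
\end{verbatim}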

@@ -154,7 +154,7 @@ \subsubsection{Benchmark \#1: 3D linear elasticity}
\label{tab:elasticity}
\end{table}

In~\cref{tab:elasticity} we report a weak scaling experiment, where the number of CPU cores scales with ${n}_h^3$ so as to keep the local problem sizes constant. We report the computing time of the different steps of the simulation (mesh construction, mesh partitioning, assembly and solution of the linear system) for $P_1$, $P_2$ and $P_3$ discretizations. The number of mesh elements varies from $\pgfmathprintnumber{55920}$ to $7.6$ million, with a corresponding increase of the number of CPU cores from 24 to 3000. For the finest mesh, the number of unknowns reaches 4 million for $P_1$, 31 million for $P_2$ and 105 million for $P_3$.
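
The local load indeed stays roughly constant across the run:
\[
\frac{\pgfmathprintnumber{55920}\ \text{elements}}{24\ \text{cores}} \approx 2300
\quad\text{and}\quad
\frac{7.6\times 10^{6}\ \text{elements}}{3000\ \text{cores}} \approx 2500
\ \text{elements per core}.
\]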

\textbf{Hardware and software setting:} The experiment is done on the Skylake partition of the Irene supercomputer. FreeFEM is compiled using the Intel 20.0.4 compiler suite.

software/freefempp/WP3/WP3.tex (16 changes: 8 additions & 8 deletions)
@@ -72,7 +72,7 @@ \subsection{Parallel Capabilities}
\label{sec:WP3:Freefem++:performances}

FreeFEM uses either direct or iterative solvers for the solution of the linear systems. In both cases, it is interfaced with high-performance CPU-based linear solver libraries that use MPI for parallelism. For direct methods, it is interfaced with the multifrontal solver MUMPS. For large-scale computations, FreeFEM relies on its extensive interface to the PETSc library. In particular, it can use the high-performance domain decomposition library HPDDM as the linear solver backend, which is available both through PETSc and through its own FreeFEM interface. After the input simulation mesh has been partitioned and distributed among the MPI processes, the linear system is naturally assembled and passed to the solver backend in a distributed way.\\
More details about scalability are given in the next section and in~\cref{sec:WP3:HPDDM:software}.
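
As an illustration of how the two solution paths are selected through the PETSc option system, here is a minimal sketch with a 2D Poisson stand-in; the GenEO setup used in the benchmarks additionally requires a local Neumann matrix, which is omitted here, so a simple one-level overlapping Schwarz preconditioner is shown instead:
\begin{verbatim}
// Direct (MUMPS) versus iterative solve of the same distributed system.
load "PETSc"
macro dimension()2// EOM
include "macro_ddm.idp"

mesh Th = square(100, 100);
buildDmesh(Th)
fespace Vh(Th, P1);
varf vA(u, v) = int2d(Th)(dx(u)*dx(v) + dy(u)*dy(v)) + int2d(Th)(v)
              + on(1, 2, 3, 4, u = 0.);
Mat A;
createMat(Th, A, P1)
A = vA(Vh, Vh);
real[int] b = vA(0, Vh);
Vh u;

// 1) Direct solve with the multifrontal solver MUMPS:
set(A, sparams = "-ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps");
u[] = A^-1 * b;

// 2) Iterative solve: GMRES with a one-level overlapping Schwarz preconditioner
//    (the benchmarks use the two-level GenEO variant through HPDDM/PCHPDDM):
set(A, sparams = "-ksp_type gmres -ksp_rtol 1e-5 -pc_type asm -sub_pc_type lu");
u[] = A^-1 * b;
\end{verbatim}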

\begin{itemize}
\item \textbf{Parallel Environment:} MPI
@@ -85,17 +85,17 @@ \subsection{Parallel Capabilities}
\subsection{Initial Performance Metrics}
\label{sec:WP3:Freefem++:metrics}

Here we present the current state of the software concerning the performance of the solution step for problems involving various physics. For large-scale problems, the main concern is the efficiency of the preconditioner, both in terms of number of iterations and of parallel efficiency. For problems with provably robust preconditioners, such as SPD problems preconditioned by a multi-level GenEO domain-decomposition method, the scalability is very good up to several thousand CPU cores, where the increase in coarse space size starts to hinder parallel efficiency (see~\cref{sec:WP3:HPDDM:software}). On the other hand, there are physics for which designing provably robust and efficient solvers is still an open challenge.\\

The first benchmark presented in this section illustrates this difficulty in the context of high-frequency electromagnetic wave propagation problems. We consider a very simple setting: the scattering of a point source in a homogeneous domain. The benchmark solves the time-harmonic second-order Maxwell's equations in a cube, where the source is a Gaussian point source at the center of the cube, with lowest-order absorbing boundary conditions on all 6 faces.\\
The benchmark is a FreeFEM script publicly available in the distribution. The script outputs the number of iterations and the computing time for a specific frequency that the user can specify as a command-line parameter. The performance can be evaluated through the PETSc output logs.\\

\begin{itemize}
\item \textbf{Overall Performance:} FreeFEM relies on high-performance solver libraries such as HPDDM and PETSc for the efficient solution of large-scale linear systems. State-of-the-art multilevel domain-decomposition solvers have been shown to be robust and scalable up to several thousand CPU cores (see~\cref{sec:WP3:HPDDM:software}). However, designing robust and efficient solvers for difficult physics, such as high-frequency wave propagation problems, remains an open challenge. FreeFEM is the ideal framework to tackle these open problems: it allows for a quick and easy description and setup of test cases involving various physics, and it is developed hand-in-hand with our research in scientific computing and parallel solvers. The FreeFEM/ffddm/HPDDM/PETSc ecosystem greatly facilitates the development of new methods, from research and prototyping all the way to HPC implementation.
\item \textbf{Input/Output Datasets:} Currently, the dataset for each benchmark consists of the FreeFEM script used to define and solve the corresponding problem. The scripts are open access and available in the FreeFEM distribution.
\item \textbf{Challenges:}
\begin{itemize}
\item definition of the geometry of the problem: relying on the construction of a global mesh which is then partitioned into subdomains and distributed among the MPI processes is impractical for large problems. This is discussed in more detail in~\cref{sec:WP1:Freefem++:software}.
\item for difficult physics such as high-frequency wave propagation, designing robust and efficient preconditioners is still an open challenge. In particular, in the context of multi-level domain decomposition methods, the construction of efficient coarse spaces of reasonable size can be challenging.
\item currently, the benchmark datasets consist only of the FreeFEM script generating and solving the problem. It should be discussed whether to also provide the (possibly very large) linear system to solve as a benchmark input, which would allow results to be compared with other solver libraries and software. The price is then the loss of information about the physics of the problem, effectively discarding non-algebraic solvers. Existing formats such as the PETSc matrix export formats should be investigated; a possible route is sketched after this list.
\end{itemize}
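
One possible route for the last point, sketched below under the assumption that the system is solved through the PETSc interface (here \textit{A} denotes the distributed matrix and \textit{b} the right-hand side, assembled as described in the previous sections), is to rely on PETSc's command-line viewers to write the assembled matrix and right-hand side in PETSc binary format:
\begin{verbatim}
// Dump the assembled system with PETSc command-line viewers (paths are
// placeholders); the files are written when the next KSP solve is performed.
set(A, sparams = "-ksp_view_mat binary:matrix.dat -ksp_view_rhs binary:rhs.dat");
u[] = A^-1 * b;
\end{verbatim}
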
@@ -119,7 +119,7 @@ \subsubsection{Benchmark \#1: Time-harmonic Maxwell equations at high frequency}
\end{equation}

where $k = 2\pi f$, $f \in \mathbb{R}^*_+$ is the frequency, $\Omega$ is the unit cube, $\partial\Omega$ denotes the boundary faces of $\Omega$, $\mathbf{n}(\mathbf{x})$ is the outward unit normal vector to $\partial\Omega$, and the source term is $\mathbf{f}(\mathbf{x}) = [0,\,0,\,e^{-50 k [(x-0.5)^2 + (y-0.5)^2 + (z-0.5)^2]}]$.\\
The problem is discretized with lowest-order N\'ed\'elec edge finite elements. The unit cube is discretized with a regular tetrahedral mesh (each elementary cube is cut into 6 tetrahedra sharing one privileged diagonal), with a constant mesh size corresponding to 10 points per wavelength $\lambda = \frac{1}{f}$.
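
With this rule (and the unit wave speed implied by $\lambda = 1/f$), the resolution grows linearly with the frequency: the unit cube is divided into $n = 10f$ intervals per direction, giving $6n^3$ tetrahedra and, for this particular subdivision with lowest-order edge elements, approximately $7n^3$ edge unknowns, i.e.
\[
f = 8:\quad 7 \times 80^{3} \approx 3.6\times 10^{6}, \qquad
f = 32:\quad 7 \times 320^{3} \approx 2.3\times 10^{8},
\]
consistent with the problem sizes reported below.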


The benchmark is a FreeFEM script, publicly available in the distribution at~\url{https://github.com/FreeFem/FreeFem-sources/blob/master/examples/hpddm/maxwell-mg-3d-PETSc-complex.edp}. The script outputs the number of iterations and the computing time for a specific frequency $f$ that the user can specify as a command-line parameter. The performance can be evaluated through the PETSc output logs. The number of iterations of the Krylov solver, the FLOPS and the execution time are reported for increasing frequencies.\\
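
The PETSc performance data mentioned above are typically obtained by adding standard PETSc monitoring and logging options to the solver parameters, for instance (an illustrative fragment, with \textit{A} the distributed matrix of the system):
\begin{verbatim}
// Print the Krylov residual history, the convergence reason and, at the end
// of the run, the full PETSc performance summary (FLOPS, timings).
set(A, sparams = "-ksp_monitor -ksp_converged_reason -log_view");
\end{verbatim}
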
@@ -151,22 +151,22 @@ \subsubsection{Benchmark \#1: Time-harmonic Maxwell equations at high frequency}
\label{tab:maxwell}
\end{table}

In~\cref{tab:maxwell} we report a weak scaling experiment, where the number of CPU cores scales with $f^3$ so as to keep the local problem sizes constant. We report the computing time of the different steps of the simulation: mesh partitioning, assembly of the linear system, construction of the preconditioner, solution of the linear system, and VTK export of the solution. The frequency varies from 8 to 32, with a corresponding increase of the number of unknowns from 3.6 million to 230 million. The number of cores increases from 48 to 3072.
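
The reported core counts follow this cubic rule exactly:
\[
\left(\frac{32}{8}\right)^{3} = 64, \qquad 48 \times 64 = 3072 .
\]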

\textbf{Hardware and software setting:} The experiment is done on the Skylake partition of the Irene supercomputer. FreeFEM is compiled using the Intel 20.0.4 compiler suite.

The results can be summarized as follows:

\begin{itemize}
\item The parallel assembly of the linear system, the setup of the preconditioner and the export to VTK are all local operations performed independently on each subdomain. The local matrices are assembled on the local meshes of the subdomains, and the construction of the preconditioner mainly amounts to performing the LU factorization of the two local matrices (corresponding to the fine and coarse levels) using the direct solver \textit{MUMPS}. It is therefore natural that the computing time of these steps stays approximately constant throughout the weak scaling experiment.
\item As discussed in~\cref{sec:WP1:Freefem++:software}, the cost of the initial partitioning of the global mesh using \textit{ParMETIS} increases with the problem size. This can become a bottleneck for large-scale simulations. Moreover, without relying on an initial coarsening of the global mesh, the memory cost of holding the global mesh becomes intractable. This is evidenced here, where going higher than $f=32$ produces an out-of-memory error.
\item Even though the setup cost of the preconditioner is cheap and scalable, the number of iterations increases with the frequency, from 7 iterations for $f=8$ to 45 iterations for $f=32$. Correspondingly, the computing time of the solution step increases from 24 to 148 seconds. In both cases the increase is a little more than linear with respect to the frequency, as the short check after this list shows.
\end{itemize}
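
A quick check of the growth quoted in the last item:
\[
\frac{45}{7} \approx 6.4, \qquad \frac{148\ \text{s}}{24\ \text{s}} \approx 6.2, \qquad \frac{32}{8} = 4,
\]
so both the iteration count and the solve time grow a little faster than linearly with the frequency.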

\subsection{12-Month Roadmap}
\label{sec:WP3:Freefem++:roadmap}

See~\cref{sec:WP1:Freefem++:roadmap} for the plans regarding the definition of the initial geometry and mesh in a distributed setting for large-scale problems.\\

Currently, FreeFEM relies on a set of high-level macros written in the FreeFEM language that the user has to call for the initial mesh partitioning and decomposition. In combination with the point above concerning distributed mesh generation, we will define parallel data structures for handling distributed meshes and finite element spaces. The goal is to allow the user to go from a sequential script to a parallel one in an almost completely transparent way, with minimal changes to the script, e.g.\ by only changing the type of the mesh variable from \textit{mesh} to \textit{Dmesh}.\\


