Performance study on Vilje

The parallel performance of OpenFOAM has been investigated on the HPC system Vilje. Several benchmark cases have been run, and the old 1.7.1 and the new 2.1.1 versions of OpenFOAM have been compared. In addition, the performance of two linear system solvers, PCG (Preconditioned Conjugate Gradient) and GAMG (Geometric-Algebraic Multi-Grid), has been studied in detail.

Vilje is a massively parallel SGI Altix Ice X computer with 1404 nodes, each with two eight-core Intel Xeon E5-2670 CPUs and 32 GB of memory. The interconnect is FDR and FDR-10 InfiniBand. Because of the design of Vilje, the large parallel cases (more than 4 nodes) are run on multiples of 9 nodes. The nodes are grouped in IRUs of 18 nodes each, and each IRU has two switches that each connect 9 nodes. Filling these IRUs completely is beneficial with respect to both communication and fragmentation of the job queue.
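The node layout described above can be turned into a small helper. This is a minimal sketch, not part of any Vilje tooling: the function names are made up for illustration, and it assumes one MPI process per core, which matches the 16 processes per node used in the benchmarks of this study.

```python
CORES_PER_NODE = 16    # two eight-core Xeon E5-2670 CPUs per node
NODES_PER_SWITCH = 9   # each 18-node IRU has two switches with 9 nodes each


def mpi_processes(nodes: int) -> int:
    """MPI process count when every core runs one process (assumption)."""
    return nodes * CORES_PER_NODE


def round_up_to_switch_group(nodes: int) -> int:
    """Round a request of more than 4 nodes up to the next multiple of 9."""
    if nodes <= 4:
        return nodes
    groups = -(-nodes // NODES_PER_SWITCH)  # ceiling division
    return groups * NODES_PER_SWITCH


# Node counts used in the benchmarks of this study.
node_counts = [1, 2, 4, 9, 18, 27, 36, 72, 144, 288]
for n in node_counts:
    print(f"{n:>3} nodes -> {mpi_processes(n):>4} MPI processes")
```

All node counts above 4 in the list are already multiples of 9, so the rounding helper leaves them unchanged.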

Unless explicitly stated otherwise, all cases are run with OpenFOAM version 2.1.1.

3D cavity tutorial with icoFoam

The classic cavity tutorial supplied with OpenFOAM is extended from two to three dimensions and used as a benchmark. The front and back patches are converted to walls, so that the domain is a cube with 5 stationary walls and one moving wall. The Reynolds number has been increased from 10 to 1000. Some important parameters of the simulation are:

Reynolds number	1000
Kinematic viscosity	0.0001 m^2/s
Cube dimension	0.1x0.1x0.1 m
Lid velocity	1 m/s
deltaT	0.0001 s
Number of time steps	200
Solutions written to disk	8
Solver for pressure eqn.	PCG w/ DIC
Decomposition method	Simple
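As a quick consistency check of the parameters in the table, the Reynolds number and the simulated time can be recomputed from the listed values. The Courant number estimate is only a sketch: it assumes a uniform mesh with the same number of cells along each side of the cube, which is an assumption and not something stated in the table.

```python
# Values copied from the parameter table of the 3D cavity case.
lid_velocity = 1.0            # m/s
cube_side = 0.1               # m
kinematic_viscosity = 1.0e-4  # m^2/s
delta_t = 1.0e-4              # s
time_steps = 200
writes = 8

reynolds = lid_velocity * cube_side / kinematic_viscosity
simulated_time = time_steps * delta_t
steps_per_write = time_steps // writes  # assumes evenly spaced writes


def lid_courant_number(cells_per_side: int) -> float:
    """Courant number U*dt/dx at lid speed, assuming a uniform mesh."""
    dx = cube_side / cells_per_side
    return lid_velocity * delta_t / dx


print(f"Reynolds number: {reynolds:.0f}")
print(f"Simulated time:  {simulated_time:.3f} s")
print(f"Steps per write: {steps_per_write}")
# 100 cells per side (one million cells) is a hypothetical uniform mesh.
print(f"Courant number with 100 cells per side: {lid_courant_number(100):.2f}")
```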

The other parameters are unchanged from the case supplied with the OpenFOAM distribution. The following matrix shows the investigated cases. Mesh sizes in millions of cells are indicated by the suffix M and the number of nodes by the suffix N; the number of MPI processes is 16 times the number of nodes:

	1M	3.4M	8M	15.6M	27M
1N	Yes	Yes	Yes	Yes	No
2N	Yes	Yes	Yes	Yes	Yes
4N	Yes	Yes	Yes	Yes	Yes
9N	Yes	Yes	Yes	Yes	Yes
18N	Yes	Yes	Yes	Yes	Yes
27N	Yes	Yes	Yes	Yes	Yes
36N	Yes	Yes	Yes	Yes	Yes
72N	Yes	Yes	Yes	Yes	Yes
144N	Yes	Yes	Yes	Yes	Yes
288N	No	No	Yes	Yes	Yes

The 27-million-cell mesh is not run on one node because a single compute node does not have enough memory for this case. The two smallest meshes are not run on 288 nodes either.

The results from this scaling study are presented as plots of speedup and parallel efficiency. All results are based on total analysis time, including all startup overhead:

The speedup and parallel efficiency are calculated with the lowest number of nodes as the reference, i.e. relative to 1 node for all meshes except the 27-million-cell mesh, where the reference is 2 nodes. The trend is clear: the speedup is first linear, then gradually superlinear up to a certain point, where it suddenly drops. The smallest cases do not show the linear behavior; to see it we would probably have to run them with fewer than 16 processes. One important feature of the plots is that the speedup peaks at a surprisingly low number of cells per process, between 3000 and 12000.
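To relate the range of 3000 to 12000 cells per process to the case matrix, the workload per process can be computed for every mesh and node count. The sketch below uses the nominal mesh sizes and assumes 16 MPI processes per node. It only lists which configurations fall inside the range; it does not reproduce the measured position of the speedup peaks.

```python
CORES_PER_NODE = 16

# Nominal mesh sizes in millions of cells and node counts from the case matrix.
mesh_sizes_millions = [1.0, 3.4, 8.0, 15.6, 27.0]
node_counts = [1, 2, 4, 9, 18, 27, 36, 72, 144, 288]


def cells_per_process(cells: float, nodes: int) -> float:
    """Average number of cells handled by each MPI process."""
    return cells / (nodes * CORES_PER_NODE)


def nodes_in_range(cells: float, low: float = 3000, high: float = 12000):
    """Node counts for which the workload per process lies in [low, high]."""
    return [n for n in node_counts if low <= cells_per_process(cells, n) <= high]


for size in mesh_sizes_millions:
    print(f"{size:>5} M cells: {nodes_in_range(size * 1e6)} nodes")
```

For example, the 8-million-cell mesh on 72 nodes (1152 processes) gives roughly 6900 cells per process.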

Another interesting study is to replace the PCG solver used for the pressure equation above with a GAMG solver. From ref. [1] we expect this to reduce analysis times, but with a strong penalty in scalability:

The results presented above show that the GAMG solver is much faster than the PCG solver when the number of processes is small, but that its scalability is poor. For the 8-million-cell case on one node (16 processes), the total analysis time is 980 seconds with the GAMG solver and 5567 seconds with the PCG solver. When the number of nodes is increased to 9 (144 processes), the execution times are 179 and 455 seconds, respectively. Above this level the GAMG solver completely fails to scale, while the PCG solver continues to scale superlinearly up to 72 nodes (1152 processes). Note that the tolerance settings for the solvers are identical in both cases.
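The timings quoted above are enough to work out speedup and parallel efficiency from 1 to 9 nodes for both solvers. The sketch below uses the usual definitions (speedup as the ratio of reference time to measured time, efficiency as speedup divided by the increase in node count); the function names are illustrative only.

```python
def speedup(t_ref: float, t: float) -> float:
    """Speedup relative to the reference run."""
    return t_ref / t


def parallel_efficiency(t_ref: float, nodes_ref: int, t: float, nodes: int) -> float:
    """Speedup divided by the ideal speedup nodes / nodes_ref."""
    return speedup(t_ref, t) * nodes_ref / nodes


# Total analysis times in seconds for the 8-million-cell cavity case,
# taken from the text: {solver: {nodes: seconds}}.
timings = {
    "GAMG": {1: 980.0, 9: 179.0},
    "PCG": {1: 5567.0, 9: 455.0},
}

for solver, t in timings.items():
    s = speedup(t[1], t[9])
    e = parallel_efficiency(t[1], 1, t[9], 9)
    print(f"{solver}: speedup {s:.2f}, parallel efficiency {e:.2f}")

for nodes in (1, 9):
    ratio = timings["PCG"][nodes] / timings["GAMG"][nodes]
    print(f"{nodes} node(s): PCG takes {ratio:.1f} times as long as GAMG")
```

An efficiency above 1 for PCG corresponds to the superlinear speedup described above, while GAMG stays well below 1 but is still the faster solver in absolute time at 9 nodes.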

The case files, together with the numerical results, can be downloaded as cavity.tar.gz

3D pitzDaily with pisoFoam

The pitzDaily tutorial supplied with OpenFOAM is an example of a large eddy simulation (LES). The solver used is pisoFoam together with the oneEqEddy eddy viscosity model. The case can be seen on YouTube. This is the same case as used in ref. [1], with a small difference in geometry.

As with the previous case, the mesh has been extended from two to three dimensions, and the added sidewalls are treated as solid walls (no-slip conditions). All postprocessing, statistics and sampling probes are removed from the analysis. The flow parameters (velocity, viscosity) are left unchanged. Some important control parameters are:

deltaT	1e-6 s
Number of time steps	200
Solutions written to disk	8
Solver for pressure eqn.	PCG w/ DIC
Decomposition method	Scotch

The mesh and decomposition configuration is as follows:

	1M	2M	4M	8M	16M
1N	Yes	Yes	Yes	Yes	Yes
2N	Yes	Yes	Yes	Yes	Yes
4N	Yes	Yes	Yes	Yes	Yes
9N	Yes	Yes	Yes	Yes	Yes
18N	Yes	Yes	Yes	Yes	Yes
27N	Yes	Yes	Yes	Yes	Yes
36N	Yes	Yes	Yes	Yes	Yes
72N	Yes	Yes	Yes	Yes	Yes
144N	Yes	Yes	Yes	Yes	Yes
288N	No	No	Yes	Yes	Yes

The results shown here, for both the cavity and pitzDaily cases, indicate that the cases and solvers we have tested scale very well with the PCG solver. The workload per process can be as low as 4000-10000 cells. The reason for the superlinear speedup might be the large L3 cache of 20 MB on each CPU chip. We suspect that a major part of the working arrays fits in L3 at each stage of the computation when the case is distributed over a large number of nodes. When the simulations are run on few nodes, memory bandwidth might be a limiting factor, which is gradually eliminated as the number of cache misses goes down with the increasing number of participating processors. Remember that when using for example 72 nodes, the total accumulated L3 cache is 20 MB/CPU x 2 CPUs/node x 72 nodes = 2880 MB, which can hold approximately 360 million doubles (8 bytes each). Again, this speedup behavior is consistent with others' findings, for example in ref. [1] and [2].
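The cache argument can be made a little more concrete with some arithmetic. The sketch below is a rough estimate only: it assumes decimal megabytes, that each of the 8 cores on a CPU gets an equal share of the 20 MB L3 cache, and that every value stored is an 8-byte double. None of these assumptions come from measurements in this study.

```python
L3_PER_CPU_MB = 20
CPUS_PER_NODE = 2
CORES_PER_CPU = 8
BYTES_PER_DOUBLE = 8


def aggregate_l3_mb(nodes: int) -> int:
    """Total L3 cache over all participating nodes, in MB."""
    return L3_PER_CPU_MB * CPUS_PER_NODE * nodes


def doubles_per_cell_in_l3(cells_per_process: float) -> float:
    """Doubles per cell that fit in one core's assumed equal share of L3."""
    l3_bytes_per_core = L3_PER_CPU_MB * 1_000_000 / CORES_PER_CPU
    return l3_bytes_per_core / BYTES_PER_DOUBLE / cells_per_process


print(f"Aggregate L3 on 72 nodes: {aggregate_l3_mb(72)} MB")
for cells in (4000, 10000):
    print(f"{cells} cells/process: about {doubles_per_cell_in_l3(cells):.0f} "
          "doubles per cell fit in the per-core L3 share")
```

Under these assumptions, a process with 4000 to 10000 cells has room for a few tens of doubles per cell in cache, which is at least compatible with the suspicion that much of the working set becomes cache resident at high process counts.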

The case files, together with the numerical results, can be downloaded as pitzDaily.tar.gz

Comparing OF 1.7.1 with OF 2.1.1

Two of the cases from the pitzDaily benchmark above are picked as references to compare the performance of OpenFOAM versions 1.7.1 and 2.1.1 on Vilje.

As the graph shows, OpenFOAM version 2.1.1 performs a little better than 1.7.1 for large parallel cases. Note that the runTimeModifiable switch is on in both cases, and the difference in execution time might simply be that version 2.x.x handles this better than previous versions (which has been confirmed by the developers).

"CX-benchmark" with GAMG solver

The case simulated here is an incompressible 3D LES using pisoFoam. The setup is external airflow around the superstructure of a ship, "CX". The mesh is created with snappyHexMesh and afterwards redistributed with the scotch decomposition. No run-time postprocessing (such as calculating drag forces) is performed. This is the same case that was used in ref. [3], except for small modifications to the mesh and boundary conditions. The runTimeModifiable switch is off in all runs, and the case is simulated for 200 timesteps (up from 100 in ref. [3]). Two new meshes of 10 and 18 million cells are also added to show the behavior of the solver as the problem size increases.

This benchmark is used to investigate the difference between the GAMG and PCG linear solvers:

These results are consistent with what ref. [1] shows, and the general scaling of the case with the GAMG solver is the same as in ref. [3]. It is evident that the GAMG solver is a lot faster than the PCG solver for the same tolerance levels. Although its parallel efficiency is far lower than that of the PCG solver, GAMG should be the preferred choice, because the total analysis time with the GAMG solver is about 30% of the total time with the PCG solver.

Analysis of the GAMG solver

The effect of startup overhead

As shown in the previous sections, there are scaling problems related to the GAMG solver. We suspect a large startup overhead, because the linear solver must calculate the different coarsening levels before solving the first timestep (the different grids are cached between timesteps). To investigate this, a new and otherwise identical analysis was done with the number of timesteps increased by one order of magnitude, from 200 to 2000.

At the highest number of processes, increasing the number of timesteps gives significantly better scaling. Despite this, the scaling is still not good.

Important: Nothing in these results indicates that the GAMG solver alone is responsible for the startup overhead; the results with the PCG solver may be affected as well. The startup overhead may in fact be a general OpenFOAM behavior, independent of the linear system solver(s). This is not investigated further here.
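One simple way to reason about two runs that differ only in the number of timesteps is a linear cost model, total time = startup time + timesteps x time per timestep. The sketch below solves this model for the two unknowns. It assumes a constant cost per timestep, which is a simplification, and the numbers in the example call are hypothetical placeholders, not measurements from this study.

```python
def estimate_startup(steps_a: int, time_a: float, steps_b: int, time_b: float):
    """Fit total_time = startup + steps * per_step to two runs.

    Returns (startup_seconds, seconds_per_step).
    """
    per_step = (time_b - time_a) / (steps_b - steps_a)
    startup = time_a - steps_a * per_step
    return startup, per_step


# Hypothetical total run times in seconds, chosen only to illustrate the
# arithmetic: 100 s for 200 timesteps and 550 s for 2000 timesteps.
startup, per_step = estimate_startup(200, 100.0, 2000, 550.0)
print(f"Estimated startup overhead: {startup:.1f} s")
print(f"Estimated time per timestep: {per_step:.3f} s")
print(f"Startup share of the 200-step run: {startup / 100.0:.0%}")
```

With such a model, a fixed startup cost weighs ten times less in the 2000-step run than in the 200-step run, which is the effect the longer analysis is meant to expose.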

Profiling with IPM

The IPM profiling tool is described on a separate page, and we will use the modified pisoFoamIPM solver to see where the scalability bottlenecks are. We will use the same regions as defined on the profiling page. The case is the "CX" benchmark described above.

The results confirm that it is the solution of the pressure equation that limits the speedup. It is also interesting to note that all other parts of the solution process scale well; in fact, all parts other than the pressure solve scale superlinearly, as we found in the cavity and pitzDaily benchmarks. The contError and momentumCorrector phases have been taken out of the graphs to increase readability, since their share of the total time is marginal compared to the other sections.

Closer investigation of the IPM results shows that it is a number of calls to MPI_Allreduce that limit the speedup. Again, this is consistent with the results from ref. [1]. In the 36-node case the solver spends 48% of the total time it uses to solve the pressure equation waiting for calls to MPI_Allreduce to finish, and in the 72-node case MPI_Allreduce accounts for 59% of the total time in pressureCorrector. Total communication time (all MPI communication calls) in the pressure solving part is 75% and 83% for the 36- and 72-node cases, respectively.
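The percentages above can be split into three shares of the time spent in the pressure solve: MPI_Allreduce, all other MPI calls, and everything that is not MPI communication. The sketch below only does this subtraction; it assumes that both percentages for a given node count refer to the same total, the time spent in pressureCorrector.

```python
# Percent of pressureCorrector time, taken from the text: {nodes: shares}.
pressure_phase = {
    36: {"allreduce": 48, "all_mpi": 75},
    72: {"allreduce": 59, "all_mpi": 83},
}


def breakdown(allreduce: int, all_mpi: int) -> dict:
    """Split the pressure solve time (in percent) into three parts."""
    return {
        "MPI_Allreduce": allreduce,
        "other MPI": all_mpi - allreduce,
        "non-MPI work": 100 - all_mpi,
    }


for nodes, shares in pressure_phase.items():
    parts = breakdown(shares["allreduce"], shares["all_mpi"])
    summary = ", ".join(f"{name} {value}%" for name, value in parts.items())
    print(f"{nodes} nodes: {summary}")
```

Going from 36 to 72 nodes, the share of non-MPI work in the pressure solve drops from 25% to 17%, while MPI_Allreduce alone grows to more than half of the time.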

Unfortunately, we cannot provide the complete case files for this case here. You can still download some relevant files and results as CX-pisoFoamPublic.tar.gz

Concluding remarks

OpenFOAM as an application framework scales well on the massively parallel computer Vilje.
The choice of linear equation solver(s) has a strong impact on both running time and parallel efficiency.
For the investigated incompressible flow scenarios, specifying the GAMG solver for the pressure equation completely outperforms the PCG solver when the number of processes is small. Because of its low running time, the GAMG linear solver should be the first choice for any (incompressible) analysis if the number of processes can be held at a moderate level (up to about 9 nodes/144 processes).
The GAMG linear solver scales poorly and should not be used for large parallel cases (about 18 nodes/288 processes and above) without further benchmarks on each individual case.

References

The following documents, websites and publications are used as references in this work:

[1] Parallel Aspects of OpenFOAM with Large Eddy Simulations, Orlando Rivera, Leibniz Supercomputing Centre, Karl Fürlinger, Ludwig-Maximilians-Universität
[2] Porting OpenFOAM to HECToR, Gavin J. Pringle, EPCC, The University of Edinburgh
[3] I/O-profiling with Darshan, Bjørn Lindi, NTNU