
Vectorize coordinates._construct_face_centroids() #1117

Open · wants to merge 3 commits into base: main
Conversation

erogluorhan
Member

@erogluorhan commented Dec 21, 2024

Closes #1116.

Overview

Optimizes coordinates.py/_construct_face_centroids() by replacing the for-loop with vectorized operations using a masking method, following @hongyuchen1030's older suggestion.

Further optimization of the code would be possible, but that can be handled in future PRs.
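The masked-sum idea in this PR can be sketched as follows. This is a minimal, hypothetical sketch rather than the PR's exact code; in particular, the `np.where` clamp that keeps fill values in bounds and the final renormalization are assumptions added here for self-containment:

```python
import numpy as np

def construct_face_centroids_masked(node_x, node_y, node_z,
                                    face_nodes, n_nodes_per_face,
                                    fill_value=-1):
    # Boolean mask: True where a connectivity slot holds a real node index
    mask = face_nodes != fill_value
    # Clamp fill slots to index 0 so fancy indexing stays in bounds;
    # the mask zeroes their contribution out of the sums below
    safe_nodes = np.where(mask, face_nodes, 0)

    centroid_x = np.sum(node_x[safe_nodes] * mask, axis=1) / n_nodes_per_face
    centroid_y = np.sum(node_y[safe_nodes] * mask, axis=1) / n_nodes_per_face
    centroid_z = np.sum(node_z[safe_nodes] * mask, axis=1) / n_nodes_per_face

    # Project the averaged points back onto the unit sphere
    norm = np.sqrt(centroid_x**2 + centroid_y**2 + centroid_z**2)
    return centroid_x / norm, centroid_y / norm, centroid_z / norm
```

Every face is handled in one set of array operations, so the Python-level loop over faces disappears entirely.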

FYI @philipc2: The masking method this PR uses might be useful for cases we previously dealt with via partitioning, e.g. the nodal averaging problem we looked into last year.

FYI @rajeeja: This is where the Cartesian-based centroid calculations are being fixed. Welzl is a bit different, but you may still want to look into it.

Results

#1116 shows the profiling results for the existing code on a 4 GB SCREAM dataset, where the execution time is around 5 minutes.

The optimized code runs in around 6 seconds; see below:

[Screenshot: profiling output for the optimized code, Dec 21, 2024]

PR Checklist

General

  • An issue is created and linked
  • Add appropriate labels
  • Filled out Overview and Expected Usage (if applicable) sections

Testing

  • Tests cover all possible logical paths in your function

@rajeeja
Contributor

rajeeja commented Dec 21, 2024

Good fix; I will test Welzl and work out an appropriate fix there.

@philipc2 added the run-benchmark (Run ASV benchmark workflow) label Dec 21, 2024
@philipc2 added this to the Scalability & Performance milestone Dec 21, 2024

github-actions bot commented Dec 21, 2024

ASV Benchmarking

Benchmark Comparison Results

Benchmarks that have improved:

Change Before [0e45eda] After [832aae7] Ratio Benchmark (Parameter)
- 452M 408M 0.9 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc'))
- 640M 398M 0.62 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc'))
- 488M 390M 0.8 mpas_ocean.Integrate.peakmem_integrate('480km')

Benchmarks that have stayed the same:

Change Before [0e45eda] After [832aae7] Ratio Benchmark (Parameter)
404M 404M 1.00 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc'))
439M 443M 1.01 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc'))
19.3±0.1ms 19.3±0.1ms 1.00 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc'))
7.84±0.1ms 7.65±0.1ms 0.97 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc'))
44.5±0.3ms 44.3±0.2ms 0.99 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc'))
4.10±0.09ms 4.08±0.1ms 0.99 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc'))
3.18±0.04s 3.21±0.04s 1.01 import.Imports.timeraw_import_uxarray
691±6μs 661±5μs 0.96 mpas_ocean.CheckNorm.time_check_norm('120km')
446±3μs 451±3μs 1.01 mpas_ocean.CheckNorm.time_check_norm('480km')
686±20ms 698±3ms 1.02 mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('120km')
43.4±0.8ms 43.7±0.5ms 1.01 mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('480km')
1.86±0.02ms 1.90±0.03ms 1.02 mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('120km')
702±80μs 603±20μs ~0.86 mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('480km')
6.28±0.04ms 6.25±0.09ms 0.99 mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('120km')
3.70±0.05ms 3.70±0.06ms 1.00 mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('480km')
3.58±0.02s 3.62±0.02s 1.01 mpas_ocean.ConstructFaceLatLon.time_welzl('120km')
230±0.7ms 230±1ms 1.00 mpas_ocean.ConstructFaceLatLon.time_welzl('480km')
1.20±0.01μs 1.31±0μs 1.09 mpas_ocean.ConstructTreeStructures.time_ball_tree('120km')
297±1ns 313±3ns 1.05 mpas_ocean.ConstructTreeStructures.time_ball_tree('480km')
807±6ns 815±2ns 1.01 mpas_ocean.ConstructTreeStructures.time_kd_tree('120km')
284±3ns 300±1ns 1.06 mpas_ocean.ConstructTreeStructures.time_kd_tree('480km')
442±9ms 460±7ms 1.04 mpas_ocean.CrossSection.time_const_lat('120km', 1)
227±1ms 236±4ms 1.04 mpas_ocean.CrossSection.time_const_lat('120km', 2)
118±2ms 122±1ms 1.03 mpas_ocean.CrossSection.time_const_lat('120km', 4)
366±3ms 375±5ms 1.02 mpas_ocean.CrossSection.time_const_lat('480km', 1)
181±3ms 186±3ms 1.03 mpas_ocean.CrossSection.time_const_lat('480km', 2)
94.2±0.7ms 95.8±2ms 1.02 mpas_ocean.CrossSection.time_const_lat('480km', 4)
127±2ms 127±0.5ms 1.00 mpas_ocean.DualMesh.time_dual_mesh_construction('120km')
9.21±0.3ms 9.12±0.5ms 0.99 mpas_ocean.DualMesh.time_dual_mesh_construction('480km')
1.05±0.01s 1.09±0.01s 1.03 mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', False)
53.7±1ms 55.1±0.7ms 1.03 mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', True)
85.2±0.6ms 87.5±1ms 1.03 mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', False)
5.66±0.2ms 5.89±0.1ms 1.04 mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', True)
345M 343M 0.99 mpas_ocean.Gradient.peakmem_gradient('120km')
321M 320M 1.00 mpas_ocean.Gradient.peakmem_gradient('480km')
2.85±0.04ms 2.82±0.03ms 0.99 mpas_ocean.Gradient.time_gradient('120km')
319±8μs 321±5μs 1.01 mpas_ocean.Gradient.time_gradient('480km')
230±10μs 247±3μs 1.08 mpas_ocean.HoleEdgeIndices.time_construct_hole_edge_indices('120km')
116±0.7μs 119±0.6μs 1.03 mpas_ocean.HoleEdgeIndices.time_construct_hole_edge_indices('480km')
406M 406M 1.00 mpas_ocean.Integrate.peakmem_integrate('120km')
178±2ms 177±0.7ms 0.99 mpas_ocean.Integrate.time_integrate('120km')
12.1±0.06ms 12.0±0.03ms 0.99 mpas_ocean.Integrate.time_integrate('480km')
359±3ms 368±2ms 1.03 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'exclude')
357±0.9ms 365±2ms 1.02 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'include')
370±10ms 366±2ms 0.99 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'split')
24.7±0.2ms 24.1±0.6ms 0.98 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'exclude')
24.6±0.2ms 24.5±0.2ms 1.00 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'include')
24.3±0.3ms 24.2±0.5ms 1.00 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'split')
56.3±0.3ms 56.4±0.1ms 1.00 mpas_ocean.RemapDownsample.time_inverse_distance_weighted_remapping
45.8±0.1ms 46.2±0.2ms 1.01 mpas_ocean.RemapDownsample.time_nearest_neighbor_remapping
360±1ms 361±0.9ms 1.00 mpas_ocean.RemapUpsample.time_inverse_distance_weighted_remapping
266±0.8ms 266±0.4ms 1.00 mpas_ocean.RemapUpsample.time_nearest_neighbor_remapping
318M 316M 0.99 quad_hexagon.QuadHexagon.peakmem_open_dataset
316M 316M 1.00 quad_hexagon.QuadHexagon.peakmem_open_grid
7.09±0.2ms 7.03±0.3ms 0.99 quad_hexagon.QuadHexagon.time_open_dataset
5.78±0.1ms 5.81±0.07ms 1.00 quad_hexagon.QuadHexagon.time_open_grid

@philipc2 removed the run-benchmark (Run ASV benchmark workflow) label Dec 21, 2024
@philipc2
Member

Nice work @erogluorhan

It appears that a few ASV benchmarks are failing (which may or may not be due to the changes in this PR). I can look deeper into it on Monday.

@erogluorhan
Member Author

That all sounds good; thanks, both @rajeeja and @philipc2!

@erogluorhan
Member Author

I can't see any failing benchmarks though! @philipc2

@philipc2
Member

I can't see any failing benchmarks though! @philipc2

If you take a look at the GitHub bot comment above, it shows which ones failed. If they failed both before and after, it means something was wrong before this PR.

Looks like this PR didn't break anything; it must have been an issue with how the benchmark was written before.

@erogluorhan
Member Author

Ok, I see what you mean now, @philipc2 !

Also, regarding benchmarking, I'd expect mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('120km') to show the difference this PR makes with the optimization, but it does not. I think the dataset is pretty small for that, which makes me think big-data benchmarking would be important in the future.

@philipc2 added the run-benchmark (Run ASV benchmark workflow) label Dec 23, 2024
Comment on lines +335 to +337
centroid_x = np.sum(node_x[face_nodes] * mask, axis=1) / n_nodes_per_face
centroid_y = np.sum(node_y[face_nodes] * mask, axis=1) / n_nodes_per_face
centroid_z = np.sum(node_z[face_nodes] * mask, axis=1) / n_nodes_per_face
Member

This is problematic for grids that have variable element sizes.

face_node_connectivity will contain fill values, which will lead to an index error here.

IndexError: index -9223372036854775808 is out of bounds for axis 0 with size 83886080

This wasn't an issue for the SCREAM dataset because it is completely composed of quads.
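The failure mode can be reproduced in isolation; the snippet below is a hypothetical minimal demonstration (not from the PR), including one possible clamp-and-mask fix:

```python
import numpy as np

node_x = np.array([0.0, 1.0, 2.0])
# A triangular face padded to 4 slots, using the fill value from the
# traceback above (the minimum int64)
fill = np.iinfo(np.int64).min
face_nodes = np.array([[0, 1, 2, fill]], dtype=np.int64)
mask = face_nodes != fill

try:
    # Direct fancy indexing hits the fill value and raises IndexError
    np.sum(node_x[face_nodes] * mask, axis=1)
except IndexError as e:
    print("IndexError:", e)

# One possible fix: clamp fills to a valid dummy index, then mask them out
safe = np.where(mask, face_nodes, 0)
sums = np.sum(node_x[safe] * mask, axis=1)   # sums == array([3.])
print(sums)
```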

Contributor

How about fixing this and changing/standardizing the way we handle connectivity?

Member

changing/standardizing the way we handle connectivity

Can you elaborate? I've explored storing connectivity as partitioned arrays in #978, although for something as simple as computing centroids, Numba seems to offer a significantly simpler implementation, performing close to compiled-language speed without needing to create an additional mask.

Contributor

My understanding is that with fill values like -1 the mask approach fails; isn't that correct? So I was thinking that adopting a convention that allows masking to work would simplify things.

I'm not sure about the performance: NumPy's vectorized operations are highly optimized for numerical computations and might outperform the explicit loop here, even when the loops are parallelized with Numba.

@philipc2
Member

Here is a minimal example that fails. Feel free to create a test case from this:

import numpy as np
import uxarray as ux

node_lon = np.array([-20.0, 0.0, 20.0, -20, -40])
node_lat = np.array([-10.0, 10.0, -10.0, 10, -10])
face_node_connectivity = np.array([[0, 1, 2, -1], [0, 1, 3, 4]])

uxgrid = ux.Grid.from_topology(
    node_lon=node_lon,
    node_lat=node_lat,
    face_node_connectivity=face_node_connectivity,
    fill_value=-1,
)

uxgrid.construct_face_centers()

@philipc2
Member

Consider this implementation with Numba.

import numpy as np
from numba import njit, prange

# _normalize_xyz_scalar is uxarray's existing helper that normalizes a
# single (x, y, z) point onto the unit sphere.

@njit(cache=True, parallel=True)
def _construct_face_centroids(node_x, node_y, node_z, face_nodes, n_nodes_per_face):
    centroid_x = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_y = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_z = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    n_face = n_nodes_per_face.shape[0]

    for i_face in prange(n_face):
        n_max_nodes = n_nodes_per_face[i_face]

        x = np.mean(node_x[face_nodes[i_face, 0:n_max_nodes]])
        y = np.mean(node_y[face_nodes[i_face, 0:n_max_nodes]])
        z = np.mean(node_z[face_nodes[i_face, 0:n_max_nodes]])

        x, y, z = _normalize_xyz_scalar(x, y, z)

        centroid_x[i_face] = x
        centroid_y[i_face] = y
        centroid_z[i_face] = z
    return centroid_x, centroid_y, centroid_z

Without partitioning our arrays, it's really difficult to find an elegant vectorized solution directly using NumPy.

@rajeeja
Contributor

rajeeja commented Dec 23, 2024

Consider this implementation with Numba. [...] Without partitioning our arrays, it's really difficult to find an elegant vectorized solution directly using NumPy.

Yes, it was a sticky one; I left optimizing this for a later PR.

@rajeeja
Contributor

rajeeja commented Jan 10, 2025

This also might use a bit more memory due to the creation of the mask array, but that should be fine even for larger grids, and the speedup should be pretty good considering it uses vectorized operations and broadcasting. It seems to work, and I'm okay with approving this.
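The mask's extra footprint is straightforward to estimate: one byte per connectivity slot for a boolean array. The grid sizes below are hypothetical round numbers for illustration, not measurements from the SCREAM file:

```python
import numpy as np

# Hypothetical quad-grid dimensions, chosen only to illustrate the arithmetic
n_face, n_max_nodes = 21_000_000, 4
mask_bytes = n_face * n_max_nodes * np.dtype(bool).itemsize  # 1 byte per slot
print(f"mask overhead: {mask_bytes / 1e6:.0f} MB")           # prints "mask overhead: 84 MB"
```

For quad-dominated grids this is small next to the float64 coordinate arrays themselves, which is consistent with the view that the overhead is acceptable.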

@philipc2
Member

@erogluorhan

Let me know if you have any thoughts about the Numba implementation that I shared.

@rajeeja
Contributor

rajeeja commented Jan 11, 2025

@erogluorhan

Let me know if you have any thoughts about the Numba implementation that I shared.

I'm fine with the Numba suggestion as well; we can modify this PR to use that. We use Numba all over the project.

Successfully merging this pull request may close: Optimize Face Centroid Calculations (#1116).