
Vectorize coordinates._construct_face_centroids() #1117

Open · wants to merge 3 commits into base: main
Conversation

erogluorhan
Member

@erogluorhan commented Dec 21, 2024

Closes #1116.

Overview

Optimizes coordinates.py/_construct_face_centroids() by replacing the for-loop with vectorized operations using a masking method, following @hongyuchen1030's older suggestion.

Further optimization of the code would be possible, but that can be handled in future PRs.
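The masked-sum idea in this PR can be sketched as follows. This is a minimal, hypothetical sketch rather than the PR's exact code; in particular, the `np.where` clamp that keeps fill values in bounds and the final renormalization are assumptions added here for self-containment:

```python
import numpy as np

def construct_face_centroids_masked(node_x, node_y, node_z,
                                    face_nodes, n_nodes_per_face,
                                    fill_value=-1):
    # Boolean mask: True where a connectivity slot holds a real node index
    mask = face_nodes != fill_value
    # Clamp fill slots to index 0 so fancy indexing stays in bounds;
    # the mask zeroes their contribution out of the sums below
    safe_nodes = np.where(mask, face_nodes, 0)

    centroid_x = np.sum(node_x[safe_nodes] * mask, axis=1) / n_nodes_per_face
    centroid_y = np.sum(node_y[safe_nodes] * mask, axis=1) / n_nodes_per_face
    centroid_z = np.sum(node_z[safe_nodes] * mask, axis=1) / n_nodes_per_face

    # Project the averaged points back onto the unit sphere
    norm = np.sqrt(centroid_x**2 + centroid_y**2 + centroid_z**2)
    return centroid_x / norm, centroid_y / norm, centroid_z / norm
```

Every face is handled in one set of array operations, so the Python-level loop over faces disappears entirely.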

FYI @philipc2: The masking method this PR uses might be useful for cases we previously dealt with via partitioning, e.g. the nodal averaging problem we looked into last year.

FYI @rajeeja: This is where the Cartesian-based centroid calculations are being fixed. Welzl is a bit different, but you may still want to look into it.

Results

#1116 shows the profiling results for the existing code on a 4 GB SCREAM dataset, where the execution time is around 5 minutes.

The optimized code runs in around 6 seconds; see below:

[Screenshot: profiling output for the optimized code, Dec 21, 2024]

PR Checklist

General

  • An issue is created and linked
  • Add appropriate labels
  • Filled out Overview and Expected Usage (if applicable) sections

Testing

  • Tests cover all possible logical paths in your function

@rajeeja
Contributor

rajeeja commented Dec 21, 2024

Good fix; I will test Welzl and work out an appropriate fix there.

@philipc2 added the run-benchmark (Run ASV benchmark workflow) label Dec 21, 2024
@philipc2 added this to the Scalability & Performance milestone Dec 21, 2024

github-actions bot commented Dec 21, 2024

ASV Benchmarking

Benchmark Comparison Results

Benchmarks that have improved:

Change Before [0e45eda] After [832aae7] Ratio Benchmark (Parameter)
- 452M 408M 0.9 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc'))
- 640M 398M 0.62 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc'))
- 488M 390M 0.8 mpas_ocean.Integrate.peakmem_integrate('480km')

Benchmarks that have stayed the same:

Change Before [0e45eda] After [832aae7] Ratio Benchmark (Parameter)
404M 404M 1.00 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc'))
439M 443M 1.01 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc'))
19.3±0.1ms 19.3±0.1ms 1.00 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc'))
7.84±0.1ms 7.65±0.1ms 0.97 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc'))
44.5±0.3ms 44.3±0.2ms 0.99 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc'))
4.10±0.09ms 4.08±0.1ms 0.99 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc'))
3.18±0.04s 3.21±0.04s 1.01 import.Imports.timeraw_import_uxarray
691±6μs 661±5μs 0.96 mpas_ocean.CheckNorm.time_check_norm('120km')
446±3μs 451±3μs 1.01 mpas_ocean.CheckNorm.time_check_norm('480km')
686±20ms 698±3ms 1.02 mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('120km')
43.4±0.8ms 43.7±0.5ms 1.01 mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('480km')
1.86±0.02ms 1.90±0.03ms 1.02 mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('120km')
702±80μs 603±20μs ~0.86 mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('480km')
6.28±0.04ms 6.25±0.09ms 0.99 mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('120km')
3.70±0.05ms 3.70±0.06ms 1.00 mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('480km')
3.58±0.02s 3.62±0.02s 1.01 mpas_ocean.ConstructFaceLatLon.time_welzl('120km')
230±0.7ms 230±1ms 1.00 mpas_ocean.ConstructFaceLatLon.time_welzl('480km')
1.20±0.01μs 1.31±0μs 1.09 mpas_ocean.ConstructTreeStructures.time_ball_tree('120km')
297±1ns 313±3ns 1.05 mpas_ocean.ConstructTreeStructures.time_ball_tree('480km')
807±6ns 815±2ns 1.01 mpas_ocean.ConstructTreeStructures.time_kd_tree('120km')
284±3ns 300±1ns 1.06 mpas_ocean.ConstructTreeStructures.time_kd_tree('480km')
442±9ms 460±7ms 1.04 mpas_ocean.CrossSection.time_const_lat('120km', 1)
227±1ms 236±4ms 1.04 mpas_ocean.CrossSection.time_const_lat('120km', 2)
118±2ms 122±1ms 1.03 mpas_ocean.CrossSection.time_const_lat('120km', 4)
366±3ms 375±5ms 1.02 mpas_ocean.CrossSection.time_const_lat('480km', 1)
181±3ms 186±3ms 1.03 mpas_ocean.CrossSection.time_const_lat('480km', 2)
94.2±0.7ms 95.8±2ms 1.02 mpas_ocean.CrossSection.time_const_lat('480km', 4)
127±2ms 127±0.5ms 1.00 mpas_ocean.DualMesh.time_dual_mesh_construction('120km')
9.21±0.3ms 9.12±0.5ms 0.99 mpas_ocean.DualMesh.time_dual_mesh_construction('480km')
1.05±0.01s 1.09±0.01s 1.03 mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', False)
53.7±1ms 55.1±0.7ms 1.03 mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', True)
85.2±0.6ms 87.5±1ms 1.03 mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', False)
5.66±0.2ms 5.89±0.1ms 1.04 mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', True)
345M 343M 0.99 mpas_ocean.Gradient.peakmem_gradient('120km')
321M 320M 1.00 mpas_ocean.Gradient.peakmem_gradient('480km')
2.85±0.04ms 2.82±0.03ms 0.99 mpas_ocean.Gradient.time_gradient('120km')
319±8μs 321±5μs 1.01 mpas_ocean.Gradient.time_gradient('480km')
230±10μs 247±3μs 1.08 mpas_ocean.HoleEdgeIndices.time_construct_hole_edge_indices('120km')
116±0.7μs 119±0.6μs 1.03 mpas_ocean.HoleEdgeIndices.time_construct_hole_edge_indices('480km')
406M 406M 1.00 mpas_ocean.Integrate.peakmem_integrate('120km')
178±2ms 177±0.7ms 0.99 mpas_ocean.Integrate.time_integrate('120km')
12.1±0.06ms 12.0±0.03ms 0.99 mpas_ocean.Integrate.time_integrate('480km')
359±3ms 368±2ms 1.03 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'exclude')
357±0.9ms 365±2ms 1.02 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'include')
370±10ms 366±2ms 0.99 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'split')
24.7±0.2ms 24.1±0.6ms 0.98 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'exclude')
24.6±0.2ms 24.5±0.2ms 1.00 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'include')
24.3±0.3ms 24.2±0.5ms 1.00 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'split')
56.3±0.3ms 56.4±0.1ms 1.00 mpas_ocean.RemapDownsample.time_inverse_distance_weighted_remapping
45.8±0.1ms 46.2±0.2ms 1.01 mpas_ocean.RemapDownsample.time_nearest_neighbor_remapping
360±1ms 361±0.9ms 1.00 mpas_ocean.RemapUpsample.time_inverse_distance_weighted_remapping
266±0.8ms 266±0.4ms 1.00 mpas_ocean.RemapUpsample.time_nearest_neighbor_remapping
318M 316M 0.99 quad_hexagon.QuadHexagon.peakmem_open_dataset
316M 316M 1.00 quad_hexagon.QuadHexagon.peakmem_open_grid
7.09±0.2ms 7.03±0.3ms 0.99 quad_hexagon.QuadHexagon.time_open_dataset
5.78±0.1ms 5.81±0.07ms 1.00 quad_hexagon.QuadHexagon.time_open_grid

@philipc2 removed the run-benchmark (Run ASV benchmark workflow) label Dec 21, 2024
@philipc2
Member

Nice work @erogluorhan

It appears that a few ASV benchmarks are failing (which may or may not be due to the changes in this PR). I can look deeper into it on Monday.

@erogluorhan
Member Author

That all sounds good; thanks, both @rajeeja and @philipc2!

@erogluorhan
Member Author

I can't see any failing benchmarks though! @philipc2

@philipc2
Member

I can't see any failing benchmarks though! @philipc2

If you take a look at the GitHub bot comment above, it shows which ones failed. If they failed both before and after, it means something was wrong before this PR.

Looks like this PR didn't break anything; it must have been an issue with how the benchmark was written before.

@erogluorhan
Member Author

Ok, I see what you mean now, @philipc2 !

Also, regarding benchmarking, I'd expect mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('120km') to show the difference this PR makes with the optimization, but it does not. I think the dataset is pretty small for that, which makes me think big-data benchmarking would be important in the future.

@philipc2 added the run-benchmark (Run ASV benchmark workflow) label Dec 23, 2024
Comment on lines +335 to +337
centroid_x = np.sum(node_x[face_nodes] * mask, axis=1) / n_nodes_per_face
centroid_y = np.sum(node_y[face_nodes] * mask, axis=1) / n_nodes_per_face
centroid_z = np.sum(node_z[face_nodes] * mask, axis=1) / n_nodes_per_face
Member

This is problematic for grids that have variable element sizes.

face_node_connectivity will contain fill values, which will lead to an index error here.

IndexError: index -9223372036854775808 is out of bounds for axis 0 with size 83886080

This wasn't an issue for the SCREAM dataset because it is completely composed of quads.
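The failure mode can be reproduced in isolation; the snippet below is a hypothetical minimal demonstration (not from the PR), including one possible clamp-and-mask fix:

```python
import numpy as np

node_x = np.array([0.0, 1.0, 2.0])
# A triangular face padded to 4 slots, using the fill value from the
# traceback above (the minimum int64)
fill = np.iinfo(np.int64).min
face_nodes = np.array([[0, 1, 2, fill]], dtype=np.int64)
mask = face_nodes != fill

try:
    # Direct fancy indexing hits the fill value and raises IndexError
    np.sum(node_x[face_nodes] * mask, axis=1)
except IndexError as e:
    print("IndexError:", e)

# One possible fix: clamp fills to a valid dummy index, then mask them out
safe = np.where(mask, face_nodes, 0)
sums = np.sum(node_x[safe] * mask, axis=1)   # sums == array([3.])
print(sums)
```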

Contributor

How about fixing this and changing/standardizing the way we handle connectivity?

Member

changing/standardizing the way we handle connectivity

Can you elaborate? I've explored storing connectivity as partitioned arrays in #978, although for something as simple as computing centroids, Numba seems to offer a significantly simpler implementation, performing close to compiled-language speed without needing to create an additional mask.

Contributor

My understanding is that with fill values like -1 the mask approach fails; isn't that correct? So I was thinking that adopting a convention that allows masking to work would simplify things.

I'm not sure about the performance: NumPy's vectorized operations are highly optimized for numerical computations and might outperform the explicit loop here, even when the loops are parallelized with Numba.

@philipc2
Member

Here is a minimal example that fails. Feel free to create a test case from this:

import numpy as np
import uxarray as ux

node_lon = np.array([-20.0, 0.0, 20.0, -20, -40])
node_lat = np.array([-10.0, 10.0, -10.0, 10, -10])
face_node_connectivity = np.array([[0, 1, 2, -1], [0, 1, 3, 4]])

uxgrid = ux.Grid.from_topology(
    node_lon=node_lon,
    node_lat=node_lat,
    face_node_connectivity=face_node_connectivity,
    fill_value=-1,
)

uxgrid.construct_face_centers()

@philipc2
Member

Consider this implementation with Numba.

import numpy as np
from numba import njit, prange

# _normalize_xyz_scalar is uxarray's existing helper that normalizes a
# single (x, y, z) point onto the unit sphere.

@njit(cache=True, parallel=True)
def _construct_face_centroids(node_x, node_y, node_z, face_nodes, n_nodes_per_face):
    centroid_x = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_y = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_z = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    n_face = n_nodes_per_face.shape[0]

    for i_face in prange(n_face):
        n_max_nodes = n_nodes_per_face[i_face]

        x = np.mean(node_x[face_nodes[i_face, 0:n_max_nodes]])
        y = np.mean(node_y[face_nodes[i_face, 0:n_max_nodes]])
        z = np.mean(node_z[face_nodes[i_face, 0:n_max_nodes]])

        x, y, z = _normalize_xyz_scalar(x, y, z)

        centroid_x[i_face] = x
        centroid_y[i_face] = y
        centroid_z[i_face] = z
    return centroid_x, centroid_y, centroid_z

Without partitioning our arrays, it's really difficult to find an elegant vectorized solution directly using NumPy.

@rajeeja
Contributor

rajeeja commented Dec 23, 2024

Consider this implementation with Numba. [...] Without partitioning our arrays, it's really difficult to find an elegant vectorized solution directly using NumPy.

Yes, it was a sticky one; I left optimizing this for a later PR.

@rajeeja
Contributor

rajeeja commented Jan 10, 2025

This also might use a bit more memory due to the creation of the mask array, but that should be fine even for larger grids, and the speedup should be pretty good considering it uses vectorized operations and broadcasting. It seems to work, and I'm okay with approving this.
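The mask's extra footprint is straightforward to estimate: one byte per connectivity slot for a boolean array. The grid sizes below are hypothetical round numbers for illustration, not measurements from the SCREAM file:

```python
import numpy as np

# Hypothetical quad-grid dimensions, chosen only to illustrate the arithmetic
n_face, n_max_nodes = 21_000_000, 4
mask_bytes = n_face * n_max_nodes * np.dtype(bool).itemsize  # 1 byte per slot
print(f"mask overhead: {mask_bytes / 1e6:.0f} MB")           # prints "mask overhead: 84 MB"
```

For quad-dominated grids this is small next to the float64 coordinate arrays themselves, which is consistent with the view that the overhead is acceptable.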

@philipc2
Member

@erogluorhan

Let me know if you have any thoughts about the Numba implementation that I shared.

@rajeeja
Contributor

rajeeja commented Jan 11, 2025

@erogluorhan

Let me know if you have any thoughts about the Numba implementation that I shared.

I'm fine with the Numba suggestion as well; we can modify this PR to use that. We use Numba all over the project.

Successfully merging this pull request may close: Optimize Face Centroid Calculations (#1116).