[POC] : subdivision by bloc size #601

Open
adebardo opened this issue Oct 14, 2024 · 6 comments
Labels
[POC] Conception To review Tickets needs approval about it conception

Comments


adebardo commented Oct 14, 2024

Context

The CNES QI team is interested in the possibility of performing block-based coregistration, which is why studies are currently being conducted to test this functionality, referred to as blockwise. At the moment, users must specify the number of subdivisions they wish to apply. However, the QI team would prefer to specify block sizes instead. The goal of this ticket is to propose an implementation of this capability.

This ticket will also start handling the size of the overlap between tiles: here it will simply take a default value, and a follow-up ticket will be opened later to manage this overlap more thoroughly.

Proposed Code

Currently, subdivision is handled by the subdivide_array function, so we propose managing block sizes at this level.

  1. Extend Subdivision:
    Currently, users input a number of subdivisions. Instead, we could allow them to input a dictionary or list containing two integers for rows and columns, and then calculate the number of subdivisions from these values.
    For example, for an image of size 985x1332:

    • Scenario 1 (current behaviour): subdivision of 64, i.e. 8 rows and 8 columns of blocks of roughly 123 rows by 166 columns each.
    • Scenario 2 (proposed): block size of 100x100, giving 10 row blocks and 14 column blocks, i.e. 10*14 = 140 blocks.
  2. Update the Following:

    def subdivide_array(self, shape: tuple[int, ...]) -> NDArrayf:
        """
        Return the grid subdivision for a given DEM shape.

        :param shape: The shape of the input DEM.
        :returns: An array of shape 'shape' with 'self.subdivision' unique indices.
        """
        if len(shape) == 3 and shape[0] == 1:  # Account for (1, row, col) shapes
            shape = (shape[1], shape[2])

        if isinstance(self.subdivision, list):
            # A list is interpreted as a block size (rows, cols): derive the number of blocks.
            nb_bloc_row = np.ceil(shape[0] / self.subdivision[0])
            nb_bloc_col = np.ceil(shape[1] / self.subdivision[1])

            self.subdivision = int(nb_bloc_row * nb_bloc_col)

        return subdivide_array(shape, count=self.subdivision)
    • Update the associated docstring.

    • Copy in a function that allows blocks to overlap:

import numpy as np


def patchify(arr, nblocks, overlap):
    """Manually split an array into overlapping blocks for parallel computing."""
    overlap = int(np.floor(overlap))
    patches = []
    nx, ny = np.shape(arr)

    nx_sub = nx // nblocks[0]
    ny_sub = ny // nblocks[1]

    # The "+ 1" adds an extra row/column of blocks to capture the remainder when
    # the shape is not an exact multiple of the block size.
    split = [[nx_sub * i, min(nx_sub * (i + 1), nx), ny_sub * j, min(ny_sub * (j + 1), ny)]
             for i in range(nblocks[0] + 1) for j in range(nblocks[1] + 1)]
    # Same extents, expanded by the overlap and clipped to the array bounds.
    over = [[max(0, s[0] - overlap), min(nx, s[1] + overlap), max(0, s[2] - overlap), min(s[3] + overlap, ny)]
            for s in split]
    inv = []
    for k in range(len(split)):
        x0, x1, y0, y1 = split[k]
        i0, i1, j0, j1 = over[k]
        patches.append(arr[i0:i1, j0:j1])
        # Indices of the non-overlapping core within each overlapping patch.
        inv.append([x0 - i0, x1 - i0, y0 - j0, y1 - j0])

    return patches, inv, split
  • Replace the groups variable here by calling the patchify function (a usage sketch follows below).
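
As an illustration, a minimal usage sketch of the two pieces above, assuming patchify is defined as in this ticket and using a default overlap of 16 pixels (names and values are only illustrative):

    import numpy as np

    # A dummy DEM with the shape used in the scenarios above.
    dem = np.random.default_rng(0).random((985, 1332))

    # The user provides a block size (rows, columns) instead of a number of subdivisions.
    block_size = [100, 100]
    nb_bloc_row = int(np.ceil(dem.shape[0] / block_size[0]))  # 10
    nb_bloc_col = int(np.ceil(dem.shape[1] / block_size[1]))  # 14
    subdivision = nb_bloc_row * nb_bloc_col                   # 140 blocks

    # Split with the default overlap; patchify also returns the indices needed to
    # stitch the non-overlapping cores back together.
    patches, inv, split = patchify(dem, (nb_bloc_row, nb_bloc_col), overlap=16)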

Tests

  • A test should be added to this test with a parameter containing a list (or a dict) of tile sizes for rows and columns.
  • Unit tests should be implemented for the patchify function (see the sketch below).
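
A minimal pytest sketch for the patchify unit tests, assuming patchify is importable from wherever it ends up (the parametrized values are only examples):

    import numpy as np
    import pytest

    # ``patchify`` is assumed to be importable from the module where it is added.


    @pytest.mark.parametrize(
        "shape, nblocks, overlap",
        [((985, 1332), (10, 14), 16), ((64, 64), (4, 4), 0)],
    )
    def test_patchify(shape, nblocks, overlap):
        arr = np.arange(np.prod(shape)).reshape(shape)
        patches, inv, split = patchify(arr, nblocks, overlap)

        assert len(patches) == len(inv) == len(split)
        for patch, (r0, r1, c0, c1), (x0, x1, y0, y1) in zip(patches, inv, split):
            # The non-overlapping core of each patch must match the original array.
            np.testing.assert_array_equal(patch[r0:r1, c0:c1], arr[x0:x1, y0:y1])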

Documentation

  • It would be helpful to add a comment in the plot_blockwise_coreg file to inform users that an alternative solution is available.

/ estimate 5d

@adebardo adebardo added the [POC] Conception To review Tickets needs approval about it conception label Oct 14, 2024
adebardo (Author) commented:

@adehecq, @rhugonnet
Currently, internal work is being done to test the blockwise method (and its current bug), but we would like to add this specific functionality. What do you think?
@duboise-cnes :)

rhugonnet (Member) commented Oct 15, 2024

This is a great idea! Being only able to specify a "number of blocks" is definitely a limitation. The option of specifying a "number of blocks" could still be retained just for that specific use case (by immediately translating it into a size).

For the subdivision of the array: To avoid edge effects during transformations of the chunks, it is usually necessary to use an overlap. I have some old code that does this here (which does not have proper tests written, but was used a lot and checked qualitatively), if it can be useful: https://github.com/iamdonovan/pyddem/blob/main/pyddem/fit_tools.py#L1390 (the splitter/stitcher are for subdividing without overlap, and patchify/unpatchify with overlap; all work on a 3D data cube, but should also be applicable to 2D).

The problem here is slightly different from a classic split/stitch, as we don't know in advance the horizontal shift that the coregistration will find for each chunk. So we'd need some kind of splitter (for the fit()) followed by a patchify/unpatchify (for the apply()) that adjusts the size of the overlap based on the transformation, to avoid introducing voids.
(This could be a way of running apply() that would solve the current bug)
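
A rough sketch of what adjusting the overlap from the estimated transformations could look like, assuming each chunk's fit() yields a 4x4 affine matrix with the horizontal shift in its translation column (the names and the fixed margin are only illustrative):

    import numpy as np

    def required_overlap(matrices, res, base_overlap=16):
        """
        Pixel overlap needed so that the per-chunk shifts do not introduce voids.

        :param matrices: 4x4 affine matrices estimated per chunk (x/y shifts at [0, 3] and [1, 3]).
        :param res: Ground sampling distance of the DEM, in the same unit as the shifts.
        :param base_overlap: Minimum overlap kept regardless of the estimated shifts.
        """
        max_shift = max(max(abs(m[0, 3]), abs(m[1, 3])) for m in matrices)
        return int(np.ceil(max_shift / res)) + base_overlap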

@erikmannerfelt could have more ideas/insights as he implemented the original functionality 🙂.

rhugonnet (Member) commented Oct 31, 2024

After more thinking, I'm wondering if directly using https://docs.dask.org/en/latest/generated/dask.array.overlap.map_overlap.html would make more sense, even if for now our input arrays are not chunked in-memory.

It would:

  • Make the functionality natively work out-of-memory on Dask arrays (probably with some small tweaks) the moment we have a dem Xarray accessor,
  • Work the same on the loaded NumPy arrays that we currently have,
  • Remove the requirement of having to write unit tests for our custom functions patchify/unpatchify (as map_overlap is already extensively tested in Dask),
  • (Potentially) Allow us to re-use similar code for other co-registration steps that require Dask mapping with overlap (Dask for coregistration #525).

The only downside is that Dask is more complex. For instance, we likely won't be able to use map_overlap directly, because the fit() function is quite a specific operation and spits out metadata instead of another array. We'll probably have to map the chunks with overlap in a custom delayed function. But we should be able to re-use existing code for that; for instance, we managed delayed chunking with overlap and an atypical output in interp_points here (which is fully tested): https://github.com/GlacioHack/geoutils/blob/e7538d1cd9bec79508fe1452991cbef8c5078bb9/geoutils/raster/delayed.py#L348. And several of us have accumulated enough experience with it to help!
Could be worth it, what do you think? 😄
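
For illustration, a minimal sketch of map_overlap applied to an in-memory array, with a dummy per-chunk operation standing in for the coregistration step (the chunk size, depth and operation are placeholders):

    import dask.array as da
    import numpy as np

    def process_block(block: np.ndarray) -> np.ndarray:
        # Dummy shape-preserving operation applied to each chunk, including its overlap.
        return block - np.nanmean(block)

    dem = np.random.default_rng(42).random((985, 1332))

    # Chunk the loaded array by block size, then map the operation with a 16-pixel overlap.
    darr = da.from_array(dem, chunks=(100, 100))
    result = da.map_overlap(process_block, darr, depth=16, boundary="none").compute()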

duboise-cnes (Member) commented:

After a quick read, I am not sure I completely understand everything. Maybe proceed in two phases:

  • first, resolve the bug and implement subdivision by block size simply
  • secondly, discuss the Dask integration further, maybe in another issue?

I have discussed this with @guillaumeeb from CNES, who has more knowledge of Dask usage than me. Do you have some insights on the Dask impact proposed by @rhugonnet? A good way to be compatible with Dask would benefit the project, I agree.

rhugonnet (Member) commented:

An additional point I just thought of: if resolving the bug involves removing the warp_dem function, that's for the best. It is likely responsible for the artefacts seen in #584, and it would allow us to remove scikit-image as a dependency altogether (I opened #654).

Agree on the two phases.
My main point was that we maybe shouldn't spend too much time implementing and testing the custom patchify function exemplified in the description above, when we can use the one existing in Dask (independently of actually having Dask support in terms of memory usage/parallelization for now).
So in that sense, Dask would also be slightly part of the first phase, but simply as a replacement for the patchify function (map_overlap also runs on normal in-memory arrays).
Then, in a second phase, we could refine and test out-of-memory behaviour, parallelization, and other Dask parameters (for which we would already have the right code structure).

guillaumeeb commented:

Hi everyone, sorry for the delay in response.

I totally concur with @rhugonnet's analysis! When I first read the original post of this issue, I immediately thought of Dask and map_overlap as a solution. Complying with Dask interfaces would enable you to do blockwise operations on bigger-than-memory inputs, as well as distributed and parallel processing. And as said, Dask already implements part of your needs!

The important point, even if doing this in two phases, would be to already treat Dask as a design goal. For example, subdivisions are expressed as chunk shapes (nbrows, nbcols) in Dask, but I think this is what you are aiming at.

One last warning about map_overlap: it can make the Dask graph quite complex in big cases. There might be optimizations to look for on the Dask side, but let's see that afterwards.
