FLAIR#2 Dataset and Datamodule Integration #2394
base: main
Conversation
…mg and msk) Updates in the custom raster dataset tutorial and the actual file documentation. The previous recommended approach (overriding `__getitem__`) is outdated. Refs: microsoft#2292 (reply in thread)
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
… type individually
Not fully functioning yet, contains copy paste from other datasets
Additionally, some small code refactors are done
…d refine plotting
Using the entire sentinel-2 image and a matplotlib patch to debug, otherwise it is really hard to find correct spot due to low resolution
…y()` for sentinel With the nested dict, it was not possible to download dynamically
md5s might change due to timestamps, this eases the process of changing md5
Finally got it debugged, and it was a system error. Download, extract, and training on the aerial data all seem to work as they should. Sorry about the noise! Completely unrelated to this PR, but it was a corruption in one of the zpools in ZFS. So if you start using that at some point, and you get weird errors that look like disk errors, but checking the disk says everything is fine, then try checking the status on your zpools. Spent too much time identifying this 😑
Can you resolve the merge conflicts so we can run CI on this PR?
…l and minor bug fixes
…ing inconsistencies in original dataset Naming inconsistencies: `flair-2_centroids_sp_to_patch.json` vs `flair_2_centroids_sp_to_patch.zip`
I think everything should be on track so far. I am getting a ruff error for unsorted tuples in the
If you use ruff 0.8.0 it will sort
torchgeo/datasets/flair2.py
Outdated
"""Get statistics (min, max, means, stdvs) for each used band in order.

Args:
    split (str): Split for which to get statistics (currently only for train)
For the docstring notion in torchgeo, we do not include the type in the docstring again, only function arguments.
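The convention described here (the annotation carries the type; the docstring repeats only the argument name) can be sketched as follows. The function name and placeholder body are illustrative, not code from this PR:

```python
def get_statistics(split: str = 'train') -> dict[str, list[float]]:
    """Get statistics (min, max, means, stdvs) for each used band in order.

    Args:
        split: split for which to get statistics (currently only 'train')
    """
    # Placeholder body; only the docstring shape matters here
    return {}
```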
I think this will also resolve the failing docs test
torchgeo/datasets/flair2.py
Outdated
    return tensor

def _load_sentinel(self, path: Path) -> Tensor:
    # FIXME: should this really be returned as a tuple?
This can be removed, right?
    self.root,
    md5=self.md5s.get(url, None) if self.checksum else None,
)
# FIXME: Why is download_url not checking integrity (tests run through)?
Is this fixed?
This was still an issue last time I checked. As mentioned in the PR description, somehow, when I run the pytests with wrong md5 hashes, the integrity is not checked unless I explicitly call it here again (it is implicitly called in `download_url`).
    self.root,
    md5=self.md5s.get(url, None) if self.checksum else None,
)
# FIXME: Why is download_url not checking integrity (tests run through)?
how about this FIXME?
I left this here as somewhat of a reminder, because it has the same behavior as the parent class. I.e. currently all tests pass even with wrong md5s.
    """
    super().__init__(FLAIR2, batch_size, num_workers, **kwargs)

    self.patch_size = _to_tuple(patch_size)
I think that's a good idea. Could either be included here, or in a separate PR.
tests/data/flair2/FLAIR2/md5s.txt
Outdated
@@ -0,0 +1,8 @@
/home/mathias/Dev/forks/torchgeo/tests/data/flair2/FLAIR2/flair_2_labels_test.zip: b13c4a3cb7ebb5cadddc36474bb386f9
maybe remove the personal directory from this text file.
torchgeo/datasets/flair2.py
Outdated
rgb_indices = [self.all_bands.index(band) for band in self.rgb_bands]
# Check if RGB bands are present in self.bands
if not all([band in self.bands for band in self.rgb_bands]):
The code coverage is indicating that the RGB Band Missing case is not being hit, so I think you just need to add a separate plot test similar to
torchgeo/tests/datasets/test_eurosat.py
Line 110 in 2f3e8fd
def test_plot_rgb(self, dataset: EuroSAT, tmp_path: Path) -> None:
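The branch that coverage flags can be exercised with the pattern below, reduced to a self-contained sketch. `RGBBandsMissingError` mirrors the torchgeo exception of that name; `TinyDataset` is a made-up stand-in, not the real FLAIR2 class:

```python
class RGBBandsMissingError(ValueError):
    """Raised when plot() needs RGB bands the dataset was not configured with."""


class TinyDataset:
    """Stand-in dataset illustrating the uncovered RGB-missing branch."""

    all_bands = ('B01', 'B02', 'B03', 'B04', 'B05')
    rgb_bands = ('B01', 'B02', 'B03')

    def __init__(self, bands=all_bands):
        self.bands = bands

    def plot(self, sample):
        # The branch under discussion: fail early if any RGB band is absent
        if not all(band in self.bands for band in self.rgb_bands):
            raise RGBBandsMissingError('Dataset does not contain some of the RGB bands')
        return sample
```

In the actual test suite this would be a `pytest.raises(RGBBandsMissingError)` test constructed with a non-RGB `bands` tuple, as in the EuroSAT example referenced above.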
torchgeo/datamodules/flair2.py
Outdated
    K.Normalize(mean=self.mean, std=self.std), data_keys=['image', 'mask']
)

self.augs = augs if augs is not None else self.aug
Is `self.augs` intended to act on the data somewhere? My immediate thought was that it would be related to the augmentations part of the base datamodule:
torchgeo/torchgeo/datamodules/geo.py
Lines 70 to 79 in fedf993
# Data augmentation
Transform = Callable[[dict[str, Tensor]], dict[str, Tensor]]
self.aug: Transform = K.AugmentationSequential(
    K.Normalize(mean=self.mean, std=self.std), data_keys=None, keepdim=True
)
self.train_aug: Transform | None = None
self.val_aug: Transform | None = None
self.test_aug: Transform | None = None
self.predict_aug: Transform | None = None
`self.aug` is applied here if the split-specific augmentations are not specified
That was also my understanding, and perhaps it's a typo where `self.augs` was intended to be `self.aug`, such that the augmentations would be applied automatically through the base datamodule. However, with the current implementation, if the user provides augmentations through the `augs` parameter in the FLAIR2 datamodule, they won't have an effect, as they are being added to `self.augs`, which doesn't seem to be applied to the data (as far as I can tell). I.e., maybe the intention was `self.aug = augs if augs is not None else self.aug` (?)
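The proposed one-line fix can be illustrated with a stripped-down stand-in for the datamodule. `FakeDataModule` and its hook are hypothetical; the only real point is that the base class consults `self.aug`, never `self.augs`:

```python
class FakeDataModule:
    """Minimal stand-in: the (imagined) base class only ever applies self.aug."""

    def __init__(self, augs=None):
        self.aug = 'default-normalize'  # what the base class would set up
        # Proposed fix: overwrite self.aug so user-supplied augs take effect
        self.aug = augs if augs is not None else self.aug

    def on_after_batch_transfer(self, batch):
        # The base datamodule applies self.aug here
        return (self.aug, batch)
```

With `self.augs = ...` instead, the user's augmentations would sit in an attribute nothing reads, which matches the behavior described above.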
torchgeo/datasets/flair2.py
Outdated
def _load_image(self, path: Path) -> Tensor:
    """Load a single image.

    Args:
        path: path to the image

    Returns:
        Tensor: the loaded image
    """
    with rasterio.open(path) as f:
        array: np.typing.NDArray[np.int_] = f.read()
        tensor = torch.from_numpy(array).float()
        if 'B05' in self.bands:
            # Height channel will always be the last dimension
            tensor[-1] = torch.div(tensor[-1], 5)

        return tensor
Should the bands perhaps be extracted here based on `self.bands`? E.g., something like
def _load_image(self, path: Path) -> Tensor:
"""Load a single image.
Args:
path: path to the image
Returns:
Tensor: the loaded image
"""
with rasterio.open(path) as f:
array: np.typing.NDArray[np.int_] = f.read()
tensor = torch.from_numpy(array).float()
if 'B05' in self.bands:
# Height channel will always be the last dimension
tensor[-1] = torch.div(tensor[-1], 5)
# Extract the bands to be used. E.g., self.bands=("B01", "B02", "B03") will extract the RGB bands.
tensor = tensor[[int(band[-2:]) - 1 for band in self.bands]]
return tensor
Then, when a user has defined n bands in `self.bands`, the returned sample will only contain those bands, instead of all five. Perhaps `self.bands` should also be renamed to `self.aerial_bands`, and a `self.sentinel_bands` should be added(?)
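The indexing trick in the suggestion maps band names like `'B03'` to zero-based channel positions. In isolation (pure Python, no torch; band names are assumed to end in a two-digit number, as in the snippet above):

```python
def band_indices(bands: tuple[str, ...]) -> list[int]:
    """Map band names ('B01', ..., 'B05') to zero-based channel positions."""
    return [int(band[-2:]) - 1 for band in bands]
```

With torch, `tensor[band_indices(self.bands)]` would then select just those channels along the first (channel) dimension.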
… bands when indexing, include RGBMissing testcase refs: microsoft#2394 (comment), microsoft#2394 (comment)
Thanks everyone for reviewing. Tried to apply all suggested changes. As for the checks, according to the log we face the following error:
Error: reference before assignment
Does anyone have general opinions on whether we should call this:
FLAIR2()
or:
FLAIR(version=2)
@@ -18,6 +18,7 @@ Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands
`FAIR1M`_,OD,Gaofen/Google Earth,"CC-BY-NC-SA-3.0","15,000",37,"1,024x1,024",0.3--0.8,RGB
`Fields Of The World`_,"S,I",Sentinel-2,"Various","70795","2,3",256x256,10,MSI
`FireRisk`_,C,NAIP Aerial,"CC-BY-NC-4.0","91,872",7,"320x320",1,RGB
`FLAIR2`_,S,"IGN, Sentinel-2",OPENLICENSE-2.0,7741,13--18,512x512,0.2--20,"RGB+NIR+NDSM, MSI",
- `FLAIR2`_,S,"IGN, Sentinel-2",OPENLICENSE-2.0,7741,13--18,512x512,0.2--20,"RGB+NIR+NDSM, MSI",
+ `FLAIR2`_,S,"IGN, Sentinel-2",OPENLICENSE-2.0,7741,13--18,512x512,0.2--20,"RGB+NIR+NDSM, MSI"
This should fix the doc tests
The correct SPDX license identifier is actually `etalab-2.0`, not `OPENLICENSE-2.0`. There are apparently a lot of "open license" licenses: https://spdx.org/licenses/
We seem to be creating a lot of fake test data; is it possible to reduce this number and still test things appropriately?
# FLAIR dataset is released under open license 2.0
# ..... https://www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf
# ..... https://ignf.github.io/FLAIR/#FLAIR2
No need for this, it's already included in the docs. The important thing is that this file is MIT licensed like the rest of TorchGeo.
tests/datamodules/test_flair2.py
Outdated
We can delete this file and instead test the dataset using a `tests/conf/flair2.yaml` file and 1 line of code in `tests/trainers/test_segmentation.py`. We are actively trying to get rid of `tests/datamodules` since it doesn't test compatibility.
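As a rough illustration of what such a config might look like, modeled on the LightningCLI-style test configs used elsewhere in torchgeo. Every key and value below is a guess, not a verified file from this PR:

```yaml
model:
  class_path: SemanticSegmentationTask
  init_args:
    loss: 'ce'
    model: 'unet'
    backbone: 'resnet18'
    in_channels: 5
    num_classes: 13
data:
  class_path: FLAIR2DataModule
  init_args:
    batch_size: 1
  dict_kwargs:
    root: 'tests/data/flair2'
```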
def dataset(
    self, monkeypatch: MonkeyPatch, tmp_path: Path, request: SubRequest
) -> FLAIR2:
    md5s = {
Could also just skip `checksum=True` and not bother monkeypatching any MD5s
Raises:
    DatasetNotFoundError

..versionadded:: 0.7
- ..versionadded:: 0.7
+ .. versionadded:: 0.7
sentinel_bands: which bands to load from sentinel data (B01, B02, ..., B10)

Raises:
    DatasetNotFoundError
Needs a description for when this error is raised
def _extract(self, file_path: str) -> None:
    """Extract the dataset."""
    assert isinstance(self.root, str | os.PathLike)
No need to assert this, it's already checked by mypy
    a matplotlib Figure with the rendered sample
"""

def normalize_plot(tensor: Tensor) -> Tensor:
I would instead use `.utils.percentile_normalization`
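For context, `percentile_normalization` in `torchgeo.datasets.utils` clips an image to lower/upper percentiles and rescales it to [0, 1]. A pure-Python approximation of that idea (the interpolation details may differ from the real helper, and the function name here is my own):

```python
def percentile_normalize(values, lower=2.0, upper=98.0):
    """Clip values to the given percentiles, then rescale to [0, 1]."""
    ordered = sorted(values)

    def percentile(p):
        # Linear interpolation between closest ranks of the sorted data
        k = (len(ordered) - 1) * p / 100.0
        lo, hi = int(k), min(int(k) + 1, len(ordered) - 1)
        frac = k - lo
        return ordered[lo] * (1 - frac) + ordered[hi] * frac

    lo_v, hi_v = percentile(lower), percentile(upper)
    span = (hi_v - lo_v) or 1.0
    return [min(max((v - lo_v) / span, 0.0), 1.0) for v in values]
```

This suppresses outliers that would otherwise wash out the contrast of a plotted sample, which is why it tends to beat a plain min/max normalization for visualization.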
if showing_predictions:
    predictions = sample['prediction'].numpy().astype('uint8').squeeze()

# Remove none available plots
- # Remove none available plots
+ # Remove non-available plots
I think it'd make sense to use
I am in contact with @agarioud. If I understand him correctly, I doubt that new datasets will have the requirement to be backward-compatible.
But I agree. Adding versioning to
The upcoming FLAIR-INC dataset will include all data from FLAIR#1 (aerial images) and FLAIR#2 (which added Sentinel-2 data that were previously NPY files covering larger extents, but will now have the same spatial extent as the aerial patches and be in TIFF format). Additionally, it will introduce five new modalities and a second supervision dataset. Backward compatibility is not fully ensured. For example, aerial images will have one less channel, as DSM/DTM has been introduced as a separate modality. Additionally, the supervision dataset regarding land cover has reordered classes. Therefore, I also believe a
This sounds like a good approach 👍 @agarioud sounds great. Looking forward to the release 🙂
It's not really a matter of compatibility. Basically, we either use:

class FLAIR(NonGeoDataset, abc.ABC):
    # shared base class

class FLAIR2(FLAIR):
    # override specific stuff

or:

class FLAIR(NonGeoDataset):
    def __init__(self, version=2, ...):
        if version == 2:
            # override specific stuff

It's more a question of whether the devs think of this as a new version of an existing dataset or a new dataset. We usually go with the former because it offers a bit better reproducibility (if the default
FLAIR#2 dataset

The FLAIR #2 <https://github.com/IGNF/FLAIR-2> dataset is an extensive dataset from the French National Institute of Geographical and Forest Information (IGN) that provides a unique and rich resource for large-scale geospatial analysis. The dataset is sampled countrywide and is composed of over 20 billion annotated pixels of very high resolution aerial imagery at 0.2 m spatial resolution, acquired over three years and different months (spatio-temporal domains).

The FLAIR2 dataset is a dataset for semantic segmentation of aerial images. It contains aerial images, Sentinel-2 images, and masks for 13 classes. The dataset is split into a training and test set.
Implementation Details

`NonGeoDataset`, `__init__()`

After discussions following #2303, we decided that, at least until the faulty mask data are fixed, the FLAIR2 dataset will be of type `NonGeoDataset`. Unlike common `NonGeoDatasets`, FLAIR2 exposes a `use_toy` and a `use_sentinel` argument. The `use_toy` flag will instead use the toy data, which is a small subset of the data. The `use_sentinel` argument, on the other hand, decides whether a sample includes the augmented sentinel data provided by the maintainers of FLAIR2.

`_verify`, `_download`, `_extract`

As each of the splits/sample-types (i.e. [train, test], [aerial, sentinel, labels]) is contained in an individual zip download, download and extraction have to happen multiple times. The toy dataset, on the other hand, is contained in a single zip. Furthermore, to map the super-patches of the sentinel data to the actual input image, a `flair-2_centroids_sp_to_patch.json` is required, which also has to be downloaded as an individual zip.

`_load_image`, `_load_sentinel`, `_load_target`

For storage reasons, the elevation (5th band) of the image is stored as a uint; the original height is thus multiplied by 5. We decided to divide the height by 5 to recover the original height, to make the trained model more usable for other data. See Questions, please.
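The uint encoding of the height channel described above can be undone as shown; the factor 5 comes from the text, while the helper name is made up for illustration:

```python
def decode_height(stored_value: int) -> float:
    """Recover the original height from the uint-encoded 5th band.

    The dataset stores height multiplied by 5 so it fits an integer grid;
    dividing by 5 restores the original value (in the dataset's height units).
    """
    return stored_value / 5
```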
As mentioned previously, additional metadata has to be used to get from the `sentinel.npy` to the actual area. Initially, for debugging reasons, we implemented returning not the cropped image but the original data together with the cropping slices (i.e. indices). Consequently, the images can be plotted in a more meaningful manner; otherwise, the resolution is so low that one can hardly recognize features. This was crucial for debugging to find the correct logic (classic y, x instead of x, y ordering mistake). We do not know if this is smart for "production code". See Questions, please.

Moreover, the dimensions of the sentinel data $T \times C = 10 \times W \times H$ vary in both $T$ and $W$, $H$. This is problematic for the datamodule. We have not done extensive research, but the varying dimensions seem to bug the module. Disabling the `use_sentinel` flag will make the module work.

The labels include values from 1 to 19. The datapaper clearly mentions grouping classes $> 13$ into one class `other` due to underrepresentation. We followed this suggestion. Furthermore, rescaling from 0 to 12 was applied. See Questions, please.

Questions
How shall we load/provide sentinel data? As cropped data, or in some other way? I do not see the current implementation as fit for production.
Shall we rescale the Classes to start from 0? Shall we group the classes as suggested in the datapaper?
Integrity checking in `download_url` does not seem to work (in unit tests), why? We have to add an explicit `check_integrity` call; otherwise the tests pass, even if md5s do not match.
The GitHub Actions on the forked repo produce a magic ruff error (https://github.com/MathiasBaumgartinger/torchgeo/actions/runs/11687694109/job/32556175383#step:7:1265). Can you help me resolve this mystery?
TODOs/FIXMEs