-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support FLAIR Datamodule for semantic segmentation #2303
Comments
Yes, we would love to have FLAIR in TorchGeo! If you would like to take a stab at this, see https://torchgeo.readthedocs.io/en/stable/user/contributing.html#datasets for a list of files you'll need to modify. We have dozens of other semantic segmentation datasets you can base your code on. Let us know if you have any trouble with the testing and we would be happy to help! |
Interestingly, I have just been working with FLAIR and they happily gave an admission for using their dataset in torchgeo. I am currently working on releasing it as a datamodule, however, there are some flaws with the current state of the FLAIR dataset which make them somewhat hard to integrate (see @#2292). |
I thought we solved all the problems in #2292? |
Absolutely my bad! I linked the wrong issue: rasterio/rasterio#3178 and OSGeo/gdal#10820. As discussed there, the FLAIR dataset currently provides an geographically underspecified mask dataset. While there is a workaround for this (for some reason i found both a My initial plan was to wait for the maintainers to fix the problem and release the module afterwards. However, if you think that would be appropriate, I could integrate the GDAL operations in the |
The images are already pre-chipped. Any reason you don't want to make a NonGeoDataset? |
Not necessarily. I just thought if there is geo-information, I might as well make it accessible. Do you prefer it as |
I find |
Alright! Another thing that comes to mind is that the most recent release (FLAIR#2) jointly includes preprocessed SENTINEL-2 data alongside aerial and mask images. A short summary of the changes mentioned in the datapaper:
With the considerations above, I suppose it would be better to include the data provided by the maintainers of FLAIR as opposed to the original SENTINEL-2 dataset of torchgeo using an intersection dataset. Any thoughts on that? |
Since the filenames and file formats are completely different from raw Sentinel-2 data, we would either have to create a new class for |
Hello, As a maintainer of the FLAIR dataset at IGN, i greatly appreciate your effort in integrating our dataset. Seeing this issue i would like to give you some information about the release of a new version of FLAIR in the next weeks. Among others, it will spatially align (patch-wise, so no super-areas any more) multiple modalities, included Sentinel-2, with common file formats. This new release will also have a bigger scale (about 3 times the current size). You might consider waiting until this new version is released before putting in the effort of integrating the FLAIR#2 Sentinel-2 imagery. Aerial imagery will stay in the same format. We are happy to provide any information or support that can help in this effort. |
@agarioud, nice to hear directly from you! I've already finished a reasonably working version, so the effort to finish everything (mainly documentation/cleanup) would be very small. Can you give me a more specific time frame in which you plan to release? Also, are you willing to share the pre-processing steps applied on the Sentinel-2 data? In a released version I would like to be able to perform those processing steps on other Sentinel-2 data as well to achieve maximum performance during prediction in unseen areas. |
Unfortunately i cannot give you an exact time frame. We are currently working on preparing the data, and as for the previous releases, we would like to add some documentation to it. I'll notify you as soon as we have more visibility. Regarding pre-processing of Sentinel-2 we have the following steps : we use BOA L2A data, cropped to the aerial patch extent (which is 512px at 0.2m so 102.4x102.4 m), resample to 10.24 m the Sentinel-2 to have 10x10 pixels patches. For each patch we stacked the 10 spectral bands of each dates (i.e., if 38 acquisitions, the patch has 380 channels) together to reduce inode footprint of the dataset. If you need more precise information you can contact me : flair@ign.fr Also, we will release a new batch of pre-trained models on the new dataset on our HuggingFace IGNF page. |
I forgot to say that we store snow and cloud masks as separate files. Also, the acquisition dates are stored in a JSON file for each area. |
Do you think it would be useful to have a single dataset with a |
The new release will include all previous areas and data but extend to other areas and other modalities. As such, i think a versioning is not necessary, rather than a 'area/patch' selection corresponding to the FLAIR#1 and #2 versions ? That being said, if one would like to include the super-areas of FLAIR#2, this would need some specific dataloading. |
The Clay model has a location encoder and could utilize the geographic information. I think a GeoDataset would be more valuable in the long run for models the can accept inputs beyond images. It also provides useful context for sampling and evaluation. |
That is pretty much my initial thought. Lots of research trying to utilize information beyond just color channels. So concluding: I will create a first pull request using a In any case I will try my best to update the module ASAP once @agarioud and his team release the new FLAIR dataset with properly specified CRS on the masks. |
Note that it is possible to return lat/long coords from a NonGeoDataset. The difference (in my mind) between that and a GeoDataset is storing all bounding boxes in a spatiotemporal R-tree. This can be slower, but makes it easier to sample small patches from large tiles or to combine the dataset with other GeoDatasets. |
I think @nilsleh needs a FLAIR data loader for some of his work. From our side, we would love to see a version 1 data loader in the near future that can later be converted to a version 2 once the new dataset is released. |
Hi! I had a packed schedule the last few weeks. I think I can work on refining my first draft and create a first PRQ for review tomorrow. |
FYI: I have been working on the FLAIR dataset yesterday and today. However, the integration of the sentinel data (which I have not used before) sadly turns out far more complicated than I hoped. You can see my progress at: https://github.com/MathiasBaumgartinger/torchgeo |
What were the challenges? |
Well, what took me the most time was a classic Happy to share a first draft: #2394 📨 |
Summary
I just stumbled across FLAIR, a high res (.2 meter) semantic segmentation dataset of 19 land cover categories. Full description here https://ignf.github.io/FLAIR/#FLAIR1 That link also references a U-Net model baseline trained on FLAIR
It is described in an OGC Report on ML Engineering as "a comprehensive and high-quality collection of labeled satellite imagery aimed at advancing land cover classification and geospatial analysis tasks". It's maintained by the French National Institute of Geographic and Forest Information (IGN).
Rationale
I'm interested in composing a list of high quality, challenging datasets for benchmarking semantic segmentation and object detection models and at a glance FLAIR seems like one of them. It seems like the rigor of maintenance and description of the FLAIR dataset is high compared to other datasets and so I would like to see this available in torchgeo.
Also, I think I recall seeing that torchgeo would like to offer models that are inference-ready, not just pretrained with self-superivison but fine-tuned to address popular tasks in remote sensing. This U-Net model seems like a good start, but I can raise a separate issue for adding the model.
Implementation
I haven't contributed to torchgeo before but I would check out other PRs that added datamodules and pretrained models and follow that example.
Alternatives
No response
Additional information
This comment is on my mind: Clay-foundation/model#269 (comment)
I'd love for more challenging datasets to be front and center when benchmarking models rather than Eurosat. I'd also like for us to use datasets that we have a solid understanding of the geographic and class distribution and I like that FLAIR lays this out on their site.
I might be a bit slow to implement this but would like to submit a PR when I have time if this sounds like a good idea.
The text was updated successfully, but these errors were encountered: