Changes to BreaKHis & GleasonArvaniti (#759)
nkaenzig authored Feb 13, 2025
1 parent 17549e3 commit c59d1e5
Showing 12 changed files with 48 additions and 66 deletions.
Original file line number Diff line number Diff line change
@@ -51,7 +51,7 @@ model:
class_path: torch.nn.Linear
init_args:
in_features: ${oc.env:IN_FEATURES, 384}
out_features: &NUM_CLASSES 8
out_features: &NUM_CLASSES 4
criterion: torch.nn.CrossEntropyLoss
optimizer:
class_path: torch.optim.AdamW
Original file line number Diff line number Diff line change
@@ -33,7 +33,6 @@ trainer:
dataloader_idx_map:
0: train
1: val
2: test
backbone:
class_path: eva.vision.models.ModelFromRegistry
init_args:
@@ -84,11 +83,6 @@ data:
init_args:
<<: *DATASET_ARGS
split: val
test:
class_path: eva.datasets.EmbeddingsClassificationDataset
init_args:
<<: *DATASET_ARGS
split: test
predict:
- class_path: eva.vision.datasets.GleasonArvaniti
init_args: &PREDICT_DATASET_ARGS
@@ -103,10 +97,6 @@ data:
init_args:
<<: *PREDICT_DATASET_ARGS
split: val
- class_path: eva.vision.datasets.GleasonArvaniti
init_args:
<<: *PREDICT_DATASET_ARGS
split: test
dataloaders:
train:
batch_size: &BATCH_SIZE ${oc.env:BATCH_SIZE, 256}
@@ -115,9 +105,6 @@ data:
val:
batch_size: *BATCH_SIZE
num_workers: *N_DATA_WORKERS
test:
batch_size: *BATCH_SIZE
num_workers: *N_DATA_WORKERS
predict:
batch_size: &PREDICT_BATCH_SIZE ${oc.env:PREDICT_BATCH_SIZE, 64}
num_workers: *N_DATA_WORKERS
Original file line number Diff line number Diff line change
@@ -44,7 +44,7 @@ model:
class_path: torch.nn.Linear
init_args:
in_features: ${oc.env:IN_FEATURES, 384}
out_features: &NUM_CLASSES 8
out_features: &NUM_CLASSES 4
criterion: torch.nn.CrossEntropyLoss
optimizer:
class_path: torch.optim.AdamW
Original file line number Diff line number Diff line change
@@ -80,11 +80,6 @@ data:
init_args:
<<: *DATASET_ARGS
split: val
test:
class_path: eva.vision.datasets.GleasonArvaniti
init_args:
<<: *DATASET_ARGS
split: test
dataloaders:
train:
batch_size: &BATCH_SIZE ${oc.env:BATCH_SIZE, 256}
@@ -93,6 +88,3 @@ data:
val:
batch_size: *BATCH_SIZE
num_workers: *N_DATA_WORKERS
test:
batch_size: *BATCH_SIZE
num_workers: *N_DATA_WORKERS
17 changes: 9 additions & 8 deletions docs/datasets/breakhis.md
Original file line number Diff line number Diff line change
@@ -2,7 +2,9 @@

The Breast Cancer Histopathological Image Classification (BreakHis) is composed of 9,109 microscopic images of breast tumor tissue collected from 82 patients using different magnifying factors (40X, 100X, 200X, and 400X). For this benchmark we only use the 40X samples which results in a subset of 1,995 images. This database has been built in collaboration with the P&D Laboratory, Pathological Anatomy and Cytopathology, Parana, Brazil.

The dataset is divided into two main groups: benign tumors and malignant tumors. The dataset currently contains four histological distinct types of benign breast tumors: adenosis (A), fibroadenoma (F), phyllodes tumor (PT), and tubular adenona (TA); and four malignant tumors (breast cancer): carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC) and papillary carcinoma (PC).
The dataset is divided into two main groups: benign tumors and malignant tumors. The original dataset contains four histologically distinct types of benign breast tumors: adenosis (A), fibroadenoma (F), phyllodes tumor (PT), and tubular adenoma (TA); and four malignant tumors (breast cancer): ductal carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC) and papillary carcinoma (PC).

Given that patient counts for some classes are very low (e.g. 3 for PT), we only use classes with at least 7 patients for this benchmark: TA, MC, F & DC.
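
The class selection described above can be sketched as a count of distinct patients per class followed by a threshold filter. This is a minimal illustration, not the code used in this commit: the `(class_name, patient_id)` input format, the `select_classes` helper and the toy data are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

MIN_PATIENTS = 7  # threshold stated in the text above


def select_classes(
    samples: Iterable[Tuple[str, str]], min_patients: int = MIN_PATIENTS
) -> List[str]:
    """Return the classes backed by at least `min_patients` distinct patients.

    `samples` yields (class_name, patient_id) pairs; this input format is an
    assumption made for illustration.
    """
    patients_per_class: Dict[str, Set[str]] = defaultdict(set)
    for class_name, patient_id in samples:
        patients_per_class[class_name].add(patient_id)
    return sorted(
        name
        for name, patients in patients_per_class.items()
        if len(patients) >= min_patients
    )


# Made-up toy data: 3 patients for PT, 7 patients for F (one with two images).
toy_samples = (
    [("PT", f"pt-{i}") for i in range(3)]
    + [("F", f"f-{i}") for i in range(7)]
    + [("F", "f-0")]
)
print(select_classes(toy_samples))  # ['F']
```

Counting patients rather than images matters here, because a class can have many images that all come from a handful of patients.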

## Raw data

@@ -11,24 +13,23 @@ The dataset is divided into two main groups: benign tumors and malignant tumors.
| | |
|--------------------------------|-----------------------------|
| **Modality** | Vision (WSI patches) |
| **Task** | Multiclass classification (8 classes) |
| **Task** | Multiclass classification (4 classes) |
| **Cancer type** | Breast |
| **Data size** | 4 GB |
| **Image dimension** | 700 x 460 |
| **Magnification (μm/px)** | 40x (0.25) |
| **Files format** | `png` |
| **Number of images** | 1995 |
| **Number of images** | 1471 |


### Splits

The data source provides train/validation splits
The data source provides train/validation splits. There is no overlap of patients between the splits, and a stratified distribution of the classes is approximated (exact stratification is not possible due to the patient separation constraint).

| Splits | Train | Validation |
|----------|---------------|--------------|
| #Samples | 1393 (70%) | 602 (30%) |
| Splits | Train | Validation |
|----------|------------------|-----------------|
| #Samples | 1132 (76.95%)    | 339 (23.05%)    |

A test split is not provided, because further dividing the dataset would leave too few samples per class for a robust evaluation. __eva__ therefore reports evaluation results for BreakHis on the validation split.
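
The two split properties mentioned in this section (no patient appears in both splits, and class proportions are only approximately matched) can be checked with a few lines of Python. The sketch below is illustrative only: the `(class_name, patient_id)` record format, the helper names and the toy data are assumptions and do not come from the dataset.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Sample = Tuple[str, str]  # (class_name, patient_id), a hypothetical record format


def shared_patients(train: List[Sample], val: List[Sample]) -> Set[str]:
    """Return patient IDs that occur in both splits (should be empty)."""
    return {pid for _, pid in train} & {pid for _, pid in val}


def class_fractions(samples: List[Sample]) -> Dict[str, float]:
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(name for name, _ in samples)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Made-up toy splits: whole patients are assigned to exactly one split.
train = [("F", "a"), ("F", "a"), ("F", "b"), ("DC", "c"), ("DC", "c"), ("DC", "d")]
val = [("F", "e"), ("F", "e"), ("DC", "f")]

print(shared_patients(train, val))  # set()
print(class_fractions(train))
print(class_fractions(val))
```

Because whole patients are assigned to a split, the class fractions of the two splits generally differ slightly, which is the approximation the text refers to.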


### Organization
8 changes: 4 additions & 4 deletions docs/datasets/gleason_arvaniti.md
Original file line number Diff line number Diff line change
@@ -22,14 +22,14 @@ Images are classified as benign, Gleason pattern 3, 4 or 5. The dataset contains

### Splits

We use the same splits as proposed in the paper:
The following splits are proposed in the paper:

| Splits | Train | Validation | Test |
|---|---------------|--------------|--------------|
| Splits | Train | Validation | Test |
|----------|-----------------|----------------|----------------|
| #Samples | 15,303 (67.26%) | 2,482 (10.91%) | 4,967 (21.83%) |

Note that the authors chose TMA 76 as validation cohort because it contains the most balanced distribution of Gleason scores.

We couldn't achieve stable results when evaluating on the test set, so we only use the train and validation sets for this benchmark.

## Download and preprocessing
The `GleasonArvaniti` dataset class doesn't download the data at runtime; the data must be downloaded and preprocessed manually:
2 changes: 1 addition & 1 deletion docs/datasets/index.md
Original file line number Diff line number Diff line change
@@ -11,7 +11,7 @@
|------------------------------------|----------|-------------|------------------------|----------------------------|------------------|
| [BACH](bach.md) | 400 | 2048x1536 | 20x (0.5) | Classification (4 classes) | Breast |
| [BRACS](bracs.md) | 4539 | variable | 40x (0.25) | Classification (7 classes) | Breast |
| [BreakHis](breakhis.md) | 1995 | 700x460 | 40x (0.25) | Classification (8 classes) | Breast |
| [BreakHis](breakhis.md) | 1471 | 700x460 | 40x (0.25) | Classification (4 classes) | Breast |
| [CRC](crc.md) | 107,180 | 224x224 | 20x (0.5) | Classification (9 classes) | Colorectal |
| [GleasonArvaniti](gleason_arvaniti.md) | 22,752   | 750x750     | 40x (0.23)             | Classification (4 classes) | Prostate         |
| [PatchCamelyon](patch_camelyon.md) | 327,680 | 96x96 | 10x (1.0) \* | Classification (2 classes) | Breast |
55 changes: 25 additions & 30 deletions src/eva/vision/data/datasets/classification/breakhis.py
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@
import functools
import glob
import os
from typing import Callable, Dict, List, Literal, Set
from typing import Any, Callable, Dict, List, Literal, Set

import torch
from torchvision import tv_tensors
@@ -28,37 +28,26 @@ class BreaKHis(base.ImageClassification):

_val_patient_ids: Set[str] = {
"18842D",
"16184",
"8168",
"4372",
"16716",
"9146",
"21978AB",
"6241",
"17901",
"12465",
"3411F",
"18842",
"2980",
"15570C",
"2985",
"13413",
"3909",
"14134E",
"2523",
"19854C",
"19979",
"29960CD",
"21998AB",
"29960AB",
"14946",
"15275",
"15792",
"16875",
"3909",
"5287",
"16716",
"2773",
"5695",
"16184CD",
"23060CD",
"21998CD",
"21998EF",
}
"""Patient IDs to use for dataset splits."""

_expected_dataset_lengths: Dict[str | None, int] = {
"train": 1393,
"val": 602,
None: 1995,
"train": 1132,
"val": 339,
None: 1471,
}
"""Expected dataset lengths for the splits and complete dataset."""

@@ -106,7 +95,7 @@ def __init__(
@property
@override
def classes(self) -> List[str]:
return ["A", "F", "PT", "TA", "DC", "LC", "MC", "PC"]
return ["TA", "MC", "F", "DC"]

@property
@override
@@ -151,8 +140,8 @@ def validate(self) -> None:
_validators.check_dataset_integrity(
self,
length=self._expected_dataset_lengths[self._split],
n_classes=8,
first_and_last_labels=("A", "PC"),
n_classes=4,
first_and_last_labels=("TA", "DC"),
)

@override
@@ -165,6 +154,10 @@ def load_target(self, index: int) -> torch.Tensor:
class_name = self._extract_class(self._image_files[self._indices[index]])
return torch.tensor(self.class_to_idx[class_name], dtype=torch.long)

@override
def load_metadata(self, index: int) -> Dict[str, Any]:
return {"patient_id": self._extract_patient_id(self._image_files[self._indices[index]])}

@override
def __len__(self) -> int:
return len(self._indices)
@@ -200,6 +193,8 @@ def _make_indices(self) -> List[int]:
val_indices = []

for index, image_file in enumerate(self._image_files):
if self._extract_class(image_file) not in self.classes:
continue
patient_id = self._extract_patient_id(image_file)
if patient_id in self._val_patient_ids:
val_indices.append(index)
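
The filtering step added in this hunk can be illustrated with a standalone sketch. It assumes BreaKHis-style file names such as `SOB_B_TA-14-0001-40-001.png`, with the tumor type before the first hyphen and the patient ID as the third hyphen-separated field. The helpers `extract_class`, `extract_patient_id` and `make_indices` below are hypothetical stand-ins, not the methods of the `BreaKHis` class, and the file names and patient IDs are made up.

```python
import os
from typing import List, Set, Tuple

CLASSES = ["TA", "MC", "F", "DC"]  # the four classes kept by this commit


def extract_class(path: str) -> str:
    """Assumed naming scheme: 'SOB_B_TA-14-0001-40-001.png' -> 'TA'."""
    return os.path.basename(path).split("-")[0].split("_")[-1]


def extract_patient_id(path: str) -> str:
    """Assumed naming scheme: 'SOB_B_TA-14-0001-40-001.png' -> '0001'."""
    return os.path.basename(path).split("-")[2]


def make_indices(
    image_files: List[str], val_patient_ids: Set[str]
) -> Tuple[List[int], List[int]]:
    """Split file indices by patient, skipping files of excluded classes."""
    train_indices: List[int] = []
    val_indices: List[int] = []
    for index, image_file in enumerate(image_files):
        if extract_class(image_file) not in CLASSES:
            continue  # e.g. PT images are dropped before the split is made
        if extract_patient_id(image_file) in val_patient_ids:
            val_indices.append(index)
        else:
            train_indices.append(index)
    return train_indices, val_indices


# Made-up file names and patient IDs, for illustration only.
files = [
    "SOB_B_TA-14-0001-40-001.png",
    "SOB_B_PT-14-0002-40-001.png",
    "SOB_M_DC-14-0003-40-001.png",
]
train_indices, val_indices = make_indices(files, val_patient_ids={"0003"})
print(train_indices, val_indices)  # [0] [2]
```

The returned indices still refer to positions in the full file list, so a skipped file simply never appears in either split.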
Original file line number Diff line number Diff line change
@@ -8,6 +8,7 @@

import pandas as pd
import torch
from loguru import logger
from torchvision import tv_tensors
from typing_extensions import override

@@ -100,6 +101,12 @@ def prepare_data(self) -> None:
if not os.path.isdir(os.path.join(self._root, "test_patches_750")):
raise FileNotFoundError(f"`test_patches_750` directory not found in {self._root}")

if self._split == "test":
logger.warning(
"The test split currently leads to unstable evaluation results. "
"We recommend using the validation split instead."
)

@override
def configure(self) -> None:
self._indices = self._make_indices()
