Datasets

flooder.datasets.datasets

Implementation of datasets used in the original Flooder paper.

Copyright (c) 2025 Paolo Pellizzoni, Florian Graf, Martin Uray, Stefan Huber and Roland Kwitt SPDX-License-Identifier: MIT

BaseDataset

BaseDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: Dataset

Base class for Flooder datasets with download/process/load lifecycle.

This class provides a dataset API inspired by torch_geometric.data.Dataset, including:
  • a standard directory layout (root/raw and root/processed);
  • a lifecycle executed at construction time: download, process, load;
  • integer indexing to return items, and advanced indexing to return a subset "view" of the dataset;
  • optional per-sample transformations.

Subclasses must implement the abstract properties/methods that specify file requirements and dataset-specific loading logic.

Attributes:

Name Type Description
root str

Root directory containing the dataset folders.

fixed_transform Callable[[FlooderData], FlooderData] | None

Optional transform applied once to each item during _load() (i.e., at dataset load time, before storing in memory).

transform Callable[[FlooderData], FlooderData] | None

Optional transform applied on-the-fly in __getitem__ for individual samples.

_indices Sequence[int] | None

If not None, defines a subset view over the underlying dataset indices.

Notes
  • The constructor triggers _download(), _process(), and _load(). This means instantiation may perform I/O and compute.
  • Advanced indexing (slice, sequences, boolean masks) returns a shallow-copied dataset object sharing the same underlying storage (whatever the subclass uses), but with _indices set.
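The subset-view mechanics described in these notes can be illustrated with a minimal stand-in class (a sketch of the behavior only, not the real BaseDataset; names are illustrative):

```python
import copy

class TinyDataset:
    """Minimal stand-in illustrating the subset-view mechanics
    (a sketch, not the real BaseDataset)."""

    def __init__(self, items):
        self.items = items        # underlying storage, shared across views
        self._indices = None      # None means "full dataset"

    def indices(self):
        return range(len(self.items)) if self._indices is None else self._indices

    def __len__(self):
        return len(self.indices())

    def __getitem__(self, idx):
        if isinstance(idx, int):
            return self.items[self.indices()[idx]]
        view = copy.copy(self)    # shallow copy: storage is shared
        if isinstance(idx, slice):
            view._indices = list(self.indices())[idx]
        else:
            view._indices = [self.indices()[i] for i in idx]
        return view

ds = TinyDataset(list("abcde"))
sub = ds[1:4]                     # subset view over global indices 1..3
assert len(sub) == 3 and sub[0] == "b"
assert sub.items is ds.items      # same underlying storage
```

The key point is that a view is cheap: only the index mapping is copied, never the data.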

Initialize a dataset and execute the download/process/load lifecycle.

Parameters:

Name Type Description Default
root str

Root directory where raw and processed data are stored.

required
fixed_transform Callable[[FlooderData], FlooderData] | None

Optional transform applied once to each item during _load().

None
transform Callable[[FlooderData], FlooderData] | None

Optional transform applied on-the-fly in __getitem__.

None
Notes

Instantiation may perform I/O and compute by calling _download(), _process(), and _load().

processed_dir property

processed_dir: str

Directory containing processed files.

Returns:

Name Type Description
str str

Path to <root>/processed.

processed_file_names property

processed_file_names: Union[str, List[str], Tuple[str, ...]]

Required processed files to consider the dataset processed.

Subclasses should return the file name(s) expected to exist in processed_dir. If all such files exist, _process() is skipped.

Returns:

Type Description
Union[str, List[str], Tuple[str, ...]]

File name(s) expected inside self.processed_dir.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

processed_paths property

processed_paths: List[str]

Absolute paths to required processed files.

Returns:

Type Description
List[str]

List of absolute paths for processed_file_names under processed_dir.

raw_dir property

raw_dir: str

Directory containing raw downloaded files.

Returns:

Name Type Description
str str

Path to <root>/raw.

raw_file_names property

raw_file_names: Union[str, List[str], Tuple[str, ...]]

Required raw files to consider the dataset downloaded.

Subclasses should return the file name(s) expected to exist in raw_dir. If all such files exist, _download() is skipped.

Returns:

Type Description
Union[str, List[str], Tuple[str, ...]]

File name(s) expected inside self.raw_dir.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

raw_paths property

raw_paths: List[str]

Absolute paths to required raw files.

Returns:

Type Description
List[str]

List of absolute paths for raw_file_names under raw_dir.

__getitem__

__getitem__(
    idx: Union[int, integer, IndexType],
) -> "FlooderData | BaseDataset"

Get an item or a subset of the dataset.

Behavior depends on the type of idx:

  • If idx is an integer (Python int, np.integer, 0-dim Tensor, or scalar np.ndarray), returns a single FlooderData object corresponding to the view index idx.
  • Otherwise, returns a subset view of the dataset created via index_select(idx).

If transform is set, it is applied on-the-fly to single-item access.

Parameters:

Name Type Description Default
idx int | integer | slice | Tensor | ndarray | Sequence

Index or indices selecting items.

required

Returns:

Type Description
'FlooderData | BaseDataset'

FlooderData | BaseDataset: A single data object if idx is scalar-like, otherwise a BaseDataset subset view.

Raises:

Type Description
IndexError

If idx type is unsupported (delegated to index_select).

__iter__

__iter__() -> Iterator[FlooderData]

Iterate over items in the current dataset view.

Yields:

Name Type Description
FlooderData FlooderData

Items in order from 0 to len(self) - 1, with transform applied if configured.

__len__

__len__() -> int

Return the number of examples in the current dataset view.

For a full dataset this equals self.len(). For a subset view this equals len(self._indices).

Returns:

Name Type Description
int int

Number of examples exposed by this dataset instance.

download

download() -> None

Download the dataset into raw_dir.

Subclasses must implement the dataset-specific download logic. This method is called by _download() only if the required files in raw_paths are not present.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

get

get(idx: int) -> FlooderData

Return the data object at a given global index.

idx refers to the underlying dataset index, not the subset-view index. The subset mapping is handled by __getitem__ via indices().

Parameters:

Name Type Description Default
idx int

Global index into the underlying dataset storage.

required

Returns:

Name Type Description
FlooderData FlooderData

The data object at the given index.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

index_select

index_select(idx: IndexType) -> 'BaseDataset'

Create a subset view of the dataset from specified indices.

Supported index types:
  • slice: includes support for float boundaries, e.g. dataset[:0.9], interpreted as a fraction of the current view length.
  • torch.Tensor of dtype long: treated as integer indices.
  • torch.Tensor of dtype bool: treated as a boolean mask.
  • np.ndarray of dtype int64: treated as integer indices.
  • np.ndarray of dtype bool: treated as a boolean mask.
  • Sequence (excluding str): treated as a list of integer indices.

The returned dataset is a shallow copy of self with _indices set to map view indices to global indices.
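These normalization rules can be sketched with a small numpy-only helper (hypothetical code; the real index_select additionally handles torch.Tensor indices and masks):

```python
import numpy as np

def normalize_index(idx, n):
    """Hypothetical helper sketching index_select's normalization rules
    for a view of length n (torch.Tensor handling omitted)."""
    if isinstance(idx, slice):
        start, stop = idx.start, idx.stop
        # Float boundaries are fractions of the current view length,
        # e.g. dataset[:0.9] selects the first 90% of the view.
        if isinstance(start, float):
            start = int(start * n)
        if isinstance(stop, float):
            stop = int(stop * n)
        return list(range(*slice(start, stop, idx.step).indices(n)))
    if isinstance(idx, np.ndarray):
        if idx.dtype == bool:
            return np.flatnonzero(idx).tolist()  # boolean mask
        return idx.astype(int).tolist()          # integer indices
    if isinstance(idx, (list, tuple)):
        return [int(i) for i in idx]             # plain integer sequence
    raise IndexError(f"Unsupported index type: {type(idx)!r}")

assert normalize_index(slice(None, 0.9), 10) == list(range(9))   # dataset[:0.9]
assert normalize_index(np.array([True, False, True]), 3) == [0, 2]
```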

Parameters:

Name Type Description Default
idx slice | Sequence[int] | Tensor | ndarray

Indices specifying the subset.

required

Returns:

Name Type Description
BaseDataset 'BaseDataset'

A subset view of the dataset.

Raises:

Type Description
IndexError

If idx is not one of the supported types or has an unsupported dtype.

indices

indices() -> Sequence

Return the active index mapping for this dataset view.

For a full dataset (no subset), this is range(self.len()). For a subset view created via index_select, this is the stored _indices sequence.

Returns:

Type Description
Sequence

Sequence[int]: Index mapping from view indices to global indices.

len

len() -> int

Return the number of items in the full dataset.

This method is analogous to torch_geometric.data.Dataset.len(). It should return the total number of items in the underlying dataset, not the size of a subset view created via index_select.

Returns:

Name Type Description
int int

Total number of data objects stored by the dataset.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

process

process() -> None

Process raw files into processed_dir.

Subclasses must implement the dataset-specific processing logic. This method is called by _process() only if the required files in processed_paths are not present.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

shuffle

shuffle(
    return_perm: bool = False,
) -> "BaseDataset | Tuple[BaseDataset, Tensor]"

Return a shuffled subset view of the dataset.

This method generates a random permutation of the current dataset view and returns a subset view with that ordering.

Parameters:

Name Type Description Default
return_perm bool

If True, also return the permutation tensor.

False

Returns:

Type Description
'BaseDataset | Tuple[BaseDataset, Tensor]'

BaseDataset | tuple[BaseDataset, torch.Tensor]: If return_perm is False, returns the shuffled dataset view. If True, returns (dataset, perm) where perm is a 1D long tensor of indices into the current view.

CoralDataset

CoralDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

Coral point-cloud dataset used in the Flooder paper.

This dataset consists of 81 point clouds, each comprising 1 million points uniformly sampled from surface meshes of coral specimens provided by the Smithsonian 3D Digitization program (https://3d.si.edu/corals). Labels correspond to the coral's genus, with 31 Acroporidae samples (label 0) and 52 Poritidae samples (label 1).

The dataset is distributed as a compressed archive (corals.tar.zst) hosted on Google Drive. The archive is downloaded, validated via SHA256, extracted into raw_dir/<folder_name>/, and processed into per-sample .pt files stored in processed_dir.

Each raw sample is stored as a .npy array that is loaded and normalized by dividing by 32767, and cast to float32. Labels are read from the dataset metadata (meta.yaml) under ydata['data'][<filename>]['label'].

The processed sample is represented as
  • x: torch.FloatTensor containing the point cloud
  • y: integer class label
  • name: sample identifier derived from the file stem

Directory structure (expected after extraction):

raw_dir/corals/
    meta.yaml
    splits.yaml
    *.npy
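The normalization step can be sketched as follows (the raw arrays are assumed to be int16-quantized, inferred from the 32767 divisor; the final torch conversion is indicated in a comment):

```python
import numpy as np

# Assumed int16 quantization, inferred from the 32767 divisor:
quantized = np.array([[32767, 0, -32767]], dtype=np.int16)   # stand-in for np.load(file)
points = (quantized / 32767).astype(np.float32)              # normalize and cast
assert points.dtype == np.float32
assert points.max() == 1.0 and points.min() == -1.0
# process_file would then convert to a torch tensor, e.g. torch.from_numpy(points)
```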

See Also

FlooderDataset: Implements the download/process/load lifecycle.

checksum property

checksum: str

Expected SHA256 checksum of the downloaded archive.

Returns:

Name Type Description
str str

Lowercase hex-encoded SHA256 digest for corals.tar.zst.

file_id property

file_id: str

Google Drive file id for the dataset archive.

Returns:

Name Type Description
str str

Google Drive file id used to construct the download URL.

raw_file_names property

raw_file_names: list[str]

Name of the extracted raw folder under raw_dir.

Returns:

Name Type Description
list[str] list[str]

Folder name containing raw dataset files after extraction.

process_file

process_file(file: Path, ydata: dict) -> FlooderData

Convert a raw .npy file into a FlooderData example.

Loads the point cloud from file using numpy.load, normalizes values by dividing by 32767, casts to float32, and converts to a PyTorch tensor. The class label is read from metadata.

Parameters:

Name Type Description Default
file Path

Path to the raw .npy file to process.

required
ydata dict

Parsed YAML metadata from meta.yaml. Must contain an entry ydata['data'][file.name]['label'].

required

Returns:

Name Type Description
FlooderData FlooderData

Processed example with fields (x, y, name).

Raises:

Type Description
KeyError

If the expected label entry is missing from ydata.

OSError

If the .npy file cannot be read.

ValueError

If the .npy content cannot be converted to float32.

FlooderDataset

FlooderDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: BaseDataset

Base class for Flooder paper datasets distributed as compressed archives.

This dataset class implements a standard pipeline:

1) Download a .tar.zst archive from Google Drive (via gdown), identified by file_id, and validate it with a SHA256 checksum.
2) Decompress and extract the archive into raw_dir/<folder_name>/.
3) Read dataset metadata (meta.yaml) and split definitions (splits.yaml) from the extracted raw folder.
4) Convert each raw .npy file into a FlooderData-like object via process_file(...), and store it as a .pt file in processed_dir.
5) Load all .pt files into memory (self.data) in _load(), optionally applying fixed_transform once per sample at load time.

Subclasses are expected to define
  • file_id: Google Drive file id
  • checksum: expected SHA256 checksum of the downloaded archive
  • folder_name: name of the extracted folder under raw_dir
  • raw_file_names: name(s) of the downloaded raw archive file(s)
  • process_file(...): conversion logic from a .npy file and metadata
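A hypothetical subclass skeleton sketching the five hooks listed above (all names and values are placeholders, and the real subclass would inherit from FlooderDataset):

```python
from pathlib import Path

class MyArchiveDataset:  # would subclass FlooderDataset in practice
    """Hypothetical subclass sketch showing the five required hooks."""

    @property
    def file_id(self) -> str:
        return "GOOGLE-DRIVE-FILE-ID"   # placeholder Google Drive id

    @property
    def checksum(self) -> str:
        return "0" * 64                 # expected SHA256 digest (placeholder)

    @property
    def folder_name(self) -> str:
        return "mydata"                 # raw_dir/mydata/ after extraction

    @property
    def raw_file_names(self) -> list:
        return ["mydata.tar.zst"]       # the downloaded archive

    def process_file(self, file: Path, ydata: dict):
        # dataset-specific: load `file`, build a FlooderData-like object
        raise NotImplementedError

ds = MyArchiveDataset()
assert ds.raw_file_names == ["mydata.tar.zst"]
```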

Attributes:

Name Type Description
data list[FlooderData]

In-memory list of processed examples loaded from .pt files. Populated by _load().

splits dict

Split definitions loaded from processed_dir/splits.yaml. The structure depends on the dataset, but is typically a mapping from split identifier (e.g., fold index) to dicts containing keys like 'trn', 'val', 'tst' with integer indices.

classes list[int]

Sorted list of unique class labels observed across the dataset (computed after loading).

num_classes int

Number of unique classes.

checksum property

checksum: str

Expected SHA256 checksum of the downloaded archive.

The checksum is used by validate(...) after download. If the computed SHA256 does not match, a warning is emitted.

Returns:

Name Type Description
str str

Lowercase hex-encoded SHA256 digest of the expected archive.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

file_id property

file_id: str

Google Drive file id for the dataset archive.

Subclasses must provide the id used to construct the download URL: https://drive.google.com/uc?id=<file_id>.

Returns:

Name Type Description
str str

The Google Drive file id.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

folder_name property

folder_name: str

Name of the extracted folder under raw_dir.

After extraction, the expected raw folder is raw_dir/<folder_name>/ containing meta.yaml, splits.yaml, and the .npy files.

Returns:

Name Type Description
str str

Folder name within raw_dir.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

processed_file_names property

processed_file_names: list[str]

Processed-file sentinel list for determining whether processing is done.

The default convention for Flooder datasets is
  • _done: an empty file indicating processing completion
  • splits.yaml: split definitions copied to the processed directory

Returns:

Type Description
list[str]

list[str]: List of required processed file names.

download

download() -> None

Download the dataset archive from Google Drive into raw_dir.

Constructs a Google Drive download URL from file_id and downloads into raw_dir/<raw_file_names[0]> using gdown. After downloading, calls validate(...) to check integrity.
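The URL construction can be sketched as follows (the file id is a placeholder; the gdown call is shown commented out to avoid network access):

```python
# Sketch of the download step described above. The id is a placeholder,
# not a real Google Drive file id.
file_id = "EXAMPLE_FILE_ID"
url = f"https://drive.google.com/uc?id={file_id}"

# The actual transfer is then delegated to gdown, e.g.:
# import gdown
# gdown.download(url, str(destination_path), quiet=False)
assert url == "https://drive.google.com/uc?id=EXAMPLE_FILE_ID"
```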

Raises:

Type Description
IndexError

If raw_file_names is empty.

OSError

If the destination file cannot be written.

Exception

Propagates errors from gdown.download(...).

get

get(idx: int) -> FlooderData

Return the in-memory data object at the given global index.

This implementation assumes _load() has populated self.data with objects saved in processed_dir as .pt files.

Parameters:

Name Type Description Default
idx int

Global index into self.data.

required

Returns:

Name Type Description
FlooderData FlooderData

The data item at idx.

get_split_indices

get_split_indices(splits_data) -> dict

Extract split indices from raw splits.yaml content.

The default behavior expects the raw splits.yaml to contain a top-level key "splits" holding the split definitions.

Subclasses may override this method if their splits.yaml uses a different schema.

Parameters:

Name Type Description Default
splits_data dict

Parsed YAML content from splits.yaml.

required

Returns:

Name Type Description
Any dict

The split indices structure to be saved into processed_dir/splits.yaml. Typically a dict mapping fold id to split dicts ('trn', 'val', 'tst'), but may vary by dataset.

len

len() -> int

Return the number of examples in the full dataset.

Returns:

Name Type Description
int int

Total number of examples, equal to len(self.data) after _load().

process

process() -> None

Process the extracted raw dataset into serialized .pt files.

Processing performs the following steps:

1) Ensure the archive has been extracted into raw_dir/<folder_name>/.
2) Load metadata from meta.yaml and split definitions from splits.yaml.
3) Save extracted split indices into processed_dir/splits.yaml.
4) Iterate over all .npy files in the extracted folder, sorted by name.
5) For each .npy, call process_file(file, ydata) and save the returned object as <stem>.pt in processed_dir.
6) Create the _done sentinel file in processed_dir.

Raises:

Type Description
FileNotFoundError

If required raw files (meta.yaml, splits.yaml, or .npy files) are missing.

YAMLError

If YAML parsing fails.

OSError

For I/O errors reading raw files or writing processed files.

RuntimeError

If torch.save fails for a produced object.

process_file

process_file(file: Path, ydata: dict) -> FlooderData

Convert a raw .npy file into a FlooderData-like object.

Subclasses must implement the dataset-specific logic for reading file (typically via numpy.load) and for producing an instance of FlooderData (or a subclass like FlooderRocksData).

Parameters:

Name Type Description Default
file Path

Path to a .npy file inside the extracted raw folder.

required
ydata dict

Metadata loaded from meta.yaml. The structure is dataset- dependent but typically contains labels and other targets keyed by file name.

required

Returns:

Name Type Description
FlooderData FlooderData

Processed data object to be saved as a .pt file.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

unzip_file

unzip_file() -> None

Decompress and extract the dataset archive into raw_dir.

This method reads the first file in raw_paths as a .tar.zst archive, decompresses it using zstandard, and extracts it using tarfile.

Extraction behavior depends on Python version
  • Python >= 3.12: uses tar.extractall(..., filter='data') to apply tarfile's safety filter.
  • Older versions: falls back to tar.extractall(...).

Raises:

Type Description
FileNotFoundError

If raw_paths[0] does not exist.

TarError

If the archive is invalid or cannot be read.

ZstdError

If decompression fails.

OSError

For I/O errors during reading or extraction.

Security

Be careful extracting archives from untrusted sources. While Python 3.12's filter='data' mitigates some risks, older versions extract without filtering.

validate

validate(file_path: Path) -> None

Validate a downloaded archive against the expected SHA256 checksum.

Computes the SHA256 digest of file_path and compares it to self.checksum. If they do not match, emits a UserWarning.

Parameters:

Name Type Description Default
file_path Path | str

Path to the downloaded archive.

required

Warns:

Type Description
UserWarning

If the computed checksum differs from the expected checksum.

Raises:

Type Description
FileNotFoundError

If file_path does not exist.

OSError

For I/O errors reading the file.
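The checksum computation can be sketched as a streamed SHA256 (a hypothetical helper mirroring what validate() is described to compute; the chunked read keeps memory bounded for multi-gigabyte archives):

```python
import hashlib
import os
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Streamed SHA256 of a file (hypothetical helper, not the library's code)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Quick check against hashlib on a small temporary file:
fd, tmp = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)
assert sha256_of(Path(tmp)) == hashlib.sha256(b"hello").hexdigest()
os.remove(tmp)
```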

LargePointCloudDataset

LargePointCloudDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

Dataset of large-scale point clouds used in the Flooder paper.

This dataset contains two point clouds with more than 10M points each, distributed as a compressed .tar.zst archive and hosted on Google Drive. The archive is downloaded, validated via SHA256, extracted into raw_dir/<folder_name>/, and processed into per-sample .pt files.

Each processed sample is stored as a LargePointCloudData dataclass with the following attributes
  • x: torch.FloatTensor of point coordinates
  • name: sample identifier
  • description: brief description of the point cloud
Expected extracted raw directory structure

raw/large/
    meta.yaml
    coral.pt
    virus.pt

See Also

FlooderDataset: Implements the shared download, processing, and loading pipeline.

checksum property

checksum: str

Expected SHA256 checksum of the downloaded archive.

Returns:

Name Type Description
str str

Lowercase hex-encoded SHA256 digest for large.tar.zst.

file_id property

file_id: str

Google Drive file id for the LargePointCloudDataset dataset archive.

Returns:

Name Type Description
str str

Google Drive file id used to construct the download URL.

folder_name property

folder_name: str

Name of the extracted raw folder under raw_dir.

Returns:

Name Type Description
str str

Folder name containing the extracted large dataset files.

raw_file_names property

raw_file_names: list[str]

Raw archive file name(s) expected in raw_dir.

Returns:

Type Description
list[str]

list[str]: List containing the dataset archive file name.

uncompressed_file_names property

uncompressed_file_names: list[str]

Uncompressed file name(s) expected in raw_dir.

Returns:

Type Description
list[str]

list[str]: List containing the uncompressed file names.

get

get(idx) -> LargePointCloudData

Return the data object at a given index (either 0 or 1).

Parameters:

Name Type Description Default
idx int

Index to access.

required

Returns:

Name Type Description
LargePointCloudData LargePointCloudData

The data object at the given index.

len

len() -> int

Return the number of items in the full dataset.

process

process() -> None

Extract the raw dataset.

MCBDataset

MCBDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

MCB point-cloud dataset used in the Flooder paper.

This dataset consists of 1745 point clouds, each comprising 1 million points uniformly sampled from surface meshes from a subset of the MCB dataset (A large-scale annotated mechanical components benchmark for classification and retrieval tasks with deep neural networks, ECCV, 2020) available at https://github.com/stnoah1/mcb.

The dataset is distributed as a compressed .tar.zst archive hosted on Google Drive. The archive is downloaded, validated using a SHA256 checksum, extracted into raw_dir/<folder_name>/, and processed into per-sample .pt files stored in processed_dir.

Each raw sample is stored as a .npy array containing quantized point coordinates. During processing, coordinates are normalized by dividing by 32767 and cast to float32.

The processed sample representation is
  • x: torch.FloatTensor of normalized point coordinates
  • y: integer class label
  • name: sample identifier derived from the file stem
Expected extracted raw directory structure

raw_dir/mcb/
    meta.yaml
    splits.yaml
    *.npy

See Also

FlooderDataset: Implements the shared download, processing, and loading pipeline.

checksum property

checksum: str

Expected SHA256 checksum of the downloaded archive.

Returns:

Name Type Description
str str

Lowercase hex-encoded SHA256 digest for mcb.tar.zst.

file_id property

file_id: str

Google Drive file id for the MCB dataset archive.

Returns:

Name Type Description
str str

Google Drive file id used to construct the download URL.

folder_name property

folder_name: str

Name of the extracted raw folder under raw_dir.

Returns:

Name Type Description
str str

Folder name containing the extracted MCB dataset files.

raw_file_names property

raw_file_names: list[str]

Raw archive file name(s) expected in raw_dir.

Returns:

Type Description
list[str]

list[str]: List containing the dataset archive file name.

process_file

process_file(file: Path, ydata: dict) -> FlooderData

Convert a raw .npy file into a FlooderData example.

Loads the raw point cloud from file, normalizes coordinates by dividing by 32767, casts to float32, and converts to a PyTorch tensor. The class label is read from the dataset metadata.

Parameters:

Name Type Description Default
file Path

Path to the raw .npy file inside the extracted MCB dataset folder.

required
ydata dict

Parsed YAML metadata from meta.yaml. Must contain an entry ydata['data'][file.name]['label'].

required

Returns:

Name Type Description
FlooderData FlooderData

Processed example with fields (x, y, name).

Raises:

Type Description
KeyError

If the label entry for file.name is missing in ydata.

OSError

If the .npy file cannot be read.

ValueError

If the loaded array cannot be converted to float32.

ModelNet10Dataset

ModelNet10Dataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

ModelNet10 point-cloud dataset (250k points) used in the Flooder paper.

This dataset consists of 4899 point clouds, each comprising 250k points uniformly sampled from surface meshes from the ModelNet10 dataset (Wu et al., 3D ShapeNets: A Deep Representation for Volumetric Shapes, CVPR 2015) available at https://modelnet.cs.princeton.edu/.

The dataset is distributed as a compressed .tar.zst archive hosted on Google Drive. The archive is downloaded, validated using a SHA256 checksum, extracted into raw_dir/<folder_name>/, and processed into per-sample .pt files stored in processed_dir.

Each raw sample is stored as a .npy array containing quantized point coordinates. During processing, coordinates are normalized by dividing by 32767 and cast to float32.

The processed sample representation is
  • x: torch.FloatTensor of normalized point coordinates
  • y: integer class label in [0, 9]
  • name: sample identifier derived from the file stem
Expected extracted raw directory structure

raw_dir/modelnet10_250k/
    meta.yaml
    splits.yaml
    *.npy

See Also

FlooderDataset: Implements the shared download, processing, and loading pipeline.

RocksDataset

RocksDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

Rock voxel dataset converted to point clouds with geometric targets.

This synthetic dataset consists of 1000 3D binary voxel grids representing rock samples from two classes. The voxel grids are produced by the PoreSpy library (https://porespy.org/) with classes corresponding to the generation method, fractal noise and blobs, each with 500 samples.

The dataset is distributed as a compressed archive (rocks.tar.zst) hosted on Google Drive. During processing, each voxel grid is converted into a set of 3D points by extracting the coordinates of occupied voxels and adding small random jitter to break the lattice structure.

In addition to the class label, each sample includes continuous targets such as surface area and volume.

Processed sample representation
  • x: torch.FloatTensor of shape (N, 3) containing point coordinates
  • y: integer class label
  • surface: float-valued surface area target
  • volume: float-valued volume target
  • name: sample identifier derived from the file stem
Expected extracted raw directory structure

raw_dir/rocks/
    meta.yaml
    splits.yaml
    *.npy

See Also

FlooderDataset: Implements the shared download, processing, and loading pipeline.

process_file

process_file(file: Path, ydata: dict) -> FlooderRocksData

Convert a raw voxel .npy file into a FlooderRocksData example.

Processing steps

1) Load the bit-packed voxel array from file.
2) Unpack bits into a boolean array of shape (256, 256, 256).
3) Extract the indices of occupied voxels using np.where.
4) Convert voxel indices to float coordinates and add small random jitter to avoid degenerate lattice structure.
5) Attach label and continuous targets from metadata.
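Steps 1-4 can be sketched with numpy (a small 8x8x8 grid stands in for the real 256^3 volume, the input array is randomly generated rather than loaded, and the jitter magnitude is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative seed
side = 8                                       # the real dataset uses 256
packed = rng.integers(0, 256, size=side**3 // 8, dtype=np.uint8)  # stand-in for np.load(file)

voxels = np.unpackbits(packed).reshape(side, side, side).astype(bool)  # step 2
coords = np.argwhere(voxels).astype(np.float32)                        # step 3: occupied voxels
# Step 4: sub-voxel jitter to break the lattice (magnitude is an assumption)
coords += rng.uniform(-0.5, 0.5, size=coords.shape).astype(np.float32)

assert coords.shape[1] == 3
```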

Parameters:

Name Type Description Default
file Path

Path to the raw .npy voxel file.

required
ydata dict

Parsed YAML metadata from meta.yaml. Must contain entries for label, target (surface), and volume under ydata['data'][file.name].

required

Returns:

Name Type Description
FlooderRocksData FlooderRocksData

Processed example with fields (x, y, surface, volume, name).

Raises:

Type Description
KeyError

If required metadata entries are missing.

ValueError

If the unpacked voxel array cannot be reshaped to (256, 256, 256).

OSError

If the .npy file cannot be read.

SwisscheeseDataset

SwisscheeseDataset(
    root: str,
    ks: list[int] = [10, 20],
    num_per_class: int = 500,
    num_points: int = 1000000,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

Synthetic "Swiss cheese" point-cloud dataset used in the Flooder paper.

This dataset is generated procedurally (no download). Each sample consists of points uniformly sampled from a 3D axis-aligned box with multiple spherical voids removed ("Swiss cheese"). The number of voids defines the class label.

Unlike the other FlooderDataset subclasses that download a compressed archive, this dataset overrides process() to generate samples and write them directly to processed_dir as .pt files. Split definitions are also generated and saved to processed_dir/splits.yaml.

Class semantics
  • Each class corresponds to a value k in ks, where k is the number of spherical voids carved out of the sampling volume.
  • Label y is the integer index into ks (i.e., ki from enumeration).
Generated file naming

Each sample is saved under a short SHA256-derived identifier computed from the generated point array bytes. This provides deterministic naming for a fixed RNG seed and generation implementation, but note that changes to sampling code, dtype, or ordering can change the resulting hash.
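The naming scheme can be sketched as follows (the truncation length of the hex digest is an assumption, and the zero array is a stand-in for a generated point cloud):

```python
import hashlib
import numpy as np

points = np.zeros((4, 3), dtype=np.float32)    # stand-in for a generated sample
# Short identifier derived from the raw bytes; changing dtype or point
# ordering changes the hash, as noted above. Truncation length is assumed.
name = hashlib.sha256(points.tobytes()).hexdigest()[:16]
assert len(name) == 16
```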

Splits

process() generates 10 random splits (keys 0..9), each containing trn, val, and tst partitions with proportions 72% / 8% / 20%, respectively, over the full dataset indices.
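The split generation can be sketched as follows (a sketch of the described behavior, not the library's code; whether the real implementation partitions a single permutation this way is an assumption):

```python
import numpy as np

n = 1000                                   # total number of samples (example value)
rng = np.random.RandomState(42)            # fixed seed, as described above
splits = {}
for fold in range(10):                     # keys 0..9
    perm = rng.permutation(n)
    n_trn, n_val = int(0.72 * n), int(0.08 * n)   # 72% / 8% / 20%
    splits[fold] = {
        "trn": perm[:n_trn].tolist(),
        "val": perm[n_trn:n_trn + n_val].tolist(),
        "tst": perm[n_trn + n_val:].tolist(),
    }
assert len(splits[0]["trn"]) == 720 and len(splits[0]["tst"]) == 200
```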

Notes
  • Because generation is performed inside process(), instantiation may be compute- and storage-intensive, depending on num_points and dataset size.
  • This class sets a fixed RNG seed (np.random.RandomState(42)) for split generation. The point generation itself depends on the behavior of generate_swiss_cheese_points (and any randomness inside it).

Initialize the Swiss cheese dataset generator.

Parameters:

Name Type Description Default
root str

Root directory where the dataset is stored.

required
ks list[int]

List of void counts, one per class. Each k produces a distinct class corresponding to point clouds with k voids.

[10, 20]
num_per_class int

Number of samples generated for each class. Total dataset size is len(ks) * num_per_class.

500
num_points int

Number of points generated per sample point cloud.

1000000
fixed_transform Callable[[FlooderData], FlooderData] | None

Optional transform applied once per example during _load().

None
transform Callable[[FlooderData], FlooderData] | None

Optional transform applied on-the-fly in __getitem__.

None
Notes
  • Split generation uses a fixed seed (42) via np.random.RandomState.
  • Generation and serialization are performed during process(), which is invoked during FlooderDataset construction if processed artifacts are missing.

folder_name property

folder_name: str

Name of the raw folder under raw_dir.

Returns:

Name Type Description
str str

Folder name. Included for API compatibility; this dataset does not use extracted raw archives.

raw_file_names property

raw_file_names: list[str]

Raw-file requirements for download skipping.

This dataset is generated locally and does not require downloaded raw files, so this returns an empty list.

Returns:

Type Description
list[str]

list[str]: Empty list.

process

process() -> None

Generate synthetic samples and write processed artifacts to disk.

This method generates
  • processed_dir/splits.yaml: A dict of 10 random splits with keys 0..9.
  • One .pt file per generated sample containing a FlooderData object.
  • processed_dir/_done: A sentinel file indicating processing completion.
Sample generation details
  • Points are generated inside an axis-aligned box with corners rect_min = [0,0,0] and rect_max = [5,5,5].
  • For each class k in ks, the generator creates num_per_class samples using generate_swiss_cheese_points(num_points, ..., k, ...).
  • Each sample is labeled with y = ki where ki is the index of k in ks.

Raises:

Type Description
OSError

If the processed directory cannot be written.

RuntimeError

If torch.save fails.

Exception

Propagates exceptions from generate_swiss_cheese_points.