Datasets

flooder.datasets.datasets

Implementation of datasets used in the original Flooder paper.

Copyright (c) 2025 Paolo Pellizzoni, Florian Graf, Martin Uray, Stefan Huber and Roland Kwitt SPDX-License-Identifier: MIT

BaseDataset

BaseDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: Dataset

Base class for Flooder datasets with download/process/load lifecycle.

This class provides a dataset API inspired by torch_geometric.data.Dataset, including:
  • a standard directory layout (root/raw and root/processed);
  • a lifecycle executed at construction time: download, process, load;
  • integer indexing to return items, and advanced indexing to return a subset "view" of the dataset;
  • optional per-sample transformations.

Subclasses must implement the abstract properties/methods that specify file requirements and dataset-specific loading logic.

Attributes:

Name Type Description
root str

Root directory containing the dataset folders.

fixed_transform Callable[[FlooderData], FlooderData] | None

Optional transform applied once to each item during _load() (i.e., at dataset load time, before storing in memory).

transform Callable[[FlooderData], FlooderData] | None

Optional transform applied on-the-fly in __getitem__ for individual samples.

_indices Sequence[int] | None

If not None, defines a subset view over the underlying dataset indices.

Notes
  • The constructor triggers _download(), _process(), and _load(). This means instantiation may perform I/O and compute.
  • Advanced indexing (slice, sequences, boolean masks) returns a shallow-copied dataset object sharing the same underlying storage (whatever the subclass uses), but with _indices set.
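The subset-view mechanics described in these notes can be illustrated with a minimal stand-in class (a sketch of the behavior only, not the real BaseDataset; names are illustrative):

```python
import copy

class TinyDataset:
    """Minimal stand-in illustrating the subset-view mechanics
    (a sketch, not the real BaseDataset)."""

    def __init__(self, items):
        self.items = items        # underlying storage, shared across views
        self._indices = None      # None means "full dataset"

    def indices(self):
        return range(len(self.items)) if self._indices is None else self._indices

    def __len__(self):
        return len(self.indices())

    def __getitem__(self, idx):
        if isinstance(idx, int):
            return self.items[self.indices()[idx]]
        view = copy.copy(self)    # shallow copy: storage is shared
        if isinstance(idx, slice):
            view._indices = list(self.indices())[idx]
        else:
            view._indices = [self.indices()[i] for i in idx]
        return view

ds = TinyDataset(list("abcde"))
sub = ds[1:4]                     # subset view over global indices 1..3
assert len(sub) == 3 and sub[0] == "b"
assert sub.items is ds.items      # same underlying storage
```

The key point is that a view is cheap: only the index mapping is copied, never the data.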

Initialize a dataset and execute the download/process/load lifecycle.

Parameters:

Name Type Description Default
root str

Root directory where raw and processed data are stored.

required
fixed_transform Callable[[FlooderData], FlooderData] | None

Optional transform applied once to each item during _load().

None
transform Callable[[FlooderData], FlooderData] | None

Optional transform applied on-the-fly in __getitem__.

None
Notes

Instantiation may perform I/O and compute by calling _download(), _process(), and _load().

processed_dir property

processed_dir: str

Directory containing processed files.

Returns:

Name Type Description
str str

Path to <root>/processed.

processed_file_names property

processed_file_names: Union[str, List[str], Tuple[str, ...]]

Required processed files to consider the dataset processed.

Subclasses should return the file name(s) expected to exist in processed_dir. If all such files exist, _process() is skipped.

Returns:

Type Description
Union[str, List[str], Tuple[str, ...]]

File name(s) expected inside self.processed_dir.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

processed_paths property

processed_paths: List[str]

Absolute paths to required processed files.

Returns:

Type Description
List[str]

List of absolute paths for processed_file_names under processed_dir.

raw_dir property

raw_dir: str

Directory containing raw downloaded files.

Returns:

Name Type Description
str str

Path to <root>/raw.

raw_file_names property

raw_file_names: Union[str, List[str], Tuple[str, ...]]

Required raw files to consider the dataset downloaded.

Subclasses should return the file name(s) expected to exist in raw_dir. If all such files exist, _download() is skipped.

Returns:

Type Description
Union[str, List[str], Tuple[str, ...]]

File name(s) expected inside self.raw_dir.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

raw_paths property

raw_paths: List[str]

Absolute paths to required raw files.

Returns:

Type Description
List[str]

List of absolute paths for raw_file_names under raw_dir.

__getitem__

__getitem__(
    idx: Union[int, integer, IndexType],
) -> "FlooderData | BaseDataset"

Get an item or a subset of the dataset.

Behavior depends on the type of idx:

  • If idx is an integer (Python int, np.integer, 0-dim Tensor, or scalar np.ndarray), returns a single FlooderData object corresponding to the view index idx.
  • Otherwise, returns a subset view of the dataset created via index_select(idx).

If transform is set, it is applied on-the-fly to single-item access.

Parameters:

Name Type Description Default
idx int | integer | slice | Tensor | ndarray | Sequence

Index or indices selecting items.

required

Returns:

Type Description
'FlooderData | BaseDataset'

FlooderData | BaseDataset: A single data object if idx is scalar-like, otherwise a BaseDataset subset view.

Raises:

Type Description
IndexError

If idx type is unsupported (delegated to index_select).

__iter__

__iter__() -> Iterator[FlooderData]

Iterate over items in the current dataset view.

Yields:

Name Type Description
FlooderData FlooderData

Items in order from 0 to len(self) - 1, with transform applied if configured.

__len__

__len__() -> int

Return the number of examples in the current dataset view.

For a full dataset this equals self.len(). For a subset view this equals len(self._indices).

Returns:

Name Type Description
int int

Number of examples exposed by this dataset instance.

download

download() -> None

Download the dataset into raw_dir.

Subclasses must implement the dataset-specific download logic. This method is called by _download() only if the required files in raw_paths are not present.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

get

get(idx: int) -> FlooderData

Return the data object at a given global index.

idx refers to the underlying dataset index, not the subset-view index. The subset mapping is handled by __getitem__ via indices().

Parameters:

Name Type Description Default
idx int

Global index into the underlying dataset storage.

required

Returns:

Name Type Description
FlooderData FlooderData

The data object at the given index.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

index_select

index_select(idx: IndexType) -> 'BaseDataset'

Create a subset view of the dataset from specified indices.

Supported index types:
  • slice: includes support for float boundaries, e.g. dataset[:0.9], interpreted as a fraction of the current view length.
  • torch.Tensor of dtype long: treated as integer indices.
  • torch.Tensor of dtype bool: treated as a boolean mask.
  • np.ndarray of dtype int64: treated as integer indices.
  • np.ndarray of dtype bool: treated as a boolean mask.
  • Sequence (excluding str): treated as a list of integer indices.

The returned dataset is a shallow copy of self with _indices set to map view indices to global indices.
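These normalization rules can be sketched with a small numpy-only helper (hypothetical code; the real index_select additionally handles torch.Tensor indices and masks):

```python
import numpy as np

def normalize_index(idx, n):
    """Hypothetical helper sketching index_select's normalization rules
    for a view of length n (torch.Tensor handling omitted)."""
    if isinstance(idx, slice):
        start, stop = idx.start, idx.stop
        # Float boundaries are fractions of the current view length,
        # e.g. dataset[:0.9] selects the first 90% of the view.
        if isinstance(start, float):
            start = int(start * n)
        if isinstance(stop, float):
            stop = int(stop * n)
        return list(range(*slice(start, stop, idx.step).indices(n)))
    if isinstance(idx, np.ndarray):
        if idx.dtype == bool:
            return np.flatnonzero(idx).tolist()  # boolean mask
        return idx.astype(int).tolist()          # integer indices
    if isinstance(idx, (list, tuple)):
        return [int(i) for i in idx]             # plain integer sequence
    raise IndexError(f"Unsupported index type: {type(idx)!r}")

assert normalize_index(slice(None, 0.9), 10) == list(range(9))   # dataset[:0.9]
assert normalize_index(np.array([True, False, True]), 3) == [0, 2]
```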

Parameters:

Name Type Description Default
idx slice | Sequence[int] | Tensor | ndarray

Indices specifying the subset.

required

Returns:

Name Type Description
BaseDataset 'BaseDataset'

A subset view of the dataset.

Raises:

Type Description
IndexError

If idx is not one of the supported types or has an unsupported dtype.

indices

indices() -> Sequence

Return the active index mapping for this dataset view.

For a full dataset (no subset), this is range(self.len()). For a subset view created via index_select, this is the stored _indices sequence.

Returns:

Type Description
Sequence

Sequence[int]: Index mapping from view indices to global indices.

len

len() -> int

Return the number of items in the full dataset.

This method is analogous to torch_geometric.data.Dataset.len(). It should return the total number of items in the underlying dataset, not the size of a subset view created via index_select.

Returns:

Name Type Description
int int

Total number of data objects stored by the dataset.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

process

process() -> None

Process raw files into processed_dir.

Subclasses must implement the dataset-specific processing logic. This method is called by _process() only if the required files in processed_paths are not present.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

shuffle

shuffle(
    return_perm: bool = False,
) -> "BaseDataset | Tuple[BaseDataset, Tensor]"

Return a shuffled subset view of the dataset.

This method generates a random permutation of the current dataset view and returns a subset view with that ordering.

Parameters:

Name Type Description Default
return_perm bool

If True, also return the permutation tensor.

False

Returns:

Type Description
'BaseDataset | Tuple[BaseDataset, Tensor]'

BaseDataset | tuple[BaseDataset, torch.Tensor]: If return_perm is False, returns the shuffled dataset view. If True, returns (dataset, perm) where perm is a 1D long tensor of indices into the current view.

CoralDataset

CoralDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

Coral point-cloud dataset used in the Flooder paper.

This dataset consists of 81 point clouds, each comprising 1 million points uniformly sampled from surface meshes of coral specimens provided by the Smithsonian 3D Digitization program (https://3d.si.edu/corals). Labels correspond to the coral's genus, with 31 Acroporidae samples (label 0) and 52 Poritidae samples (label 1).

The dataset is distributed as a compressed archive (corals.tar.zst) hosted on Google Drive. The archive is downloaded, validated via SHA256, extracted into raw_dir/<folder_name>/, and processed into per-sample .pt files stored in processed_dir.

Each raw sample is stored as a .npy array that is loaded and normalized by dividing by 32767, and cast to float32. Labels are read from the dataset metadata (meta.yaml) under ydata['data'][<filename>]['label'].

The processed sample is represented as
  • x: torch.FloatTensor containing the point cloud
  • y: integer class label
  • name: sample identifier derived from the file stem

Directory structure (expected after extraction):

raw_dir/corals/
    meta.yaml
    splits.yaml
    *.npy
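The normalization step can be sketched as follows (the raw arrays are assumed to be int16-quantized, inferred from the 32767 divisor; the final torch conversion is indicated in a comment):

```python
import numpy as np

# Assumed int16 quantization, inferred from the 32767 divisor:
quantized = np.array([[32767, 0, -32767]], dtype=np.int16)   # stand-in for np.load(file)
points = (quantized / 32767).astype(np.float32)              # normalize and cast
assert points.dtype == np.float32
assert points.max() == 1.0 and points.min() == -1.0
# process_file would then convert to a torch tensor, e.g. torch.from_numpy(points)
```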

See Also

FlooderDataset: Implements the download/process/load lifecycle.

checksum property

checksum: str

Expected SHA256 checksum of the downloaded archive.

Returns:

Name Type Description
str str

Lowercase hex-encoded SHA256 digest for corals.tar.zst.

file_id property

file_id: str

Google Drive file id for the dataset archive.

Returns:

Name Type Description
str str

Google Drive file id used to construct the download URL.

raw_file_names property

raw_file_names: list[str]

Name of the extracted raw folder under raw_dir.

Returns:

Name Type Description
list[str] list[str]

Folder name containing raw dataset files after extraction.

process_file

process_file(file: Path, ydata: dict) -> FlooderData

Convert a raw .npy file into a FlooderData example.

Loads the point cloud from file using numpy.load, normalizes values by dividing by 32767, casts to float32, and converts to a PyTorch tensor. The class label is read from metadata.

Parameters:

Name Type Description Default
file Path

Path to the raw .npy file to process.

required
ydata dict

Parsed YAML metadata from meta.yaml. Must contain an entry ydata['data'][file.name]['label'].

required

Returns:

Name Type Description
FlooderData FlooderData

Processed example with fields (x, y, name).

Raises:

Type Description
KeyError

If the expected label entry is missing from ydata.

OSError

If the .npy file cannot be read.

ValueError

If the .npy content cannot be converted to float32.

FlooderDataset

FlooderDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: BaseDataset

Base class for Flooder paper datasets distributed as compressed archives.

This dataset class implements a standard pipeline:

1) Download a .tar.zst archive from Google Drive (via gdown), identified by file_id, and validate it with a SHA256 checksum.
2) Decompress and extract the archive into raw_dir/<folder_name>/.
3) Read dataset metadata (meta.yaml) and split definitions (splits.yaml) from the extracted raw folder.
4) Convert each raw .npy file into a FlooderData-like object via process_file(...), and store it as a .pt file in processed_dir.
5) Load all .pt files into memory (self.data) in _load(), optionally applying fixed_transform once per sample at load time.

Subclasses are expected to define
  • file_id: Google Drive file id
  • checksum: expected SHA256 checksum of the downloaded archive
  • folder_name: name of the extracted folder under raw_dir
  • raw_file_names: name(s) of the downloaded raw archive file(s)
  • process_file(...): conversion logic from a .npy file and metadata
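A hypothetical subclass skeleton sketching the five hooks listed above (all names and values are placeholders, and the real subclass would inherit from FlooderDataset):

```python
from pathlib import Path

class MyArchiveDataset:  # would subclass FlooderDataset in practice
    """Hypothetical subclass sketch showing the five required hooks."""

    @property
    def file_id(self) -> str:
        return "GOOGLE-DRIVE-FILE-ID"   # placeholder Google Drive id

    @property
    def checksum(self) -> str:
        return "0" * 64                 # expected SHA256 digest (placeholder)

    @property
    def folder_name(self) -> str:
        return "mydata"                 # raw_dir/mydata/ after extraction

    @property
    def raw_file_names(self) -> list:
        return ["mydata.tar.zst"]       # the downloaded archive

    def process_file(self, file: Path, ydata: dict):
        # dataset-specific: load `file`, build a FlooderData-like object
        raise NotImplementedError

ds = MyArchiveDataset()
assert ds.raw_file_names == ["mydata.tar.zst"]
```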

Attributes:

Name Type Description
data list[FlooderData]

In-memory list of processed examples loaded from .pt files. Populated by _load().

splits dict

Split definitions loaded from processed_dir/splits.yaml. The structure depends on the dataset, but is typically a mapping from split identifier (e.g., fold index) to dicts containing keys like 'trn', 'val', 'tst' with integer indices.

classes list[int]

Sorted list of unique class labels observed across the dataset (computed after loading).

num_classes int

Number of unique classes.

checksum property

checksum: str

Expected SHA256 checksum of the downloaded archive.

The checksum is used by validate(...) after download. If the computed SHA256 does not match, a warning is emitted.

Returns:

Name Type Description
str str

Lowercase hex-encoded SHA256 digest of the expected archive.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

file_id property

file_id: str

Google Drive file id for the dataset archive.

Subclasses must provide the id used to construct the download URL: https://drive.google.com/uc?id=<file_id>.

Returns:

Name Type Description
str str

The Google Drive file id.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

folder_name property

folder_name: str

Name of the extracted folder under raw_dir.

After extraction, the expected raw folder is raw_dir/<folder_name>/ containing meta.yaml, splits.yaml, and the .npy files.

Returns:

Name Type Description
str str

Folder name within raw_dir.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

processed_file_names property

processed_file_names: list[str]

Processed-file sentinel list for determining whether processing is done.

The default convention for Flooder datasets is
  • _done: an empty file indicating processing completion
  • splits.yaml: split definitions copied to the processed directory

Returns:

Type Description
list[str]

list[str]: List of required processed file names.

download

download() -> None

Download the dataset archive from Google Drive into raw_dir.

Constructs a Google Drive download URL from file_id and downloads into raw_dir/<raw_file_names[0]> using gdown. After downloading, calls validate(...) to check integrity.
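The URL construction can be sketched as follows (the file id is a placeholder; the gdown call is shown commented out to avoid network access):

```python
# Sketch of the download step described above. The id is a placeholder,
# not a real Google Drive file id.
file_id = "EXAMPLE_FILE_ID"
url = f"https://drive.google.com/uc?id={file_id}"

# The actual transfer is then delegated to gdown, e.g.:
# import gdown
# gdown.download(url, str(destination_path), quiet=False)
assert url == "https://drive.google.com/uc?id=EXAMPLE_FILE_ID"
```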

Raises:

Type Description
IndexError

If raw_file_names is empty.

OSError

If the destination file cannot be written.

Exception

Propagates errors from gdown.download(...).

get

get(idx: int) -> FlooderData

Return the in-memory data object at the given global index.

This implementation assumes _load() has populated self.data with objects saved in processed_dir as .pt files.

Parameters:

Name Type Description Default
idx int

Global index into self.data.

required

Returns:

Name Type Description
FlooderData FlooderData

The data item at idx.

get_split_indices

get_split_indices(splits_data) -> dict

Extract split indices from raw splits.yaml content.

The default behavior expects the raw splits.yaml to contain a top-level key "splits" holding the split definitions.

Subclasses may override this method if their splits.yaml uses a different schema.

Parameters:

Name Type Description Default
splits_data dict

Parsed YAML content from splits.yaml.

required

Returns:

Name Type Description
Any dict

The split indices structure to be saved into processed_dir/splits.yaml. Typically a dict mapping fold id to split dicts ('trn', 'val', 'tst'), but may vary by dataset.

len

len() -> int

Return the number of examples in the full dataset.

Returns:

Name Type Description
int int

Total number of examples, equal to len(self.data) after _load().

process

process() -> None

Process the extracted raw dataset into serialized .pt files.

Processing performs the following steps:

1) Ensure the archive has been extracted into raw_dir/<folder_name>/.
2) Load metadata from meta.yaml and split definitions from splits.yaml.
3) Save extracted split indices into processed_dir/splits.yaml.
4) Iterate over all .npy files in the extracted folder, sorted by name.
5) For each .npy, call process_file(file, ydata) and save the returned object as <stem>.pt in processed_dir.
6) Create the _done sentinel file in processed_dir.

Raises:

Type Description
FileNotFoundError

If required raw files (meta.yaml, splits.yaml, or .npy files) are missing.

YAMLError

If YAML parsing fails.

OSError

For I/O errors reading raw files or writing processed files.

RuntimeError

If torch.save fails for a produced object.

process_file

process_file(file: Path, ydata: dict) -> FlooderData

Convert a raw .npy file into a FlooderData-like object.

Subclasses must implement the dataset-specific logic for reading file (typically via numpy.load) and for producing an instance of FlooderData (or a subclass like FlooderRocksData).

Parameters:

Name Type Description Default
file Path

Path to a .npy file inside the extracted raw folder.

required
ydata dict

Metadata loaded from meta.yaml. The structure is dataset- dependent but typically contains labels and other targets keyed by file name.

required

Returns:

Name Type Description
FlooderData FlooderData

Processed data object to be saved as a .pt file.

Raises:

Type Description
NotImplementedError

If not implemented by a subclass.

unzip_file

unzip_file() -> None

Decompress and extract the dataset archive into raw_dir.

This method reads the first file in raw_paths as a .tar.zst archive, decompresses it using zstandard, and extracts it using tarfile.

Extraction behavior depends on Python version
  • Python >= 3.12: uses tar.extractall(..., filter='data') to apply tarfile's safety filter.
  • Older versions: falls back to tar.extractall(...).

Raises:

Type Description
FileNotFoundError

If raw_paths[0] does not exist.

TarError

If the archive is invalid or cannot be read.

ZstdError

If decompression fails.

OSError

For I/O errors during reading or extraction.

Security

Be careful extracting archives from untrusted sources. While Python 3.12's filter='data' mitigates some risks, older versions extract without filtering.

validate

validate(file_path: Path) -> None

Validate a downloaded archive against the expected SHA256 checksum.

Computes the SHA256 digest of file_path and compares it to self.checksum. If they do not match, emits a UserWarning.

Parameters:

Name Type Description Default
file_path Path | str

Path to the downloaded archive.

required

Warns:

Type Description
UserWarning

If the computed checksum differs from the expected checksum.

Raises:

Type Description
FileNotFoundError

If file_path does not exist.

OSError

For I/O errors reading the file.
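The checksum computation can be sketched as a streamed SHA256 (a hypothetical helper mirroring what validate() is described to compute; the chunked read keeps memory bounded for multi-gigabyte archives):

```python
import hashlib
import os
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Streamed SHA256 of a file (hypothetical helper, not the library's code)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Quick check against hashlib on a small temporary file:
fd, tmp = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)
assert sha256_of(Path(tmp)) == hashlib.sha256(b"hello").hexdigest()
os.remove(tmp)
```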

LargePointCloudDataset

LargePointCloudDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

Dataset of large-scale point clouds used in the Flooder paper.

This dataset contains two point clouds with more than 10M points each, distributed as a compressed .tar.zst archive and hosted on Google Drive. The archive is downloaded, validated via SHA256, extracted into raw_dir/<folder_name>/, and processed into per-sample .pt files.

Each processed sample is stored as a LargePointCloudData dataclass with the following attributes
  • x: torch.FloatTensor of point coordinates
  • name: sample identifier
  • description: brief description of the point cloud
Expected extracted raw directory structure

raw/large/
    meta.yaml
    coral.pt
    virus.pt

See Also

FlooderDataset: Implements the shared download, processing, and loading pipeline.

checksum property

checksum: str

Expected SHA256 checksum of the downloaded archive.

Returns:

Name Type Description
str str

Lowercase hex-encoded SHA256 digest for large.tar.zst.

file_id property

file_id: str

Google Drive file id for the LargePointCloudDataset dataset archive.

Returns:

Name Type Description
str str

Google Drive file id used to construct the download URL.

folder_name property

folder_name: str

Name of the extracted raw folder under raw_dir.

Returns:

Name Type Description
str str

Folder name containing the extracted large dataset files.

raw_file_names property

raw_file_names: list[str]

Raw archive file name(s) expected in raw_dir.

Returns:

Type Description
list[str]

list[str]: List containing the dataset archive file name.

uncompressed_file_names property

uncompressed_file_names: list[str]

Uncompressed file name(s) expected in raw_dir.

Returns:

Type Description
list[str]

list[str]: List containing the uncompressed file names.

get

get(idx) -> LargePointCloudData

Return the data object at a given index (either 0 or 1).

Parameters:

Name Type Description Default
idx int

Index to access.

required

Returns:

Name Type Description
LargePointCloudData LargePointCloudData

The data object at the given index.

len

len() -> int

Return the number of items in the full dataset.

process

process() -> None

Extract the raw dataset.

MCBDataset

MCBDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

MCB point-cloud dataset used in the Flooder paper.

This dataset consists of 1745 point clouds, each comprising 1 million points uniformly sampled from surface meshes from a subset of the MCB dataset (A large-scale annotated mechanical components benchmark for classification and retrieval tasks with deep neural networks, ECCV, 2020) available at https://github.com/stnoah1/mcb.

The dataset is distributed as a compressed .tar.zst archive hosted on Google Drive. The archive is downloaded, validated using a SHA256 checksum, extracted into raw_dir/<folder_name>/, and processed into per-sample .pt files stored in processed_dir.

Each raw sample is stored as a .npy array containing quantized point coordinates. During processing, coordinates are normalized by dividing by 32767 and cast to float32.

The processed sample representation is
  • x: torch.FloatTensor of normalized point coordinates
  • y: integer class label
  • name: sample identifier derived from the file stem
Expected extracted raw directory structure

raw_dir/mcb/
    meta.yaml
    splits.yaml
    *.npy

See Also

FlooderDataset: Implements the shared download, processing, and loading pipeline.

checksum property

checksum: str

Expected SHA256 checksum of the downloaded archive.

Returns:

Name Type Description
str str

Lowercase hex-encoded SHA256 digest for mcb.tar.zst.

file_id property

file_id: str

Google Drive file id for the MCB dataset archive.

Returns:

Name Type Description
str str

Google Drive file id used to construct the download URL.

folder_name property

folder_name: str

Name of the extracted raw folder under raw_dir.

Returns:

Name Type Description
str str

Folder name containing the extracted MCB dataset files.

raw_file_names property

raw_file_names: list[str]

Raw archive file name(s) expected in raw_dir.

Returns:

Type Description
list[str]

list[str]: List containing the dataset archive file name.

process_file

process_file(file: Path, ydata: dict) -> FlooderData

Convert a raw .npy file into a FlooderData example.

Loads the raw point cloud from file, normalizes coordinates by dividing by 32767, casts to float32, and converts to a PyTorch tensor. The class label is read from the dataset metadata.

Parameters:

Name Type Description Default
file Path

Path to the raw .npy file inside the extracted MCB dataset folder.

required
ydata dict

Parsed YAML metadata from meta.yaml. Must contain an entry ydata['data'][file.name]['label'].

required

Returns:

Name Type Description
FlooderData FlooderData

Processed example with fields (x, y, name).

Raises:

Type Description
KeyError

If the label entry for file.name is missing in ydata.

OSError

If the .npy file cannot be read.

ValueError

If the loaded array cannot be converted to float32.

ModelNet10Dataset

ModelNet10Dataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

ModelNet10 point-cloud dataset (250k points) used in the Flooder paper.

This dataset consists of 4899 point clouds, each comprising 250k points uniformly sampled from surface meshes from the ModelNet10 dataset (Wu et al., 3D ShapeNets: A Deep Representation for Volumetric Shapes, CVPR 2015) available at https://modelnet.cs.princeton.edu/.

The dataset is distributed as a compressed .tar.zst archive hosted on Google Drive. The archive is downloaded, validated using a SHA256 checksum, extracted into raw_dir/<folder_name>/, and processed into per-sample .pt files stored in processed_dir.

Each raw sample is stored as a .npy array containing quantized point coordinates. During processing, coordinates are normalized by dividing by 32767 and cast to float32.

The processed sample representation is
  • x: torch.FloatTensor of normalized point coordinates
  • y: integer class label in [0, 9]
  • name: sample identifier derived from the file stem
Expected extracted raw directory structure

raw_dir/modelnet10_250k/
    meta.yaml
    splits.yaml
    *.npy

See Also

FlooderDataset: Implements the shared download, processing, and loading pipeline.

RocksDataset

RocksDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

Rock voxel dataset converted to point clouds with geometric targets.

This synthetic dataset consists of 1000 3D binary voxel grids representing rock samples from two classes. The voxel grids are produced by the PoreSpy library (https://porespy.org/) with classes corresponding to the generation method, fractal noise and blobs, each with 500 samples.

The dataset is distributed as a compressed archive (rocks.tar.zst) hosted on Google Drive. During processing, each voxel grid is converted into a set of 3D points by extracting the coordinates of occupied voxels and adding small random jitter to break the lattice structure.

In addition to the class label, each sample includes continuous targets such as surface area and volume.

Processed sample representation
  • x: torch.FloatTensor of shape (N, 3) containing point coordinates
  • y: integer class label
  • surface: float-valued surface area target
  • volume: float-valued volume target
  • name: sample identifier derived from the file stem
Expected extracted raw directory structure

raw_dir/rocks/
    meta.yaml
    splits.yaml
    *.npy

See Also

FlooderDataset: Implements the shared download, processing, and loading pipeline.

process_file

process_file(file: Path, ydata: dict) -> FlooderRocksData

Convert a raw voxel .npy file into a FlooderRocksData example.

Processing steps

1) Load the bit-packed voxel array from file.
2) Unpack bits into a boolean array of shape (256, 256, 256).
3) Extract the indices of occupied voxels using np.where.
4) Convert voxel indices to float coordinates and add small random jitter to avoid degenerate lattice structure.
5) Attach label and continuous targets from metadata.
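Steps 1-4 can be sketched with numpy (a small 8x8x8 grid stands in for the real 256^3 volume, the input array is randomly generated rather than loaded, and the jitter magnitude is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative seed
side = 8                                       # the real dataset uses 256
packed = rng.integers(0, 256, size=side**3 // 8, dtype=np.uint8)  # stand-in for np.load(file)

voxels = np.unpackbits(packed).reshape(side, side, side).astype(bool)  # step 2
coords = np.argwhere(voxels).astype(np.float32)                        # step 3: occupied voxels
# Step 4: sub-voxel jitter to break the lattice (magnitude is an assumption)
coords += rng.uniform(-0.5, 0.5, size=coords.shape).astype(np.float32)

assert coords.shape[1] == 3
```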

Parameters:

Name Type Description Default
file Path

Path to the raw .npy voxel file.

required
ydata dict

Parsed YAML metadata from meta.yaml. Must contain entries for label, target (surface), and volume under ydata['data'][file.name].

required

Returns:

Name Type Description
FlooderRocksData FlooderRocksData

Processed example with fields (x, y, surface, volume, name).

Raises:

Type Description
KeyError

If required metadata entries are missing.

ValueError

If the unpacked voxel array cannot be reshaped to (256, 256, 256).

OSError

If the .npy file cannot be read.

SwisscheeseDataset

SwisscheeseDataset(
    root: str,
    ks: list[int] = [10, 20],
    num_per_class: int = 500,
    num_points: int = 1000000,
    fixed_transform: Callable[[FlooderData], FlooderData]
    | None = None,
    transform: Callable[[FlooderData], FlooderData]
    | None = None,
)

Bases: FlooderDataset

Synthetic "Swiss cheese" point-cloud dataset used in the Flooder paper.

This dataset is generated procedurally (no download). Each sample consists of points uniformly sampled from a 3D axis-aligned box with multiple spherical voids removed ("Swiss cheese"). The number of voids defines the class label.

Unlike the other FlooderDataset subclasses that download a compressed archive, this dataset overrides process() to generate samples and write them directly to processed_dir as .pt files. Split definitions are also generated and saved to processed_dir/splits.yaml.

Class semantics
  • Each class corresponds to a value k in ks, where k is the number of spherical voids carved out of the sampling volume.
  • Label y is the integer index into ks (i.e., ki from enumeration).
Generated file naming

Each sample is saved under a short SHA256-derived identifier computed from the generated point array bytes. This provides deterministic naming for a fixed RNG seed and generation implementation, but note that changes to sampling code, dtype, or ordering can change the resulting hash.
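The naming scheme can be sketched as follows (the truncation length of the hex digest is an assumption, and the zero array is a stand-in for a generated point cloud):

```python
import hashlib
import numpy as np

points = np.zeros((4, 3), dtype=np.float32)    # stand-in for a generated sample
# Short identifier derived from the raw bytes; changing dtype or point
# ordering changes the hash, as noted above. Truncation length is assumed.
name = hashlib.sha256(points.tobytes()).hexdigest()[:16]
assert len(name) == 16
```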

Splits

process() generates 10 random splits (keys 0..9), each containing trn, val, and tst partitions with proportions 72% / 8% / 20%, respectively, over the full dataset indices.
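The split generation can be sketched as follows (a sketch of the described behavior, not the library's code; whether the real implementation partitions a single permutation this way is an assumption):

```python
import numpy as np

n = 1000                                   # total number of samples (example value)
rng = np.random.RandomState(42)            # fixed seed, as described above
splits = {}
for fold in range(10):                     # keys 0..9
    perm = rng.permutation(n)
    n_trn, n_val = int(0.72 * n), int(0.08 * n)   # 72% / 8% / 20%
    splits[fold] = {
        "trn": perm[:n_trn].tolist(),
        "val": perm[n_trn:n_trn + n_val].tolist(),
        "tst": perm[n_trn + n_val:].tolist(),
    }
assert len(splits[0]["trn"]) == 720 and len(splits[0]["tst"]) == 200
```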

Notes
  • Because generation is performed inside process(), instantiation may be compute- and storage-intensive, depending on num_points and dataset size.
  • This class sets a fixed RNG seed (np.random.RandomState(42)) for split generation. The point generation itself depends on the behavior of generate_swiss_cheese_points (and any randomness inside it).

Initialize the Swiss cheese dataset generator.

Parameters:

Name Type Description Default
root str

Root directory where the dataset is stored.

required
ks list[int]

List of void counts, one per class. Each k produces a distinct class corresponding to point clouds with k voids.

[10, 20]
num_per_class int

Number of samples generated for each class. Total dataset size is len(ks) * num_per_class.

500
num_points int

Number of points generated per sample point cloud.

1000000
fixed_transform Callable[[FlooderData], FlooderData] | None

Optional transform applied once per example during _load().

None
transform Callable[[FlooderData], FlooderData] | None

Optional transform applied on-the-fly in __getitem__.

None
Notes
  • Split generation uses a fixed seed (42) via np.random.RandomState.
  • Generation and serialization are performed during process(), which is invoked during FlooderDataset construction if processed artifacts are missing.

folder_name property

folder_name: str

Name of the raw folder under raw_dir.

Returns:

Name Type Description
str str

Folder name. Included for API compatibility; this dataset does not use extracted raw archives.

raw_file_names property

raw_file_names: list[str]

Raw-file requirements for download skipping.

This dataset is generated locally and does not require downloaded raw files, so this returns an empty list.

Returns:

Type Description
list[str]

list[str]: Empty list.

process

process() -> None

Generate synthetic samples and write processed artifacts to disk.

This method generates
  • processed_dir/splits.yaml: A dict of 10 random splits with keys 0..9.
  • One .pt file per generated sample containing a FlooderData object.
  • processed_dir/_done: A sentinel file indicating processing completion.
Sample generation details
  • Points are generated inside an axis-aligned box with corners rect_min = [0,0,0] and rect_max = [5,5,5].
  • For each class k in ks, the generator creates num_per_class samples using generate_swiss_cheese_points(num_points, ..., k, ...).
  • Each sample is labeled with y = ki where ki is the index of k in ks.

Raises:

Type Description
OSError

If the processed directory cannot be written.

RuntimeError

If torch.save fails.

Exception

Propagates exceptions from generate_swiss_cheese_points.