Datasets
flooder.datasets.datasets
Implementation of datasets used in the original Flooder paper.
Copyright (c) 2025 Paolo Pellizzoni, Florian Graf, Martin Uray, Stefan Huber and Roland Kwitt SPDX-License-Identifier: MIT
BaseDataset
BaseDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData] | None = None,
    transform: Callable[[FlooderData], FlooderData] | None = None,
)
Bases: Dataset
Base class for Flooder datasets with download/process/load lifecycle.
This class provides a dataset API inspired by
torch_geometric.data.Dataset, including:
- a standard directory layout (root/raw and root/processed);
- a lifecycle executed at construction time: download, process, load;
- integer indexing to return items, and advanced indexing to return
a subset "view" of the dataset;
- optional per-sample transformations.
Subclasses must implement the abstract properties/methods that specify file requirements and dataset-specific loading logic.
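The lifecycle above can be sketched with a toy stand-in. `ToyDataset` and its file names are hypothetical and use plain text files instead of the real raw/processed artifacts; the actual `BaseDataset` additionally handles transforms and advanced indexing.

```python
import os
import tempfile

class ToyDataset:
    """Minimal illustration of the download/process/load lifecycle."""

    def __init__(self, root):
        self.root = root
        self.raw_dir = os.path.join(root, "raw")
        self.processed_dir = os.path.join(root, "processed")
        self._download()
        self._process()
        self._load()

    def _download(self):
        # Skipped in effect if the required raw file already exists.
        os.makedirs(self.raw_dir, exist_ok=True)
        raw = os.path.join(self.raw_dir, "data.txt")
        if not os.path.exists(raw):
            with open(raw, "w") as f:
                f.write("1 2 3")

    def _process(self):
        # Skipped in effect if the sentinel file already exists.
        os.makedirs(self.processed_dir, exist_ok=True)
        done = os.path.join(self.processed_dir, "_done")
        if not os.path.exists(done):
            with open(os.path.join(self.raw_dir, "data.txt")) as f:
                items = f.read().split()
            with open(os.path.join(self.processed_dir, "items.txt"), "w") as f:
                f.write("\n".join(items))
            open(done, "w").close()

    def _load(self):
        with open(os.path.join(self.processed_dir, "items.txt")) as f:
            self.data = [int(v) for v in f.read().splitlines()]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

root = tempfile.mkdtemp()
ds = ToyDataset(root)   # construction triggers download/process/load
ds2 = ToyDataset(root)  # a second construction reuses the cached files
print(len(ds), ds[0])
```

Note that, as in the real class, all I/O happens at construction time; indexing afterwards only touches in-memory data.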
Attributes:
| Name | Type | Description |
|---|---|---|
| root | str | Root directory containing the dataset folders. |
| fixed_transform | Callable[[FlooderData], FlooderData] \| None | Optional transform applied once to each item during loading. |
| transform | Callable[[FlooderData], FlooderData] \| None | Optional transform applied on-the-fly in `__getitem__`. |
| _indices | Sequence[int] \| None | If not None, defines a subset view over the underlying dataset indices. |
Notes
- The constructor triggers `_download()`, `_process()`, and `_load()`. This means instantiation may perform I/O and compute.
- Advanced indexing (slice, sequences, boolean masks) returns a shallow-copied dataset object sharing the same underlying storage (whatever the subclass uses), but with `_indices` set.
Initialize a dataset and execute the download/process/load lifecycle.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str | Root directory where raw and processed data are stored. | required |
| fixed_transform | Callable[[FlooderData], FlooderData] \| None | Optional transform applied once to each item during loading. | None |
| transform | Callable[[FlooderData], FlooderData] \| None | Optional transform applied on-the-fly in `__getitem__`. | None |
Notes
Instantiation may perform I/O and compute by calling _download(),
_process(), and _load().
processed_dir
property
Directory containing processed files.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Path to `root/processed`. |
processed_file_names
property
Required processed files to consider the dataset processed.
Subclasses should return the file name(s) expected to exist in
processed_dir. If all such files exist, _process() is skipped.
Returns:
| Type | Description |
|---|---|
| Union[str, List[str], Tuple[str, ...]] | File name(s) expected inside `processed_dir`. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If not implemented by a subclass. |
processed_paths
property
Absolute paths to required processed files.
Returns:
| Type | Description |
|---|---|
| List[str] | List of absolute paths for the files in `processed_file_names`. |
raw_dir
property
Directory containing raw downloaded files.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Path to `root/raw`. |
raw_file_names
property
Required raw files to consider the dataset downloaded.
Subclasses should return the file name(s) expected to exist in
raw_dir. If all such files exist, _download() is skipped.
Returns:
| Type | Description |
|---|---|
| Union[str, List[str], Tuple[str, ...]] | File name(s) expected inside `raw_dir`. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If not implemented by a subclass. |
raw_paths
property
Absolute paths to required raw files.
Returns:
| Type | Description |
|---|---|
| List[str] | List of absolute paths for the files in `raw_file_names`. |
__getitem__
Get an item or a subset of the dataset.
Behavior depends on the type of idx:
- If `idx` is an integer (Python `int`, `np.integer`, 0-dim `Tensor`, or scalar `np.ndarray`), returns a single `FlooderData` object corresponding to the view index `idx`.
- Otherwise, returns a subset view of the dataset created via `index_select(idx)`.
If transform is set, it is applied on-the-fly to single-item access.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | int \| integer \| slice \| Tensor \| ndarray \| Sequence | Index or indices selecting items. | required |
Returns:
| Type | Description |
|---|---|
| FlooderData \| BaseDataset | A single data object if `idx` is an integer; otherwise a subset view of the dataset. |
Raises:
| Type | Description |
|---|---|
| IndexError | If `idx` is out of range for the current view. |
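The integer-versus-advanced-index dispatch described above can be sketched on a minimal stand-in. `View` is hypothetical and returns plain items and plain views instead of `FlooderData` objects and `BaseDataset` subsets:

```python
import numpy as np

def is_int_like(idx):
    # Mirrors the "integer" cases listed above: Python int, np.integer,
    # or a 0-dim / scalar ndarray.
    return isinstance(idx, (int, np.integer)) or (
        isinstance(idx, np.ndarray) and idx.ndim == 0
    )

class View:
    def __init__(self, items, indices=None):
        self.items = items  # shared underlying storage
        self._indices = range(len(items)) if indices is None else indices

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, idx):
        if is_int_like(idx):
            # single-item access through the view's index mapping
            return self.items[self._indices[int(idx)]]
        # advanced indexing: build a subset view sharing self.items
        if isinstance(idx, slice):
            idx = range(*idx.indices(len(self)))
        return View(self.items, [self._indices[i] for i in idx])

ds = View(["a", "b", "c", "d"])
print(ds[1])             # single item
sub = ds[1:3]            # subset view sharing storage
print(len(sub), sub[0])
```

Note that chained views compose: indexing `sub` goes through `sub._indices` back into the shared storage, just as the docstring describes.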
__iter__
Iterate over items in the current dataset view.
Yields:
| Name | Type | Description |
|---|---|---|
| FlooderData | FlooderData | Items in order from the current dataset view. |
__len__
Return the number of examples in the current dataset view.
For a full dataset this equals self.len(). For a subset view this
equals len(self._indices).
Returns:
| Name | Type | Description |
|---|---|---|
| int | int | Number of examples exposed by this dataset instance. |
download
Download the dataset into raw_dir.
Subclasses must implement the dataset-specific download logic.
This method is called by _download() only if the required files
in raw_paths are not present.
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If not implemented by a subclass. |
get
Return the data object at a given global index.
idx refers to the underlying dataset index, not the subset-view
index. The subset mapping is handled by __getitem__ via indices().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Global index into the underlying dataset storage. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| FlooderData | FlooderData | The data object at the given index. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If not implemented by a subclass. |
index_select
Create a subset view of the dataset from specified indices.
Supported index types:
- slice: includes support for float boundaries, e.g. dataset[:0.9],
interpreted as a fraction of the current view length.
- torch.Tensor of dtype long: treated as integer indices.
- torch.Tensor of dtype bool: treated as a boolean mask.
- np.ndarray of dtype int64: treated as integer indices.
- np.ndarray of dtype bool: treated as a boolean mask.
- Sequence (excluding str): treated as a list of integer indices.
The returned dataset is a shallow copy of self with _indices set
to map view indices to global indices.
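A hypothetical helper mirroring the index-normalization rules listed above (float slice boundaries as fractions, boolean masks, integer sequences); this is a sketch of the behavior, not the library's actual implementation:

```python
import numpy as np

def normalize_index(idx, n):
    """Turn a supported index into a list of integer positions over n items."""
    if isinstance(idx, slice):
        start, stop, step = idx.start, idx.stop, idx.step
        # float boundaries are interpreted as fractions of the view length
        if isinstance(start, float):
            start = round(start * n)
        if isinstance(stop, float):
            stop = round(stop * n)
        return list(range(*slice(start, stop, step).indices(n)))
    idx = np.asarray(idx)
    if idx.dtype == bool:
        # boolean mask -> positions of True entries
        return list(np.flatnonzero(idx))
    return [int(i) for i in idx]

n = 10
print(normalize_index(slice(None, 0.9), n))                  # first 90%
print(normalize_index(np.array([True] * 3 + [False] * 7), n))
print(normalize_index([7, 2, 5], n))                          # arbitrary order
```

Note that, as in the docstring, integer sequences keep their order, so the same mechanism supports both subsetting and reordering.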
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | slice \| Sequence[int] \| Tensor \| ndarray | Indices specifying the subset. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| BaseDataset | BaseDataset | A subset view of the dataset. |
Raises:
| Type | Description |
|---|---|
| IndexError | If `idx` is of an unsupported type or contains out-of-range indices. |
indices
Return the active index mapping for this dataset view.
For a full dataset (no subset), this is range(self.len()).
For a subset view created via index_select, this is the stored
_indices sequence.
Returns:
| Type | Description |
|---|---|
| Sequence[int] | Index mapping from view indices to global indices. |
len
Return the number of items in the full dataset.
This method is analogous to torch_geometric.data.Dataset.len().
It should return the total number of items in the underlying dataset,
not the size of a subset view created via index_select.
Returns:
| Name | Type | Description |
|---|---|---|
| int | int | Total number of data objects stored by the dataset. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If not implemented by a subclass. |
process
Process raw files into processed_dir.
Subclasses must implement the dataset-specific processing logic.
This method is called by _process() only if the required files
in processed_paths are not present.
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If not implemented by a subclass. |
shuffle
Return a shuffled subset view of the dataset.
This method generates a random permutation of the current dataset view and returns a subset view with that ordering.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| return_perm | bool | If True, also return the permutation tensor. | False |
Returns:
| Type | Description |
|---|---|
| BaseDataset \| tuple[BaseDataset, Tensor] | If `return_perm` is True, a tuple of the shuffled dataset view and the permutation tensor; otherwise, the shuffled dataset view. |
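Conceptually, `shuffle(return_perm=True)` draws a permutation of the current view and reindexes through it. The sketch below uses NumPy and a plain list as stand-ins for the permutation tensor and the dataset view:

```python
import numpy as np

items = ["a", "b", "c", "d", "e"]   # stand-in for the dataset view
rng = np.random.default_rng(0)

# draw a random permutation of the current view's indices
perm = rng.permutation(len(items))

# the "shuffled view" is just the original view reindexed through perm
shuffled_view = [items[i] for i in perm]

print(list(perm), shuffled_view)
```

Keeping `perm` (the `return_perm=True` case) lets you map positions in the shuffled view back to positions in the original view, e.g. to align labels or metadata computed before shuffling.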
CoralDataset
CoralDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData] | None = None,
    transform: Callable[[FlooderData], FlooderData] | None = None,
)
Bases: FlooderDataset
Coral point-cloud dataset used in the Flooder paper.
This dataset consists of 81 point clouds, each comprising 1 million points uniformly sampled from surface meshes of coral specimens provided by the Smithsonian 3D Digitization program (https://3d.si.edu/corals). Labels correspond to the coral's genus, with 31 Acroporidae samples (label 0) and 52 Poritidae samples (label 1).
The dataset is distributed as a compressed archive (corals.tar.zst)
hosted on Google Drive. The archive is downloaded, validated via SHA256,
extracted into raw_dir/<folder_name>/, and processed into per-sample
.pt files stored in processed_dir.
Each raw sample is stored as a .npy array that is loaded and normalized
by dividing by 32767, and cast to float32. Labels are read from the
dataset metadata (meta.yaml) under ydata['data'][<filename>]['label'].
The processed sample is represented as
- `x`: `torch.FloatTensor` containing the point cloud
- `y`: integer class label
- `name`: sample identifier derived from the file stem

Directory structure (expected after extraction):

    raw_dir/corals/
        meta.yaml
        splits.yaml
        *.npy
See Also
FlooderDataset: Implements the download/process/load lifecycle.
checksum
property
Expected SHA256 checksum of the downloaded archive.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Lowercase hex-encoded SHA256 digest for `corals.tar.zst`. |
file_id
property
Google Drive file id for the dataset archive.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Google Drive file id used to construct the download URL. |
raw_file_names
property
Name of the extracted raw folder under raw_dir.
Returns:
| Type | Description |
|---|---|
| list[str] | Folder name containing raw dataset files after extraction. |
process_file
Convert a raw .npy file into a FlooderData example.
Loads the point cloud from file using numpy.load, normalizes values
by dividing by 32767, casts to float32, and converts to a PyTorch
tensor. The class label is read from metadata.
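The quantize-and-rescale step described above can be sketched directly. A random `int16` array stands in for a real raw `.npy` file; the final conversion to a `torch.FloatTensor` is shown as a comment since this sketch avoids the torch dependency:

```python
import numpy as np

# Stand-in for np.load(file): quantized int16 point coordinates.
quantized = np.random.randint(-32767, 32768, size=(1000, 3), dtype=np.int16)

# Normalize by the int16 quantization scale and cast to float32,
# as described for CoralDataset.process_file.
points = (quantized / 32767.0).astype(np.float32)

# x = torch.from_numpy(points)  # -> torch.FloatTensor stored as FlooderData.x
print(points.dtype, points.shape)
```

The division maps the quantized coordinates into roughly `[-1, 1]`, which is why the raw archives can store point clouds compactly as `int16`.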
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file | Path | Path to the raw `.npy` file. | required |
| ydata | dict | Parsed YAML metadata from `meta.yaml`. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| FlooderData | FlooderData | Processed example with fields `x`, `y`, and `name`. |
Raises:
| Type | Description |
|---|---|
| KeyError | If the expected label entry is missing from `ydata`. |
| OSError | If the `.npy` file cannot be read. |
| ValueError | If the loaded array cannot be converted to `float32`. |
FlooderDataset
FlooderDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData] | None = None,
    transform: Callable[[FlooderData], FlooderData] | None = None,
)
Bases: BaseDataset
Base class for Flooder paper datasets distributed as compressed archives.
This dataset class implements a standard pipeline:
1) Download a .tar.zst archive from Google Drive (via gdown),
identified by file_id, and validate it with a SHA256 checksum.
2) Decompress and extract the archive into raw_dir/<folder_name>/.
3) Read dataset metadata (meta.yaml) and split definitions (splits.yaml)
from the extracted raw folder.
4) Convert each raw .npy file into a FlooderData-like object via
process_file(...), and store it as a .pt file in processed_dir.
5) Load all .pt files into memory (self.data) in _load(), optionally
applying fixed_transform once per sample at load time.
Subclasses are expected to define
- `file_id`: Google Drive file id
- `checksum`: expected SHA256 checksum of the downloaded archive
- `folder_name`: name of the extracted folder under `raw_dir`
- `raw_file_names`: name(s) of the downloaded raw archive file(s)
- `process_file(...)`: conversion logic from a `.npy` file and metadata
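A skeleton of that subclass contract might look as follows. The base class is replaced by a dummy here so the sketch runs standalone; in real use you would subclass the actual `FlooderDataset`, whose constructor runs the download/process/load lifecycle, and the id, checksum, and names below are placeholders:

```python
from pathlib import Path

class FlooderDataset:  # dummy stand-in for the real base class
    pass

class MyDataset(FlooderDataset):
    @property
    def file_id(self) -> str:
        return "<google-drive-file-id>"   # placeholder id

    @property
    def checksum(self) -> str:
        return "0" * 64                   # placeholder SHA256 hex digest

    @property
    def folder_name(self) -> str:
        return "mydata"                   # extracted under raw_dir/mydata/

    @property
    def raw_file_names(self):
        return ["mydata.tar.zst"]         # downloaded archive name

    def process_file(self, file: Path, ydata: dict):
        # dataset-specific .npy -> FlooderData conversion goes here
        raise NotImplementedError

ds = MyDataset()
print(ds.folder_name, ds.raw_file_names)
```

Everything else (downloading via gdown, checksum validation, extraction, and `.pt` serialization) is inherited from the pipeline described above.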
Attributes:
| Name | Type | Description |
|---|---|---|
| data | list[FlooderData] | In-memory list of processed examples loaded from `processed_dir`. |
| splits | dict | Split definitions loaded from `splits.yaml`. |
| classes | list[int] | Sorted list of unique class labels observed across the dataset (computed after loading). |
| num_classes | int | Number of unique classes. |
checksum
property
Expected SHA256 checksum of the downloaded archive.
The checksum is used by validate(...) after download. If the computed
SHA256 does not match, a warning is emitted.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Lowercase hex-encoded SHA256 digest of the expected archive. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If not implemented by a subclass. |
file_id
property
Google Drive file id for the dataset archive.
Subclasses must provide the id used to construct the download URL:
https://drive.google.com/uc?id=<file_id>.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The Google Drive file id. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If not implemented by a subclass. |
folder_name
property
Name of the extracted folder under raw_dir.
After extraction, the expected raw folder is raw_dir/<folder_name>/
containing meta.yaml, splits.yaml, and the .npy files.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Folder name within `raw_dir`. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If not implemented by a subclass. |
processed_file_names
property
Processed-file sentinel list for determining whether processing is done.
The default convention for Flooder datasets is
- `_done`: an empty file indicating processing completion
- `splits.yaml`: split definitions copied to the processed directory

Returns:
| Type | Description |
|---|---|
| list[str] | List of required processed file names. |
download
Download the dataset archive from Google Drive into raw_dir.
Constructs a Google Drive download URL from file_id and downloads
into raw_dir/<raw_file_names[0]> using gdown. After downloading,
calls validate(...) to check integrity.
Raises:
| Type | Description |
|---|---|
| IndexError | If `raw_file_names` is empty. |
| OSError | If the destination file cannot be written. |
| Exception | Propagates errors from `gdown`. |
get
Return the in-memory data object at the given global index.
This implementation assumes _load() has populated self.data with
objects saved in processed_dir as .pt files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Global index into `self.data`. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| FlooderData | FlooderData | The data item at index `idx`. |
get_split_indices
Extract split indices from raw splits.yaml content.
The default behavior expects the raw splits.yaml to contain a top-level
key "splits" holding the split definitions.
Subclasses may override this method if their splits.yaml uses a different
schema.
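The default schema handling can be sketched as below. The nested dict stands in for the output of parsing `splits.yaml` (e.g. via `yaml.safe_load`), and the split keys/partition names follow the conventions described elsewhere in this module:

```python
# Stand-in for yaml.safe_load(open("splits.yaml")): a top-level "splits"
# key holding per-split trn/val/tst index lists.
splits_data = {
    "splits": {
        0: {"trn": [0, 1, 2], "val": [3], "tst": [4]},
    }
}

def get_split_indices(splits_data: dict):
    # Default behavior: unwrap the top-level "splits" key.
    # Subclasses override this for a different splits.yaml schema.
    return splits_data["splits"]

splits = get_split_indices(splits_data)
print(sorted(splits[0].keys()))
```

A subclass whose `splits.yaml` stores the splits at the top level (no `"splits"` key) would simply override this method to return `splits_data` unchanged.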
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| splits_data | dict | Parsed YAML content from the raw `splits.yaml`. | required |

Returns:
| Name | Type | Description |
|---|---|---|
| dict | dict | The split indices structure to be saved into `processed_dir/splits.yaml`. |
len
Return the number of examples in the full dataset.
Returns:
| Name | Type | Description |
|---|---|---|
| int | int | Total number of examples, equal to `len(self.data)`. |
process
Process the extracted raw dataset into serialized .pt files.
Processing performs the following steps:
1) Ensure the archive has been extracted into raw_dir/<folder_name>/.
2) Load metadata from meta.yaml and split definitions from splits.yaml.
3) Save extracted split indices into processed_dir/splits.yaml.
4) Iterate over all .npy files in the extracted folder, sorted by name.
5) For each .npy, call process_file(file, ydata) and save the returned
object as <stem>.pt in processed_dir.
6) Create the _done sentinel file in processed_dir.
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If required raw files (e.g. `meta.yaml`, `splits.yaml`) are missing. |
| YAMLError | If YAML parsing fails. |
| OSError | For I/O errors reading raw files or writing processed files. |
| RuntimeError | If processing cannot be completed. |
process_file
Convert a raw .npy file into a FlooderData-like object.
Subclasses must implement the dataset-specific logic for reading file
(typically via numpy.load) and for producing an instance of FlooderData
(or a subclass like FlooderRocksData).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file | Path | Path to a `.npy` file in the extracted raw folder. | required |
| ydata | dict | Metadata loaded from `meta.yaml`. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| FlooderData | FlooderData | Processed data object to be saved as a `.pt` file. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If not implemented by a subclass. |
unzip_file
Decompress and extract the dataset archive into raw_dir.
This method reads the first file in raw_paths as a .tar.zst archive,
decompresses it using zstandard, and extracts it using tarfile.
Extraction behavior depends on the Python version:
- Python >= 3.12: uses `tar.extractall(..., filter='data')` to apply tarfile's safety filter.
- Older versions: falls back to `tar.extractall(...)`.
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the archive at `raw_paths[0]` does not exist. |
| TarError | If the archive is invalid or cannot be read. |
| ZstdError | If decompression fails. |
| OSError | For I/O errors during reading or extraction. |
Security
Be careful extracting archives from untrusted sources. While Python
3.12's filter='data' mitigates some risks, older versions extract
without filtering.
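The version-dependent extraction can be sketched as below. A tiny in-memory tar archive stands in for the decompressed stream; real code would first wrap the `.tar.zst` file in a `zstandard` stream reader before handing it to `tarfile`:

```python
import io
import os
import sys
import tarfile
import tempfile

# Build a small in-memory tar archive as a stand-in for the
# zstd-decompressed .tar.zst stream.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"hello"
    info = tarfile.TarInfo("demo/a.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

dest = tempfile.mkdtemp()
with tarfile.open(fileobj=buf, mode="r") as tar:
    if sys.version_info >= (3, 12):
        # tarfile's 'data' filter rejects absolute paths, path traversal,
        # device files, etc.
        tar.extractall(dest, filter="data")
    else:
        tar.extractall(dest)  # no safety filtering on older Pythons

print(os.path.exists(os.path.join(dest, "demo", "a.txt")))
```

This mirrors the security note above: on older interpreters the fallback branch performs no filtering, so the archive source must be trusted.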
validate
Validate a downloaded archive against the expected SHA256 checksum.
Computes the SHA256 digest of file_path and compares it to self.checksum.
If they do not match, emits a UserWarning.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path \| str | Path to the downloaded archive. | required |
Warns:
| Type | Description |
|---|---|
| UserWarning | If the computed checksum differs from the expected checksum. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If `file_path` does not exist. |
| OSError | For I/O errors reading the file. |
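The warn-on-mismatch behavior described above can be sketched with the standard library. The helper names and the demo file are hypothetical; only the chunked SHA256 hashing and the `UserWarning` on mismatch follow the documented behavior:

```python
import hashlib
import tempfile
import warnings
from pathlib import Path

def sha256_of(file_path: Path) -> str:
    """Hash a file in 1 MiB chunks to avoid loading it whole into memory."""
    h = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def validate(file_path: Path, expected: str) -> None:
    # Emit a UserWarning (rather than raising) on checksum mismatch.
    if sha256_of(file_path) != expected.lower():
        warnings.warn(f"Checksum mismatch for {file_path}", UserWarning)

tmp = Path(tempfile.mkdtemp()) / "archive.bin"
tmp.write_bytes(b"demo-bytes")

validate(tmp, sha256_of(tmp))        # matching digest: silent
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validate(tmp, "0" * 64)          # wrong digest: warns
print(len(caught))
```

Warning instead of raising lets a download with an outdated published checksum still be used, at the cost of making the mismatch easy to overlook in logs.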
LargePointCloudDataset
LargePointCloudDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData] | None = None,
    transform: Callable[[FlooderData], FlooderData] | None = None,
)
Bases: FlooderDataset
Dataset of large-scale point clouds used in the Flooder paper.
This dataset contains two point clouds with more than 10M points each,
distributed as a compressed .tar.zst archive and hosted on Google Drive.
The archive is downloaded, validated via SHA256, extracted into
raw_dir/<folder_name>/, and processed into per-sample .pt files.
Each processed sample is stored as a LargePointCloudData dataclass with the following attributes:
- `x`: `torch.FloatTensor` of point coordinates
- `name`: sample identifier
- `description`: brief description of the point cloud

Expected extracted raw directory structure:

    raw/large/
        meta.yaml
        coral.pt
        virus.pt
See Also
FlooderDataset: Implements the shared download, processing, and loading pipeline.
checksum
property
Expected SHA256 checksum of the downloaded archive.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Lowercase hex-encoded SHA256 digest for the dataset archive. |
file_id
property
Google Drive file id for the LargePointCloudDataset dataset archive.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Google Drive file id used to construct the download URL. |
folder_name
property
Name of the extracted raw folder under raw_dir.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Folder name containing the extracted large dataset files. |
raw_file_names
property
Raw archive file name(s) expected in raw_dir.
Returns:
| Type | Description |
|---|---|
| list[str] | List containing the dataset archive file name. |
uncompressed_file_names
property
Uncompressed file name(s) expected in raw_dir.
Returns:
| Type | Description |
|---|---|
| list[str] | List containing the uncompressed file names. |
get
Return the data object at a given index (either 0 or 1).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Index to access. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| LargePointCloudData | LargePointCloudData | The data object at the given index. |
MCBDataset
MCBDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData] | None = None,
    transform: Callable[[FlooderData], FlooderData] | None = None,
)
Bases: FlooderDataset
MCB point-cloud dataset used in the Flooder paper.
This dataset consists of 1745 point clouds, each comprising 1 million points uniformly sampled from surface meshes from a subset of the MCB dataset (A large-scale annotated mechanical components benchmark for classification and retrieval tasks with deep neural networks, ECCV, 2020) available at https://github.com/stnoah1/mcb.
The dataset is distributed as a compressed .tar.zst archive hosted on
Google Drive. The archive is downloaded, validated using a SHA256 checksum,
extracted into raw_dir/<folder_name>/, and processed into per-sample
.pt files stored in processed_dir.
Each raw sample is stored as a .npy array containing quantized point
coordinates. During processing, coordinates are normalized by dividing
by 32767 and cast to float32.
The processed sample representation is
- `x`: `torch.FloatTensor` of normalized point coordinates
- `y`: integer class label
- `name`: sample identifier derived from the file stem

Expected extracted raw directory structure:

    raw_dir/mcb/
        meta.yaml
        splits.yaml
        *.npy
See Also
FlooderDataset: Implements the shared download, processing, and loading pipeline.
checksum
property
Expected SHA256 checksum of the downloaded archive.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Lowercase hex-encoded SHA256 digest for the dataset archive. |
file_id
property
Google Drive file id for the MCB dataset archive.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Google Drive file id used to construct the download URL. |
folder_name
property
Name of the extracted raw folder under raw_dir.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Folder name containing the extracted MCB dataset files. |
raw_file_names
property
Raw archive file name(s) expected in raw_dir.
Returns:
| Type | Description |
|---|---|
| list[str] | List containing the dataset archive file name. |
process_file
Convert a raw .npy file into a FlooderData example.
Loads the raw point cloud from file, normalizes coordinates by dividing
by 32767, casts to float32, and converts to a PyTorch tensor. The
class label is read from the dataset metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file | Path | Path to the raw `.npy` file. | required |
| ydata | dict | Parsed YAML metadata from `meta.yaml`. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| FlooderData | FlooderData | Processed example with fields `x`, `y`, and `name`. |
Raises:
| Type | Description |
|---|---|
| KeyError | If the label entry for `file` is missing from `ydata`. |
| OSError | If the `.npy` file cannot be read. |
| ValueError | If the loaded array cannot be converted to `float32`. |
ModelNet10Dataset
ModelNet10Dataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData] | None = None,
    transform: Callable[[FlooderData], FlooderData] | None = None,
)
Bases: FlooderDataset
ModelNet10 point-cloud dataset (250k points) used in the Flooder paper.
This dataset consists of 4899 point clouds, each comprising 250k points uniformly sampled from surface meshes from the ModelNet10 dataset (Wu et al., 3D ShapeNets: A Deep Representation for Volumetric Shapes, CVPR 2015) available at https://modelnet.cs.princeton.edu/.
The dataset is distributed as a compressed .tar.zst archive hosted on
Google Drive. The archive is downloaded, validated using a SHA256 checksum,
extracted into raw_dir/<folder_name>/, and processed into per-sample
.pt files stored in processed_dir.
Each raw sample is stored as a .npy array containing quantized point
coordinates. During processing, coordinates are normalized by dividing
by 32767 and cast to float32.
The processed sample representation is
- `x`: `torch.FloatTensor` of normalized point coordinates
- `y`: integer class label in `[0, 9]`
- `name`: sample identifier derived from the file stem

Expected extracted raw directory structure:

    raw_dir/modelnet10_250k/
        meta.yaml
        splits.yaml
        *.npy
See Also
FlooderDataset: Implements the shared download, processing, and loading pipeline.
RocksDataset
RocksDataset(
    root: str,
    fixed_transform: Callable[[FlooderData], FlooderData] | None = None,
    transform: Callable[[FlooderData], FlooderData] | None = None,
)
Bases: FlooderDataset
Rock voxel dataset converted to point clouds with geometric targets.
This synthetic dataset consists of 1000 3D binary voxel grids representing rock samples from two classes. The voxel grids are produced by the PoreSpy library (https://porespy.org/) with classes corresponding to the generation method, fractal noise and blobs, each with 500 samples.
The dataset is distributed as a compressed archive (rocks.tar.zst)
hosted on Google Drive. During processing, each voxel grid is converted
into a set of 3D points by extracting the coordinates of occupied voxels and adding small random jitter to break the lattice structure.
In addition to the class label, each sample includes continuous targets such as surface area and volume.
Processed sample representation
- `x`: `torch.FloatTensor` of shape `(N, 3)` containing point coordinates
- `y`: integer class label
- `surface`: float-valued surface area target
- `volume`: float-valued volume target
- `name`: sample identifier derived from the file stem

Expected extracted raw directory structure:

    raw_dir/rocks/
        meta.yaml
        splits.yaml
        *.npy
See Also
FlooderDataset: Implements the shared download, processing, and loading pipeline.
process_file
Convert a raw voxel .npy file into a FlooderRocksData example.
Processing steps
1) Load the bit-packed voxel array from file.
2) Unpack bits into a boolean array of shape (256, 256, 256).
3) Extract the indices of occupied voxels using np.where.
4) Convert voxel indices to float coordinates and add small random
jitter to avoid degenerate lattice structure.
5) Attach label and continuous targets from metadata.
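Steps 1-4 above can be sketched with NumPy on a small 8x8x8 grid instead of the real 256^3 volume; the bit-packed array below stands in for the contents of a raw `.npy` file:

```python
import numpy as np

rng = np.random.default_rng(0)
side = 8  # the real dataset uses 256

# Stand-in for the stored data: a boolean occupancy grid, bit-packed
# the way the raw .npy files store it.
voxels = rng.random((side, side, side)) < 0.3
packed = np.packbits(voxels)

# 1-2) load and unpack bits back into a boolean grid
grid = np.unpackbits(packed)[: side**3].reshape(side, side, side).astype(bool)

# 3) indices of occupied voxels -> one (N, 3) row per occupied voxel
coords = np.stack(np.where(grid), axis=1).astype(np.float32)

# 4) small random jitter to break the integer lattice structure
points = coords + rng.uniform(-0.5, 0.5, size=coords.shape).astype(np.float32)

print(points.shape, points.dtype)
```

Without step 4, all coordinates would be integers, and degenerate collinear/coplanar configurations on the lattice could cause numerical trouble downstream.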
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file | Path | Path to the raw `.npy` voxel file. | required |
| ydata | dict | Parsed YAML metadata from `meta.yaml`. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| FlooderRocksData | FlooderRocksData | Processed example with fields `x`, `y`, `surface`, `volume`, and `name`. |
Raises:
| Type | Description |
|---|---|
| KeyError | If required metadata entries are missing. |
| ValueError | If the unpacked voxel array cannot be reshaped to `(256, 256, 256)`. |
| OSError | If the `.npy` file cannot be read. |
SwisscheeseDataset
SwisscheeseDataset(
    root: str,
    ks: list[int] = [10, 20],
    num_per_class: int = 500,
    num_points: int = 1000000,
    fixed_transform: Callable[[FlooderData], FlooderData] | None = None,
    transform: Callable[[FlooderData], FlooderData] | None = None,
)
Bases: FlooderDataset
Synthetic "Swiss cheese" point-cloud dataset used in the Flooder paper.
This dataset is generated procedurally (no download). Each sample consists of points uniformly sampled from a 3D axis-aligned box with multiple spherical voids removed ("Swiss cheese"). The number of voids defines the class label.
Unlike the other FlooderDataset subclasses that download a compressed
archive, this dataset overrides process() to generate samples and write
them directly to processed_dir as .pt files. Split definitions are also
generated and saved to processed_dir/splits.yaml.
Class semantics
- Each class corresponds to a value `k` in `ks`, where `k` is the number of spherical voids carved out of the sampling volume.
- Label `y` is the integer index into `ks` (i.e., `ki` from enumeration).
Generated file naming
Each sample is saved under a short SHA256-derived identifier computed from the generated point array bytes. This provides deterministic naming for a fixed RNG seed and generation implementation, but note that changes to sampling code, dtype, or ordering can change the resulting hash.
Splits
process() generates 10 random splits (keys 0..9), each containing
trn, val, and tst partitions with proportions 72% / 8% / 20%,
respectively, over the full dataset indices.
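The split generation described above can be sketched as follows: draw 10 permutations of the full index range with the fixed seed mentioned in the Notes, then cut each into 72% / 8% / 20% parts. The helper name is hypothetical:

```python
import numpy as np

def make_splits(n: int, num_splits: int = 10, seed: int = 42) -> dict:
    """10 random trn/val/tst splits (72%/8%/20%) over n indices."""
    rng = np.random.RandomState(seed)  # fixed seed, as in the Notes
    splits = {}
    for s in range(num_splits):
        perm = rng.permutation(n)
        n_trn = int(0.72 * n)
        n_val = int(0.08 * n)
        splits[s] = {
            "trn": perm[:n_trn].tolist(),
            "val": perm[n_trn : n_trn + n_val].tolist(),
            "tst": perm[n_trn + n_val :].tolist(),
        }
    return splits

# e.g. len(ks) * num_per_class = 2 * 500 samples with the defaults
splits = make_splits(1000)
print(len(splits), len(splits[0]["trn"]), len(splits[0]["val"]), len(splits[0]["tst"]))
```

Because the seed is fixed, every instantiation regenerates identical split indices, which keeps experiments comparable across runs.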
Notes
- Because generation is performed inside `process()`, instantiation may be compute- and storage-intensive, depending on `num_points` and dataset size.
- This class sets a fixed RNG seed (`np.random.RandomState(42)`) for split generation. The point generation itself depends on the behavior of `generate_swiss_cheese_points` (and any randomness inside it).
Initialize the Swiss cheese dataset generator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str | Root directory where the dataset is stored. | required |
| ks | list[int] | List of void counts, one per class. Each `k` in `ks` defines one class. | [10, 20] |
| num_per_class | int | Number of samples generated for each class. Total dataset size is `len(ks) * num_per_class`. | 500 |
| num_points | int | Number of points generated per sample point cloud. | 1000000 |
| fixed_transform | Callable[[FlooderData], FlooderData] \| None | Optional transform applied once per example during loading. | None |
| transform | Callable[[FlooderData], FlooderData] \| None | Optional transform applied on-the-fly in `__getitem__`. | None |
Notes
- Split generation uses a fixed seed (`42`) via `np.random.RandomState`.
- Generation and serialization are performed during `process()`, which is invoked during `FlooderDataset` construction if processed artifacts are missing.
folder_name
property
Name of the raw folder under raw_dir.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Folder name. Included for API compatibility; this dataset does not use extracted raw archives. |
raw_file_names
property
Raw-file requirements for download skipping.
This dataset is generated locally and does not require downloaded raw files, so this returns an empty list.
Returns:
| Type | Description |
|---|---|
| list[str] | Empty list. |
process
Generate synthetic samples and write processed artifacts to disk.
This method generates
- `processed_dir/splits.yaml`: a dict of 10 random splits with keys `0..9`.
- One `.pt` file per generated sample containing a `FlooderData` object.
- `processed_dir/_done`: a sentinel file indicating processing completion.
Sample generation details
- Points are generated inside an axis-aligned box with corners `rect_min = [0, 0, 0]` and `rect_max = [5, 5, 5]`.
- For each class `k` in `ks`, the generator creates `num_per_class` samples using `generate_swiss_cheese_points(num_points, ..., k, ...)`.
- Each sample is labeled with `y = ki`, where `ki` is the index of `k` in `ks`.
Raises:
| Type | Description |
|---|---|
| OSError | If the processed directory cannot be written. |
| RuntimeError | If sample generation fails. |
| Exception | Propagates exceptions from `generate_swiss_cheese_points`. |