# Data Pipeline

> **The Problem:** HuggingFace `datasets` doesn't natively support NIfTI/BIDS neuroimaging formats.
> **The Solution:** `neuroimaging-go-brrrr` extends `datasets` with a `Nifti()` feature type.

---

## What is neuroimaging-go-brrrr?

```text
┌────────────────────────────────────────────────────────────────────────────┐
│             neuroimaging-go-brrrr EXTENDS HUGGINGFACE DATASETS             │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  pip install datasets              pip install neuroimaging-go-brrrr       │
│  ────────────────────              ─────────────────────────────────       │
│  Standard HuggingFace              EXTENDS datasets with:                  │
│  • Images, text, audio             • Nifti() feature type for .nii.gz      │
│  • Parquet/Arrow storage           • BIDS directory parsing                │
│  • Hub integration                 • Upload utilities (BIDS→Hub)           │
│                                    • Validation utilities                  │
│                                    • Bug workarounds for upstream issues   │
│                                                                            │
│  When you install neuroimaging-go-brrrr, you get:                          │
│  • A patched datasets library with Nifti() support (pinned git commit)     │
│  • bids_hub module for upload/validation                                   │
│  • All upstream bug workarounds in one place                               │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
```

**Key insight:** `neuroimaging-go-brrrr` pins `datasets` to a specific git commit that includes `Nifti()` support:

```toml
# From neuroimaging-go-brrrr/pyproject.toml
[tool.uv.sources]
datasets = { git = "https://github.com/huggingface/datasets.git", rev = "004a5bf4..." }
```
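
Because the pin lives in `pyproject.toml`, it can be useful to confirm at runtime that the environment really contains the git-installed `datasets` rather than a release from an index. The sketch below is one way to do that with the standard library only. It assumes the installer recorded PEP 610 metadata (`direct_url.json`) for the VCS install; the helper names `commit_from_direct_url` and `installed_git_commit` are hypothetical and are not part of `neuroimaging-go-brrrr`. Both return `None` when no commit is recorded, so a caller can compare the result against the pinned revision prefix and fail early on a mismatch.

```python
import json
from importlib import metadata
from typing import Optional


def commit_from_direct_url(direct_url_json: str) -> Optional[str]:
    """Extract the VCS commit id from a PEP 610 direct_url.json payload."""
    info = json.loads(direct_url_json)
    return info.get("vcs_info", {}).get("commit_id")


def installed_git_commit(dist_name: str) -> Optional[str]:
    """Return the git commit a distribution was installed from, if recorded."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None  # package is not installed at all
    raw = dist.read_text("direct_url.json")  # None for regular index installs
    return commit_from_direct_url(raw) if raw else None


if __name__ == "__main__":
    # Compare this against the revision pinned in pyproject.toml.
    print("datasets installed from git commit:", installed_git_commit("datasets"))
```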

---

## The Two Pipelines

### Pipeline 1: UPLOAD (How Data Gets to HuggingFace)

```text
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────────┐
│  Local BIDS     │     │  neuroimaging-go-    │     │  HuggingFace Hub    │
│  Directory      │ ──► │  brrrr (bids_hub)    │ ──► │  hugging-science/   │
│  (Zenodo)       │     │                      │     │  isles24-stroke     │
└─────────────────┘     │  • build_isles24_    │     └─────────────────────┘
                        │    file_table()      │
                        │  • Nifti() features  │
                        │  • push_to_hub()     │
                        └──────────────────────┘
```

### Pipeline 2: CONSUMPTION (How This Demo Loads Data)

**THE CORRECT PATTERN:**

```python
from datasets import load_dataset

# neuroimaging-go-brrrr provides the patched datasets with Nifti() support
ds = load_dataset("hugging-science/isles24-stroke", split="train")

# Access data - Nifti() returns nibabel.Nifti1Image objects
example = ds[0]
dwi = example["dwi"]  # nibabel.Nifti1Image (NOT numpy array)
adc = example["adc"]  # nibabel.Nifti1Image
lesion_mask = example["lesion_mask"]  # nibabel.Nifti1Image

# To get numpy array: dwi.get_fdata()
# To save to file: dwi.to_filename("dwi.nii.gz")
```

This is the **intended consumption pattern**. It should just work because:

1. `neuroimaging-go-brrrr` provides the patched `datasets` with `Nifti()` support
2. The dataset was uploaded with `Nifti()` features
3. `Nifti(decode=True)` returns nibabel images with affine/header preserved
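
If downstream tooling expects files on disk rather than in-memory images, the decoded objects can be written back out with nibabel's `to_filename()`. The following is a minimal sketch under stated assumptions: the helper `export_case` and the `MODALITIES` tuple are hypothetical names, only the three columns shown in the example above are handled, and a missing modality is assumed to decode to `None`. Taking the example as a plain mapping keeps the helper independent of how the row was loaded.

```python
from pathlib import Path
from typing import Any, Dict, Mapping

# Column names taken from the consumption example; other modalities are omitted.
MODALITIES = ("dwi", "adc", "lesion_mask")


def export_case(example: Mapping[str, Any], out_dir: Path) -> Dict[str, Path]:
    """Write each decoded NIfTI image in `example` to `out_dir` as .nii.gz."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: Dict[str, Path] = {}
    for name in MODALITIES:
        image = example.get(name)
        if image is None:  # assumption: an absent modality decodes to None
            continue
        path = out_dir / f"{name}.nii.gz"
        image.to_filename(str(path))  # nibabel writes data, affine, and header
        written[name] = path
    return written


# Usage (requires the dataset to be loaded first):
#   paths = export_case(ds[0], Path("cases/case_000"))
```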

---

## Current State: REFACTOR NEEDED

**Problem:** `stroke-deepisles-demo` currently has a hand-rolled workaround in `data/adapter.py` that bypasses `datasets.load_dataset()`. The workaround uses `HfFileSystem` and `pyarrow` directly to download individual Parquet files.

**Why this is wrong:**

1. It duplicates bug workarounds that should live in `neuroimaging-go-brrrr`
2. It doesn't use the `Nifti()` feature type properly
3. It is harder to maintain, because fixes need to happen in multiple places

**The fix:**

1. Delete the custom `HuggingFaceDataset` adapter in `data/adapter.py`
2. Use the standard `datasets.load_dataset()` consumption pattern
3. If there are bugs, fix them in `neuroimaging-go-brrrr`, not locally
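
The steps above could reduce the data layer to a thin wrapper around `load_dataset()`. This is only a sketch of what such a replacement might look like, not the repo's actual code: the function name `load_isles24`, the `DATASET_ID` constant, and the injectable `loader` parameter are all assumptions made for illustration. The `loader` argument exists so the wrapper can be exercised in unit tests without network access; in normal use it is left as `None` and the real `datasets.load_dataset` is imported lazily.

```python
from typing import Any, Callable, Optional

DATASET_ID = "hugging-science/isles24-stroke"


def load_isles24(split: str = "train",
                 loader: Optional[Callable[..., Any]] = None) -> Any:
    """Load ISLES24 through the standard `datasets` entry point."""
    if loader is None:
        # Lazy import: this module stays importable without `datasets` installed.
        from datasets import load_dataset as loader
    return loader(DATASET_ID, split=split)
```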

---

## Dependency Relationship

```text
stroke-deepisles-demo (this repo)
│
└── neuroimaging-go-brrrr @ v0.2.1
    │
    ├── datasets @ git commit 004a5bf4... (patched with Nifti())
    ├── huggingface-hub
    └── bids_hub module (upload + validation utilities)
```

**The consumption should flow through the standard pattern:**

```text
stroke-deepisles-demo
        │
        │  from datasets import load_dataset
        │  ds = load_dataset("hugging-science/isles24-stroke")
        ▼
neuroimaging-go-brrrr (provides patched datasets)
        │
        │  Nifti() feature type handles lazy loading
        ▼
HuggingFace Hub (isles24-stroke dataset)
```

---

## Dataset Info

| Property | Value |
|----------|-------|
| Dataset ID | `hugging-science/isles24-stroke` |
| Subjects | 149 |
| Modalities | DWI, ADC, Lesion Mask, NCCT, CTA, CTP, Perfusion Maps |
| Source | [Zenodo 17652035](https://zenodo.org/records/17652035) |

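
The figures in this table can double as a cheap sanity check after loading. The sketch below is illustrative only: `check_dataset` and `REQUIRED_COLUMNS` are hypothetical names, the column list covers just the three columns used in the consumption example, and it assumes the loaded split holds one row per subject, which the source does not confirm. It takes plain values so it can be called as `check_dataset(len(ds), ds.column_names)` and raises `ValueError` on any mismatch.

```python
from typing import Iterable

EXPECTED_SUBJECTS = 149  # from the Dataset Info table
REQUIRED_COLUMNS = ("dwi", "adc", "lesion_mask")  # columns used by this demo


def check_dataset(n_rows: int, column_names: Iterable[str],
                  expected_rows: int = EXPECTED_SUBJECTS) -> None:
    """Raise ValueError if the row count or required columns look wrong."""
    # Assumption: one row per subject in the loaded split.
    if n_rows != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {n_rows}")
    missing = [c for c in REQUIRED_COLUMNS if c not in set(column_names)]
    if missing:
        raise ValueError(f"missing columns: {missing}")
```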

---

## What bids_hub Provides

```text
┌────────────────────────────────────────────────────────────────────────────┐
│                      neuroimaging-go-brrrr (bids_hub)                      │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  FOR UPLOADING:                        FOR CONSUMING:                      │
│  ──────────────                        ──────────────                      │
│  build_isles24_file_table()            Patched datasets with Nifti()       │
│  get_isles24_features()                └── Use standard load_dataset()     │
│  push_dataset_to_hub()                                                     │
│                                        validate_isles24_download()         │
│  We DON'T use these in this demo.      └── ISLES24_EXPECTED_COUNTS         │
│  Dataset already uploaded.             └── Can use for sanity checking     │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
```

---

## Related Documentation

- [neuroimaging-go-brrrr](https://github.com/The-Obstacle-Is-The-Way/neuroimaging-go-brrrr)
- [isles24-stroke dataset card](https://huggingface.co/datasets/hugging-science/isles24-stroke)

---

## TODO: Refactor Data Loading

The current hand-rolled adapter in `data/adapter.py` should be replaced with the standard `datasets.load_dataset()` consumption pattern. This refactor should:

1. Remove the `HuggingFaceDataset` class from `data/adapter.py`
2. Update `data/loader.py` to use `datasets.load_dataset()`
3. Remove the pre-computed constants in `data/constants.py` (no longer needed)
4. Test that `Nifti()` lazy loading works correctly
5. Report or fix any bugs found in `neuroimaging-go-brrrr`, not in this repo