# Data Pipeline
> **The Problem:** HuggingFace `datasets` doesn't natively support NIfTI/BIDS neuroimaging formats.
> **The Solution:** `neuroimaging-go-brrrr` extends `datasets` with `Nifti()` feature type.
---
## What is neuroimaging-go-brrrr?
```text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            neuroimaging-go-brrrr EXTENDS HUGGINGFACE DATASETS            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                          β”‚
β”‚  pip install datasets          pip install neuroimaging-go-brrrr         β”‚
β”‚  β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€          β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€         β”‚
β”‚  Standard HuggingFace          EXTENDS datasets with:                    β”‚
β”‚  β€’ Images, text, audio         β€’ Nifti() feature type for .nii.gz        β”‚
β”‚  β€’ Parquet/Arrow storage       β€’ BIDS directory parsing                  β”‚
│  ‒ Hub integration             ‒ Upload utilities (BIDS→Hub)             │
β”‚                                β€’ Validation utilities                    β”‚
β”‚                                β€’ Bug workarounds for upstream issues     β”‚
β”‚                                                                          β”‚
β”‚  When you install neuroimaging-go-brrrr, you get:                        β”‚
β”‚  β€’ A patched datasets library with Nifti() support (pinned git commit)   β”‚
β”‚  β€’ bids_hub module for upload/validation                                 β”‚
β”‚  β€’ All upstream bug workarounds in one place                             β”‚
β”‚                                                                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
**Key insight:** `neuroimaging-go-brrrr` pins to a specific commit of `datasets` that includes `Nifti()` support:
```toml
# From neuroimaging-go-brrrr/pyproject.toml
[tool.uv.sources]
datasets = { git = "https://github.com/huggingface/datasets.git", rev = "004a5bf4..." }
```
---
## The Two Pipelines
### Pipeline 1: UPLOAD (How Data Gets to HuggingFace)
```text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Local BIDS     β”‚       β”‚  neuroimaging-go-    β”‚       β”‚  HuggingFace Hub    β”‚
│  Directory      │  ──►  │  brrrr (bids_hub)    │  ──►  │  hugging-science/   │
β”‚  (Zenodo)       β”‚       β”‚                      β”‚       β”‚  isles24-stroke     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚  β€’ build_isles24_    β”‚       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚    file_table()      β”‚
                          β”‚  β€’ Nifti() features  β”‚
                          β”‚  β€’ push_to_hub()     β”‚
                          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
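In code, the upload flow's first step can be sketched as below. This is a simplified, hypothetical stand-in for `bids_hub`'s `build_isles24_file_table()` (whose real signature isn't shown in this document): walk a BIDS tree and group NIfTI paths by subject and modality suffix.

```python
import re
from pathlib import Path

def build_file_table(bids_root):
    """Hypothetical sketch of the file-table step: map each subject to
    its NIfTI file paths, keyed by modality suffix (dwi, adc, ...)."""
    table = {}
    for nii in sorted(Path(bids_root).rglob("*.nii.gz")):
        # BIDS filenames look like sub-0001[_ses-01]_<modality>.nii.gz
        m = re.match(r"(sub-[^_]+).*_([\w-]+)\.nii\.gz$", nii.name)
        if m:
            subject, modality = m.groups()
            table.setdefault(subject, {})[modality] = str(nii)
    return table

# The real pipeline would then attach Nifti() features and push, roughly:
#   ds = Dataset.from_list(rows, features=get_isles24_features())
#   ds.push_to_hub("hugging-science/isles24-stroke")
```

The trailing comment names `get_isles24_features()` from the diagram above; its exact usage is an assumption.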
### Pipeline 2: CONSUMPTION (How This Demo Loads Data)
**THE CORRECT PATTERN:**
```python
from datasets import load_dataset
# neuroimaging-go-brrrr provides the patched datasets with Nifti() support
ds = load_dataset("hugging-science/isles24-stroke", split="train")
# Access data - Nifti() returns nibabel.Nifti1Image objects
example = ds[0]
dwi = example["dwi"] # nibabel.Nifti1Image (NOT numpy array)
adc = example["adc"] # nibabel.Nifti1Image
lesion_mask = example["lesion_mask"] # nibabel.Nifti1Image
# To get numpy array: dwi.get_fdata()
# To save to file: dwi.to_filename("dwi.nii.gz")
```
This is the **intended consumption pattern**. It should just work because:
1. `neuroimaging-go-brrrr` provides the patched `datasets` with `Nifti()` support
2. The dataset was uploaded with `Nifti()` features
3. `Nifti(decode=True)` returns nibabel images with affine/header preserved
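Because each field is a full nibabel image, downstream code can use the data, affine, and header directly. As an illustrative sketch (the helper below is not part of either library; only `get_fdata()` and `header.get_zooms()` are standard nibabel API), lesion volume can be computed straight from the mask image:

```python
import numpy as np

def lesion_volume_ml(mask_img):
    """Lesion volume in millilitres from a binary mask image.

    `mask_img` is any nibabel-like image exposing get_fdata() and
    header.get_zooms(), as the Nifti() feature type returns."""
    data = mask_img.get_fdata()
    # Voxel volume in mm^3 from the first three zooms, converted to ml
    voxel_ml = float(np.prod(mask_img.header.get_zooms()[:3])) / 1000.0
    return float(data.sum()) * voxel_ml
```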
---
## Current State: REFACTOR NEEDED
**Problem:** stroke-deepisles-demo currently has a hand-rolled workaround in `data/adapter.py` that bypasses `datasets.load_dataset()`. This workaround uses `HfFileSystem` + `pyarrow` directly to download individual parquet files.
**Why this is wrong:**
1. It duplicates bug workarounds that belong in `neuroimaging-go-brrrr`
2. It bypasses the `Nifti()` feature type instead of using it properly
3. It is harder to maintain: fixes must land in multiple places
**The fix:**
1. Delete the custom `HuggingFaceDataset` adapter in `data/adapter.py`
2. Use standard `datasets.load_dataset()` consumption pattern
3. If there are bugs, fix them in `neuroimaging-go-brrrr`, not locally
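A minimal sketch of the replacement follows. The wrapper name and the injectable `loader` parameter are illustrative, not existing API; the injection point simply makes the wrapper unit-testable without network access.

```python
from typing import Any, Callable, Optional

def load_stroke_cases(split: str = "train",
                      loader: Optional[Callable[..., Any]] = None) -> Any:
    """Load ISLES'24 cases via the standard datasets consumption pattern."""
    if loader is None:
        # The patched datasets build installed by neuroimaging-go-brrrr
        from datasets import load_dataset
        loader = load_dataset
    return loader("hugging-science/isles24-stroke", split=split)
```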
---
## Dependency Relationship
```text
stroke-deepisles-demo (this repo)
  β”‚
  └── neuroimaging-go-brrrr @ v0.2.1
        β”‚
        β”œβ”€β”€ datasets @ git commit 004a5bf4... (patched with Nifti())
        β”‚     └── huggingface-hub
        └── bids_hub module (upload + validation utilities)
```
**The consumption should flow through the standard pattern:**
```text
stroke-deepisles-demo
  β”‚
  β”‚  from datasets import load_dataset
  β”‚  ds = load_dataset("hugging-science/isles24-stroke")
  β–Ό
neuroimaging-go-brrrr (provides patched datasets)
  β”‚
  β”‚  Nifti() feature type handles lazy loading
  β–Ό
HuggingFace Hub (isles24-stroke dataset)
```
---
## Dataset Info
| Property | Value |
|----------|-------|
| Dataset ID | `hugging-science/isles24-stroke` |
| Subjects | 149 |
| Modalities | DWI, ADC, Lesion Mask, NCCT, CTA, CTP, Perfusion Maps |
| Source | [Zenodo 17652035](https://zenodo.org/records/17652035) |
---
## What bids_hub Provides
```text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     neuroimaging-go-brrrr (bids_hub)                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                          β”‚
β”‚  FOR UPLOADING:                      FOR CONSUMING:                      β”‚
β”‚  ──────────────                      ──────────────                      β”‚
β”‚  build_isles24_file_table()          Patched datasets with Nifti()       β”‚
β”‚  get_isles24_features()              └── Use standard load_dataset()     β”‚
β”‚  push_dataset_to_hub()                                                   β”‚
β”‚  validate_isles24_download()                                             β”‚
β”‚                                                                          β”‚
β”‚  We DON'T use these in this demo.    └── ISLES24_EXPECTED_COUNTS         β”‚
β”‚  Dataset already uploaded.           └── Can use for sanity checking     β”‚
β”‚                                                                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
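As a hedged example of the sanity-checking idea, the helper below assumes `ISLES24_EXPECTED_COUNTS` maps modality names to expected file counts (the constant's actual shape may differ); it compares those counts against what a downloaded split actually contains:

```python
def check_modality_counts(dataset_rows, expected_counts):
    """Compare per-modality file counts against expected values.

    Returns {modality: (expected, seen)} for each mismatch;
    an empty dict means the download looks complete."""
    seen = {}
    for row in dataset_rows:
        for modality, value in row.items():
            if value is not None:
                seen[modality] = seen.get(modality, 0) + 1
    return {m: (c, seen.get(m, 0))
            for m, c in expected_counts.items() if seen.get(m, 0) != c}
```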
---
## Related Documentation
- [neuroimaging-go-brrrr](https://github.com/The-Obstacle-Is-The-Way/neuroimaging-go-brrrr)
- [isles24-stroke dataset card](https://huggingface.co/datasets/hugging-science/isles24-stroke)
---
## TODO: Refactor Data Loading
The current hand-rolled adapter in `data/adapter.py` should be replaced with standard `datasets.load_dataset()` consumption. This refactor should:
1. Remove `HuggingFaceDataset` class from `data/adapter.py`
2. Update `data/loader.py` to use `datasets.load_dataset()`
3. Remove pre-computed constants in `data/constants.py` (no longer needed)
4. Test that `Nifti()` lazy loading works correctly
5. If bugs are found, report/fix them in `neuroimaging-go-brrrr`