Text-to-CT Model Weights
Checkpoints for “Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining” (Molino et al., 2025).
Model Card for Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
Model Description
- Authors: Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi
- Model type: 3D latent diffusion (RFlow) + 3D VAE + CLIP3D text encoder for CT generation.
- License: Apache 2.0 (same as code release).
- Sources: Code https://github.com/cosbidev/Text2CT | Paper https://arxiv.org/abs/2506.00633
- Demo: Use `diff_model_demo.py` from the code release for a one-off generation from text.
Intended Use
- Direct use: Research/experimentation on text-conditioned 3D CT synthesis; generating synthetic data for benchmarking or augmentation.
- Downstream use: Fine-tuning or integration into broader research pipelines.
- Out of scope: Clinical decision-making, diagnostic use, or deployment without proper validation and approvals.
Risks & Limitations
- Trained on CT-RATE; may encode dataset biases and is not validated for clinical use.
- Synthetic outputs may contain artifacts; do not use for patient care.
Files included
- `autoencoder_epoch273.pt` — 3D VAE for latent compression/decoding.
- `unet_rflow_200ep.pt` — Diffusion UNet trained with rectified flow.
- `CLIP3D_Finding_Impression_30ep.pt` — CLIP3D weights for encoding reports.
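As a quick sanity check, each file can be opened directly with PyTorch before it is wired into the code release. This is a minimal sketch; the exact key layout inside each checkpoint (a bare state dict vs. a wrapping dict) is not documented here:

```python
import torch

# Load a checkpoint on CPU just to inspect it; the layout (plain state_dict
# vs. a dict that wraps one under "state_dict") depends on how it was saved.
ckpt = torch.load("autoencoder_epoch273.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} tensors, e.g. {list(state_dict)[:3]}")
```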
How to Get Started (Python)
```python
from huggingface_hub import hf_hub_download

repo_id = "yourname/text2ct-weights"  # replace with the actual repo id

autoencoder_path = hf_hub_download(repo_id, "autoencoder_epoch273.pt")
unet_path = hf_hub_download(repo_id, "unet_rflow_200ep.pt")
clip_path = hf_hub_download(repo_id, "CLIP3D_Finding_Impression_30ep.pt")

# Use these in the code release configs:
#   trained_autoencoder_path                 -> autoencoder_path
#   existing_ckpt_filepath / model_filename  -> unet_path
#   clip_weights (for report embeddings)     -> clip_path
```
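If the configs are YAML files, the downloaded paths can be patched in programmatically, reusing the `autoencoder_path`, `unet_path`, and `clip_path` variables from the block above. This is only a sketch: the config filename and the flat key layout are assumptions; adjust them to whatever structure the code release actually uses.

```python
import yaml  # pip install pyyaml

# Hypothetical config filename and flat key layout; check the code release.
config_path = "configs/generation.yaml"

with open(config_path) as f:
    cfg = yaml.safe_load(f)

cfg["trained_autoencoder_path"] = autoencoder_path  # 3D VAE weights
cfg["model_filename"] = unet_path                   # rectified-flow UNet weights
cfg["clip_weights"] = clip_path                     # CLIP3D text-encoder weights

with open(config_path, "w") as f:
    yaml.safe_dump(cfg, f)
```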
Training Data (for these weights)
- CT-RATE dataset (public on Hugging Face) for CT volumes and reports.
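If you need the training data itself, CT-RATE is hosted on the Hugging Face Hub. A minimal sketch, assuming the dataset repository id is `ibrahimhamamci/CT-RATE` (verify the id and any access terms on the Hub before downloading):

```python
from huggingface_hub import snapshot_download

# Fetch only the metadata/report CSVs first; the full CT volumes are very large.
# The repo id and the "*.csv" pattern are assumptions; verify them on the Hub.
local_dir = snapshot_download(
    repo_id="ibrahimhamamci/CT-RATE",
    repo_type="dataset",
    allow_patterns=["*.csv"],
)
print(local_dir)
```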
Training Procedure (summary)
- CLIP3D trained for vision-language alignment on CT+reports.
- VAE checkpoint from https://github.com/Project-MONAI/tutorials/tree/main/generation/maisi.
- Diffusion UNet trained with rectified flow (RFlow) in latent space, conditioned on text embeddings.
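For orientation, rectified flow trains the UNet to predict the constant velocity along the straight path between a clean latent and Gaussian noise. The sketch below is illustrative only; the actual training loop lives in the code release, and the `unet(zt, t, context=...)` call signature and latent shapes are assumptions:

```python
import torch

def rflow_loss(unet, z0, text_emb):
    """One rectified-flow training step on a batch of clean latents z0."""
    noise = torch.randn_like(z0)                    # z1 ~ N(0, I)
    t = torch.rand(z0.shape[0], device=z0.device)   # uniform timesteps in [0, 1]
    t_ = t.view(-1, *([1] * (z0.ndim - 1)))         # broadcast over C, D, H, W
    zt = (1.0 - t_) * z0 + t_ * noise               # straight-line interpolation
    target_velocity = noise - z0                    # d z_t / d t along that path
    pred_velocity = unet(zt, t, context=text_emb)   # text-conditioned prediction
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)
```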
Evaluation
- See paper for quantitative and qualitative results.
Further Information
- 1,000 generated CT scans are available at https://huggingface.co/datasets/dmolino/CT-RATE_Generated_Scans.
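A minimal sketch for pulling the released synthetic scans from that dataset repository (the repo id comes from the link above; the file formats and folder layout are not specified here, so no download filter is applied):

```python
from huggingface_hub import snapshot_download

# Downloads the full dataset snapshot of generated scans.
scans_dir = snapshot_download(
    repo_id="dmolino/CT-RATE_Generated_Scans",
    repo_type="dataset",
)
print(scans_dir)
```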
Environmental Impact
- Not reported; training used a multi-GPU setup.
Citation
If you use these weights or code, please cite the paper:
@article{molino2025text,
title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
journal={arXiv preprint arXiv:2506.00633},
year={2025}
}