Check out the pre-print on arXiv.
We provide different fine-tuned checkpoints as a collection on HuggingFace.
We additionally release the SynthCheX-75K dataset, comprising over 75K high-quality synthetic radiographs generated using the best-performing model from our benchmark. The dataset is available for use on HuggingFace.
More details on the dataset are provided here.
The benchmark is available on GitHub.
git clone https://github.com/Raman1121/CheXGenBench.git
conda create -n myenv python=3.10
conda activate myenv
pip install -r requirements.txt
The benchmark currently supports SD V1.x, SD V2.x, SD V3.5, Pixart Sigma, RadEdit, Sana (0.6B), Lumina 2.0, Flux.1-Dev, LLM-CXR.
To add a new model to the benchmark, follow these (easy) steps. Note: We assume that training of your T2I model is conducted separately from the benchmark.
To generate images for calculating quantitative metrics (FID, KID, etc.), define a new function in the tools/generate_data_common.py file that handles the checkpoint-loading logic for the new model.
Add a new if statement in the load_pipeline function that calls this function.
Add the generation parameters (num_inference_steps, guidance_scale, etc.) to the PIPELINE_CONSTANTS dictionary.
A minimal sketch of these three steps is shown below.
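The sketch below illustrates the three steps for a hypothetical model called "my_model". The file, function, and dictionary names (tools/generate_data_common.py, load_pipeline, PIPELINE_CONSTANTS) come from the benchmark, but the exact signatures and dictionary schema shown here are assumptions; check the existing entries in the file for the precise format.

import torch
from diffusers import StableDiffusionPipeline

# Step 1: a loader that handles checkpoint loading for the new model.
def load_my_model_pipeline(checkpoint_path, device="cuda"):
    pipe = StableDiffusionPipeline.from_pretrained(checkpoint_path, torch_dtype=torch.float16)
    return pipe.to(device)

# Step 2: inside load_pipeline, dispatch to the new loader, e.g.
# if model_name == "my_model":
#     pipeline = load_my_model_pipeline(checkpoint_path, device)

# Step 3: register the generation parameters (values here are placeholders).
PIPELINE_CONSTANTS = {
    "my_model": {
        "num_inference_steps": 50,
        "guidance_scale": 4.0,
    },
}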
To evaluate T2I models, the first step is to generate synthetic images using a fixed set of prompts. Follow these steps to produce the images used for evaluation.
Unzip the annotations in the MIMIC_Splits/ folder:

cd MIMIC_Splits/
unzip llavarad_annotations.zip

This produces two files: MIMIC_Splits/LLAVARAD_ANNOTATIONS_TRAIN.csv and MIMIC_Splits/LLAVARAD_ANNOTATIONS_TEST.csv.
Use the prompts from the MIMIC_Splits/LLAVARAD_ANNOTATIONS_TEST.csv file to generate images for evaluation. Follow the steps in the previous section to use the tools/generate_data_common.py script to generate images.
While generating the images, save a CSV file (e.g., prompt_INFO.csv) with the following columns:
'prompt': Contains the text prompt used for generation.
'img_savename': Contains the filename (or path) of the saved synthetic image.
Place this CSV (prompt_INFO.csv) in the assets/CSV directory.
Place the synthetic images in the assets/synthetic_images directory.
A short sketch of producing such a CSV is shown below.
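The following is a minimal sketch of writing prompt_INFO.csv while generating images. The column name used to read prompts from LLAVARAD_ANNOTATIONS_TEST.csv and the per-image file-naming scheme are assumptions; only the output columns 'prompt' and 'img_savename' are required by the benchmark.

import pandas as pd

test_df = pd.read_csv("MIMIC_Splits/LLAVARAD_ANNOTATIONS_TEST.csv")

records = []
for idx, prompt in enumerate(test_df["prompt"]):  # assumed prompt column name
    img_savename = f"synthetic_{idx:06d}.png"
    # image = pipeline(prompt).images[0]                      # your T2I pipeline
    # image.save(f"assets/synthetic_images/{img_savename}")
    records.append({"prompt": prompt, "img_savename": img_savename})

pd.DataFrame(records).to_csv("assets/CSV/prompt_INFO.csv", index=False)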
This section provides instructions on how to use the benchmark to evaluate your Text-to-Image model’s synthetic data.
The quantitative analysis assesses the synthetic data at two distinct levels to provide a granular understanding of its quality:
Overall Analysis: This level calculates metrics across the entire test dataset, consisting of all pathologies present in the MIMIC dataset. It provides a general indication of the synthetic data’s overall quality.
cd Benchmarking-Synthetic-Data
./scripts/image_quality_metrics.sh
Important Note: Calculating metrics like FID and KID can be computationally intensive and may lead to “Out of Memory” (OOM) errors, especially with large datasets (e.g., if using V100 GPUs or older). If you encounter this issue, you can use the memory-saving version of the script:
cd Benchmarking-Synthetic-Data
./scripts/image_quality_metrics_memory_saving.sh
The results will be stored in Results/image_generation_metrics.csv.
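As a quick sanity check outside the provided scripts, a distribution-level metric such as FID can be computed with torchmetrics roughly as follows. This is a minimal sketch, not the benchmark's implementation; it assumes torchmetrics (with torch-fidelity) is installed, and a meaningful FID estimate needs far more images than shown here.

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-feature statistics of real vs. synthetic images.
fid = FrechetInceptionDistance(feature=2048)

# Stand-in uint8 batches of shape (N, 3, H, W); replace with real MIMIC test
# images and the generated images from assets/synthetic_images.
real_images = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)
synthetic_images = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(synthetic_images, real=False)
print(f"FID: {fid.compute().item():.2f}")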
Image-Text Alignment
We calculate the alignment between a synthetic image and a prompt using the BioViL-T model. Using this model requires setting up a separate environment due to different dependencies.
Create a new conda environment (we call it himl), then activate it and install the dependencies:

conda activate himl
pip install -r requirements_himl.txt
Once the environment is set up, run the following command:
./scripts/img_text_alignment.sh
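For reference, image-text similarity with BioViL-T can also be computed directly with the hi-ml-multimodal package along the lines of the sketch below. This is an illustration rather than the benchmark's script, and the module paths, factory functions, and enum names are assumptions that may differ between versions of the package.

from pathlib import Path
from health_multimodal.text import get_bert_inference
from health_multimodal.text.utils import BertEncoderType
from health_multimodal.image import get_image_inference
from health_multimodal.image.utils import ImageModelType
from health_multimodal.vlp import ImageTextInferenceEngine

# Build the BioViL-T image and text encoders (names assumed from the
# hi-ml-multimodal examples; adjust to your installed version).
text_inference = get_bert_inference(BertEncoderType.BIOVIL_T_BERT)
image_inference = get_image_inference(ImageModelType.BIOVIL_T)
engine = ImageTextInferenceEngine(
    image_inference_engine=image_inference,
    text_inference_engine=text_inference,
)

# Alignment score between one synthetic image and its prompt.
score = engine.get_similarity_score_from_raw_data(
    image_path=Path("assets/synthetic_images/synthetic_000000.png"),
    query_text="There is a small left pleural effusion.",
)
print(f"image-text alignment: {score:.3f}")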
Conditional Analysis: This level calculates each metric separately for each individual pathology present in the dataset. This allows for a detailed assessment of how well the T2I model generates synthetic data for specific medical conditions.
cd Benchmarking-Synthetic-Data
./scripts/image_quality_metrics_conditional.sh
The results will be stored in Results/conditional_image_generation_metrics.csv.
Note: You can pass additional run information through the EXTRA_INFO argument when running the scripts (refer to the example scripts for specific usage).
First, download the Patient Re-Identification Model from HERE and place it in the assets/checkpoints/ folder. The name of the checkpoint is ResNet-50_epoch11_data_handling_RPN.pth.
Set the appropriate paths and constants in the scripts/privacy_metrics.sh
file.
Run the following script to calculate privacy and patient re-identification metrics.
cd Benchmarking-Synthetic-Data
./scripts/privacy_metrics.sh
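Conceptually, the re-identification check treats the downloaded model as a feature extractor and asks whether any synthetic image lies unusually close to a real training image in embedding space. The sketch below is illustrative only: it assumes the checkpoint roughly matches a torchvision ResNet-50 backbone, which may not hold, and it is not the procedure implemented in scripts/privacy_metrics.sh.

import torch
import torch.nn.functional as F
from torchvision import models

# Use a ResNet-50 backbone as a feature extractor (assumption: the released
# checkpoint keys roughly align with torchvision's ResNet-50).
encoder = models.resnet50(weights=None)
encoder.fc = torch.nn.Identity()
state = torch.load("assets/checkpoints/ResNet-50_epoch11_data_handling_RPN.pth", map_location="cpu")
encoder.load_state_dict(state, strict=False)
encoder.eval()

@torch.no_grad()
def embed(batch):
    # batch: (N, 3, H, W) float tensor, preprocessed like the training data
    return F.normalize(encoder(batch), dim=-1)

# Stand-in batches; replace with preprocessed real training images and
# synthetic images generated from the same prompts.
real_images = torch.rand(8, 3, 224, 224)
synthetic_images = torch.rand(8, 3, 224, 224)

similarity = embed(synthetic_images) @ embed(real_images).T   # cosine similarities
leak_scores = similarity.max(dim=1).values                    # nearest real neighbour per synthetic image
print(leak_scores)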
For image classification, we used 20,000 samples from the MIMIC Dataset for training. To evaluate, you first need to generate synthetic samples using the same 20,000 prompts with your T2I Model.
cd MIMIC_Splits/Downstream_Classification_Files
unzip training_data_20K.zip
Use the tools/generate_data_common.py file to generate synthetic images.
Set the path to the generated synthetic images (SYNTHETIC_IMAGES) and the other constants in the scripts/run_training_inference.sh file, then run it (an illustrative sketch of the downstream task follows):

cd Downstream/Classification
./scripts/run_training_inference.sh
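To make the utility evaluation concrete, the sketch below shows the general idea: fine-tune a multi-label pathology classifier on the synthetic images and evaluate it on real data. It is a simplified stand-in, not the pipeline in Downstream/Classification; the backbone, the 14-class label set, and the preprocessing are assumptions.

import torch
import torch.nn as nn
from torchvision import models

NUM_PATHOLOGIES = 14  # assumption: a CheXpert-style multi-label set

# DenseNet-121 is a common chest X-ray baseline; weights=None keeps the sketch
# offline-friendly (in practice you would start from pretrained weights).
model = models.densenet121(weights=None)
model.classifier = nn.Linear(model.classifier.in_features, NUM_PATHOLOGIES)

criterion = nn.BCEWithLogitsLoss()  # multi-label objective
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stand-in batch; in practice this comes from a DataLoader over the synthetic
# images generated from the 20,000 training prompts.
images = torch.rand(4, 3, 224, 224)
labels = torch.randint(0, 2, (4, NUM_PATHOLOGIES)).float()

model.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.4f}")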
To fine-tune LLaVA-Rad, the first step is to create a new environment following the steps in the official LLaVA-Rad repository.
If you find our work useful, please consider citing:
@article{dutt2025chexgenbench,
title={CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs},
author={Dutt, Raman and Sanchez, Pedro and Yao, Yongchen and McDonagh, Steven and Tsaftaris, Sotirios A and Hospedales, Timothy},
journal={arXiv preprint arXiv:2505.10496},
year={2025}
}
For questions, please send your queries to raman.dutt@ed.ac.uk.