Predictor of MICs of any bacterial species

Updated: Fri 31 Oct 03:24:57 GMT 2025

Trained on the human-curated SPARK dataset (77408 rows in total, HF dataset).

Model details

This model was trained using our DuvidNN framework. It was selected from a hyperparameter search as the model that performs best on unseen test data (from a scaffold split).

DuvidNN also saves the training data in this checkpoint to allow the calculation of uncertainty metrics based on that training data.

This model is the best regression model from a hyperparameter search, determined by Pearson's r on a held-out test set not seen in training or early stopping.

Model architecture

Regression


{
    "class_name": "bilinear-fp",
    "context": [
        "mic_method:hash"
    ],
    "dropout": 0.0,
    "ensemble_size": 10,
    "features": [
        [
            "full_strain_name:vectome-fingerprint"
        ]
    ],
    "learning_rate": 1e-06,
    "merge_method": "product",
    "n_hidden": 4,
    "n_units": 32,
    "residual_depth": 0
}

Model usage

You can use this model with:

from duvida.autoclasses import AutoModelBox
modelbox = AutoModelBox.from_pretrained("hf://scbirlab/spark-dv-2510")
modelbox.predict(data=..., inputs=[...], columns=[...])  # make predictions on your own data

Training details

Dataset: SPARK (77408 rows in total)
Input column: smiles
Output column: pmic
Split type: Murcko scaffold
Split proportions:
- 70% training (54186 rows)
- 15% validation (for early stopping) (11610 rows)
- 15% test (for selecting hyperparameters) (11612 rows)

Here is the training log:

And these are the evaluation scores.

Train (54186 rows):


{
    "pearson_r": 0.848129490790416,
    "rmse": 0.6399003267288208,
    "spearman_rho": 0.8231875255873609
}

Validation (11610 rows):


{
    "pearson_r": 0.6856455233976995,
    "rmse": 0.5085195899009705,
    "spearman_rho": 0.695300235716624
}

Test (11612 rows):


{
    "pearson_r": 0.6732573419039387,
    "rmse": 0.6368110179901123,
    "spearman_rho": 0.5967009278221924
}
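
For readers who want to compute the same three metrics on their own predictions, the following is a minimal sketch in plain Python. It assumes `observed` and `predicted` are equal-length lists of pMIC values; the function names and the example numbers are illustrative only, and this is not the code DuvidNN uses to produce the scores above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def evaluate(observed, predicted):
    """Return the metrics under the same keys as the score blocks above."""
    return {
        "pearson_r": pearson_r(observed, predicted),
        "rmse": rmse(observed, predicted),
        "spearman_rho": spearman_rho(observed, predicted),
    }


# Made-up values, for illustration only.
observed = [4.0, 5.0, 6.0, 7.0]
predicted = [4.2, 4.9, 6.5, 6.8]
scores = evaluate(observed, predicted)
```

Pearson's r and RMSE answer different questions: the first measures linear agreement regardless of offset, the second measures absolute error in pMIC units, so the two need not move together across splits.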

Training data details

The training data were collated by the authors of:

Joe Thomas, Marc Navre, Aileen Rubio, and Allan Coukell. Shared Platform for Antibiotic Research and Knowledge: A Collaborative Tool to SPARK Antibiotic Discovery. ACS Infectious Diseases 2018, 4 (11), 1536-1539. DOI: 10.1021/acsinfecdis.8b00193

We cleaned the original SPARK dataset to subset the most relevant columns, remove empty values, and give succinct column titles.

Dataset Sources

Repository: https://www.collaborativedrug.com/spark-data-downloads
Paper: https://doi.org/10.1021/acsinfecdis.8b00193

Data Collection and Processing

Data were processed using schemist, a tool for processing chemical datasets.

The SMILES strings have been canonicalized, and split into training (70%), validation (15%), and test (15%) sets by Murcko scaffold for each species with more than 1000 entries. Additional features like molecular weight and topological polar surface area have also been calculated.
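
The idea behind a scaffold split can be sketched as follows: rows that share a Murcko scaffold are kept together, so no scaffold appears in more than one split. This minimal example assumes the scaffold strings have already been computed elsewhere (for instance with a cheminformatics toolkit), and it uses a greedy largest-group-first assignment, which is a common convention and not necessarily what schemist does. The function name `scaffold_split` and the example data are hypothetical, and the per-species handling described above is not shown.

```python
from collections import defaultdict


def scaffold_split(scaffolds, fractions=(0.70, 0.15, 0.15)):
    """Assign row indices to train/validation/test by scaffold group.

    `scaffolds` holds one precomputed scaffold string per row (assumed input).
    Every row sharing a scaffold lands in the same split.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    targets = [f * len(scaffolds) for f in fractions]
    splits = [[], [], []]
    for _, members in ordered:
        # Place the whole group in the split furthest below its target size.
        deficits = [t - len(s) for t, s in zip(targets, splits)]
        splits[deficits.index(max(deficits))].extend(members)

    return dict(zip(("train", "validation", "test"), splits))


# Made-up scaffold labels for 20 rows, for illustration only.
example_scaffolds = (
    ["c1ccccc1"] * 8 + ["c1ccncc1"] * 5 + ["C1CCCCC1"] * 4 + ["c1ccoc1"] * 3
)
splits = scaffold_split(example_scaffolds)
```

Because whole groups are assigned at once, the realized proportions only approximate the 70/15/15 targets, and the approximation improves as the number of distinct scaffolds grows.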

Who are the source data producers?

Joe Thomas, Marc Navre, Aileen Rubio, and Allan Coukell
