19 February 2026

Good metrics won’t save lives

How to meaningfully validate AI in drug discovery

In the race to meaningfully integrate Artificial Intelligence into drug discovery, particularly in predictive toxicology, we often get seduced by high-scoring metrics. An AUC of 0.95! An F1 score of 0.88! These numbers look impressive on a training log, offering a comforting sense of rigour and progress. They suggest that a complex biological problem has been brought under control.

But let's face a sobering truth: Good metrics won't save lives.

When developing and validating AI models for predicting adverse drug reactions (ADRs), such as cardiotoxicity, hepatotoxicity, or even common off-target effects, we need to look beyond optimising statistical performance. We need to ask whether these models can reliably support real decisions about compound progression, dosing, and patient safety.

Most validation metrics are derived from training and test sets that are small, carefully curated, and confined to narrow regions of chemical space. As a result, strong performance often reflects familiarity with the dataset rather than robustness to novelty. These benchmarks rarely capture the biological and chemical uncertainty encountered in real drug development.
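
One way to quantify this familiarity effect is to evaluate the same model under a random split and under a scaffold split, where whole Bemis-Murcko scaffolds are held out of training. Below is a minimal sketch using RDKit and scikit-learn; `load_toxicity_dataset` is a hypothetical placeholder for any labelled SMILES set, and the random forest stands in for whatever model is actually being validated. A large gap between the two AUCs suggests the headline metric is measuring memorised chemistry, not generalisation.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def featurise(smiles_list):
    """2048-bit Morgan fingerprints (radius 2); assumes all SMILES parse."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
           for s in smiles_list]
    return np.array([list(fp) for fp in fps])

def scaffold_split(smiles_list, test_fraction=0.2):
    """Hold out whole Bemis-Murcko scaffolds so the test set is structurally novel."""
    groups = {}
    for i, s in enumerate(smiles_list):
        groups.setdefault(MurckoScaffold.MurckoScaffoldSmiles(smiles=s), []).append(i)
    test_idx, n_test = [], int(test_fraction * len(smiles_list))
    # Assign the rarest scaffolds to the test set first
    for g in sorted(groups.values(), key=len):
        if len(test_idx) >= n_test:
            break
        test_idx.extend(g)
    train_idx = [i for i in range(len(smiles_list)) if i not in set(test_idx)]
    return train_idx, test_idx

def compare_splits(smiles, y):
    """AUC under a random split vs a scaffold split. A large gap means the
    random-split number mostly reflects dataset familiarity."""
    X, y = featurise(smiles), np.asarray(y)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
    rnd = roc_auc_score(yte, RandomForestClassifier(random_state=0)
                        .fit(Xtr, ytr).predict_proba(Xte)[:, 1])
    tr, te = scaffold_split(smiles)
    scaf = roc_auc_score(y[te], RandomForestClassifier(random_state=0)
                         .fit(X[tr], y[tr]).predict_proba(X[te])[:, 1])
    return rnd, scaf

# Usage, with your own labelled set:
# smiles, y = load_toxicity_dataset()   # hypothetical loader
# rnd_auc, scaf_auc = compare_splits(smiles, y)
```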

This has been discussed at length in the past few years, but recent work has focused on understanding exactly why these benchmarks don't work, enabling the development of better and more rigorous ones (see: OpenADMET). Pat Walters recently posted a visual demonstration of why a common benchmark used for virtual screening, DUD-E, is flawed as a way to validate machine learning models. The decoys (inactives) simply look different from the true actives, and that is the pattern the model, a glorified pattern matcher, will learn, rather than the underlying physics of drug-protein binding.

Figure: t-SNE mapping of chemical diversity across six DUD-E datasets using RDKit Morgan fingerprints. In each plot, active compounds (red) exhibit significant structural similarity, clustering together and remaining distinct from the decoy population (light blue). This distribution highlights the inherent "analogue bias" in the dataset, which can lead to inflated machine learning performance metrics. Source: Pat Walters
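
For readers who want to run this sanity check on their own actives/decoys sets, here is a rough sketch in the same spirit (not Walters' actual code). `load_dude_target` is a hypothetical loader; any two lists of active and decoy SMILES will do. Jaccard distance on the binary Morgan fingerprints approximates Tanimoto dissimilarity.

```python
import numpy as np
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.manifold import TSNE

def plot_actives_vs_decoys(actives, decoys):
    """Embed Morgan fingerprints with t-SNE and colour by class.
    Tight red clusters well separated from the decoys suggest the
    benchmark can be solved on gross structure alone (analogue bias)."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
           for s in list(actives) + list(decoys)]
    X = np.array([list(fp) for fp in fps], dtype=bool)
    labels = np.array([1] * len(actives) + [0] * len(decoys))

    emb = TSNE(n_components=2, metric="jaccard", init="random",
               random_state=0).fit_transform(X)

    plt.scatter(*emb[labels == 0].T, s=4, c="lightblue", label="decoys")
    plt.scatter(*emb[labels == 1].T, s=10, c="red", label="actives")
    plt.legend()
    plt.title("Do the actives cluster apart from the decoys?")
    plt.show()

# Usage:
# actives, decoys = load_dude_target("egfr")   # hypothetical loader
# plot_actives_vs_decoys(actives, decoys)
```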

Findings from groups like Leash Bio show that models pick up other surprising patterns too, such as the identity of the chemist who designed a compound.


Is Validation Ever “Good Enough”? Why AI Toxicology Models Are Failing in the Real World

The core problem lies in the disconnect between the in silico world and biological reality.

  1. Toxicology datasets are notoriously sparse and biased. Models are typically trained on limited collections of well-studied compounds, often drawn from similar regions of chemical space. When a genuinely novel scaffold is introduced (precisely the kind of molecule where predictive AI toxicology models could offer the greatest value), performance frequently deteriorates.
  2. At the same time, many toxicology models simplify complex biological endpoints into a binary "toxic" or "non-toxic" label. This ignores the critical aspects of dosage, pharmacokinetics, drug-drug interactions, and genetic variability that determine real-world adverse events. A perfect F1 score on a binary classification doesn't capture the difference between a dose-dependent, reversible side effect and a rare, life-threatening idiosyncratic reaction.
  3. A similar issue arises from reliance on convenient surrogate endpoints. Models are often trained on in vitro data such as hERG blockade for cardiotoxicity. While useful, predicting hERG inhibition with 99% accuracy is not the same as predicting Torsades de Pointes in a human patient. The metric is superb, but the underlying prediction is a step removed from the ultimate biological consequence. (The sketch after this list illustrates what the binary, surrogate view discards.)
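
As a toy illustration of points 2 and 3, consider how a binary hERG label erases exposure context. All numbers below are invented, and the 30-fold IC50-to-free-Cmax margin is just one widely used rule of thumb (cf. Redfern et al.), not a definitive threshold:

```python
# All numbers invented for illustration.
candidates = [
    # (name, hERG IC50 in uM, projected free Cmax in uM)
    ("cmpd_A", 1.0, 0.5),    # potent blocker AND high exposure: real risk
    ("cmpd_B", 1.0, 0.005),  # same IC50, but 200x margin: plausibly manageable
    ("cmpd_C", 30.0, 3.0),   # "clean" by a 10 uM cutoff, yet only a 10x margin
]

CUTOFF_UM = 10.0    # a typical binary-label threshold on hERG IC50
MIN_MARGIN = 30.0   # IC50 / free Cmax heuristic (cf. Redfern et al.)

for name, ic50, cmax in candidates:
    binary_label = "toxic" if ic50 < CUTOFF_UM else "non-toxic"
    margin = ic50 / cmax
    flag = "FLAG" if margin < MIN_MARGIN else "ok"
    print(f"{name}: binary={binary_label:9s} margin={margin:6.0f}x {flag}")

# cmpd_A and cmpd_B share a binary label but carry very different risk, and
# cmpd_C is "clean" by the cutoff yet flagged on margin: the margin view
# recovers the exposure context that the binary label discards.
```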

Taken together, these limitations help explain why strong benchmark performance so often fails to translate into reliable real-world predictions. We are increasingly aware that many current validation approaches are insufficient, and we are beginning to understand why.

The next question, then, is practical. If existing benchmarks fall short, how can we evaluate our models in ways that better reflect the decisions they are meant to support?


From Metrics to Meaning: Rethinking How We Validate AI Toxicology Models

At Ignota Labs, we don't judge our models by how well they predict outcomes for a small set of molecules in a limited region of chemical space. We care about one thing: how well can this model be used to save lives, by understanding adverse events and accelerating the delivery of safer medicines?

To bridge this gap, we must move toward more practical, rigorous, and clinically relevant use cases for AI toxicology models. Instead of resting on internal validation metrics, we must demand evidence of external, consequential validation.

Here are examples of how AI toxicology models can be validated with a focus on real-world impact:

| Use Case Goal | Traditional Metric Focus | Proposed Rigorous Validation & Data Source |
| --- | --- | --- |
| Predicting Clinical Adverse Events | Accuracy on in-house or public ADR databases (binary classification). | Challenge: Does your model, trained on secondary pharmacology (Sec Pharm) panel data, predict the incidence of Drug-Induced Liver Injury (DILI) observed in Phase 1 Clinical Trial data? |
| Prioritising Chemical Synthesis | Enrichment factor (EF) or area under the curve (AUC). | Challenge: When applied prospectively to a virtual library of 10,000 novel compounds, does the model successfully filter out the top 5% most likely toxic candidates, resulting in a statistically significant reduction in the toxicity rate of the synthesised molecules compared to a non-AI-guided approach? |
| De-risking Lead Optimisation | Sensitivity and specificity for a defined toxic mechanism. | Challenge: Can the model successfully re-rank a set of clinical-stage candidates (some failed, some successful) based on their actual observed safety profile in humans (e.g., post-market surveillance or FDA adverse event reporting data)? |
| Identifying Toxic Alerts in Novel Scaffolds | Performance on common chemotypes (known toxins). | Challenge: Does the model flag the known toxic liabilities (e.g., DNA damage, reactive metabolite formation) in a set of proprietary, never-before-seen scaffolds that were later determined to be toxic in vivo during preclinical testing? |
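
To make the "Prioritising Chemical Synthesis" row concrete: the question is not the model's AUC but whether AI-guided triage measurably lowered the toxicity rate of what was actually synthesised. The sketch below, with entirely made-up counts and an illustrative design, shows one simple way to test that using a one-sided Fisher exact test from scipy.

```python
from scipy.stats import fisher_exact

# Outcomes of the synthesised molecules in each arm (invented counts)
ai_guided = {"toxic": 6, "clean": 94}    # library filtered by the model first
unguided  = {"toxic": 18, "clean": 82}   # selection without the model

table = [[ai_guided["toxic"], ai_guided["clean"]],
         [unguided["toxic"],  unguided["clean"]]]

# One-sided test: is the AI-guided arm's toxicity rate genuinely lower?
odds_ratio, p_value = fisher_exact(table, alternative="less")

print(f"toxicity rate: {6/100:.0%} (AI-guided) vs {18/100:.0%} (unguided)")
print(f"one-sided Fisher exact p = {p_value:.4f}")
# A p-value below a pre-registered threshold, from a prospective experiment,
# is consequential evidence; a good AUC on a retrospective benchmark is not.
```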


Human biology doesn't care about your F1 score. Our validation frameworks must account for the roughly 90% failure rate of drug candidates in clinical development, and for the fact that over 80% of genomic data used in drug discovery still comes from populations of European descent, leaving massive gaps in safety evidence for the rest of the world. Until they do, and until industry benchmarks reflect the "messy" reality of global biology and the high-stakes environment of clinical trials, AI will remain a peripheral tool. We aren't revolutionising drug safety; we're just refining the status quo.

At Ignota Labs, we are trading statistical comfort for clinical accountability. A model's value isn't an arbitrary decimal point in a report; it's the measurable reduction of late-stage attrition. It’s the number of potential tragedies intercepted before a molecule ever touches a patient.

Author: Dr Layla Hosseini-Gerami