
Lessons from the Polaris ADMET competition

Following our competition win, we explored what the competition results tell us about the state of ML for drug discovery

By Alex Rich · June 12, 2025 · 12 min read

Polaris, in conjunction with ASAP Discovery and OpenADMET, recently wrapped up their first blinded prediction competition for ADMET, potency, and ligand pose prediction. Polaris was formed to improve the rigor of machine learning methods evaluation in drug discovery, and this collaboration with ASAP enabled the use of data from ASAP’s MERS-CoV/SARS-CoV-2 antiviral program, for which a preclinical candidate was recently disclosed. The collaboration offered a rare opportunity to benchmark competing ML approaches in a setting closely mirroring real-world application: the data was from a real drug program that was split chronologically into training and test sets, and participants were fully blind to the measured values of the test set.

Compounds from the ASAP antiviral program

We are big believers in the value of improved, program-focused evaluation approaches for ML in drug discovery, so we see this new competition format as a big step forward for the field. We participated and were happy to see that our latest generation of ADMET models achieved first place in the ADMET competition (under username prairewarbler10). The full competition results and methodology reports for each entrant can be found here.

Now that the results are out, we’ve spent some time looking through the data and the reports submitted by each entrant to see what we can learn about what works in ADMET and potency prediction, and what the results might mean for performance in other programs. We had two key takeaways from this exercise:

  1. Adding global ADMET data to the models meaningfully improved performance over local models, but applying only non-ADMET pre-training produced mixed results.
  2. The performance of modeling approaches varies significantly across different programs, limiting the conclusions that can be drawn from just one program.

The first takeaway extends our previous investigations into the role of global vs. local models. On the one hand, the competition results show a clear benefit from bringing outside task-specific (i.e., ADMET) data into the training set of the models. On the other hand, training on massive amounts of non-task-specific data (e.g. large amounts of quantum mechanics data), a technique that has led to other recent breakthroughs in AI, so far continues to have a limited payoff.  

The second takeaway emerged from comparing the ASAP program’s ADMET data to a broad range of programs in Inductive Bio’s consortium. We saw that the performance of ML methods can vary meaningfully based on the choice of program and its unique chemical space. This highlights the value that additional competitions can provide to the community, so we hope that Polaris will continue to conduct competitions in this blinded format so that we can all learn more about which methods work best.

Below, we dive into the data underlying each of these takeaways. (We also presented a version of these analyses during the winners symposium, available on YouTube.) Since many of the reports for the Polaris submissions are high-level summaries without public code, we’ve implemented two baseline models trained only on local data—one descriptors-based and one fingerprint-based—and shared the code and details here. These baselines enable us to make apples-to-apples comparisons across tasks and programs. In the spirit of our “strong baseline” for docking, the baselines are intended to be simple, concise, and competitive, but not necessarily to produce the best performance possible with their respective methods.
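For readers who want a concrete feel for what a local baseline like this involves, here is a minimal sketch of the fingerprint flavor, assuming Morgan fingerprints plus a gradient-boosted tree regressor. The function names, featurization, and hyperparameters here are our illustration only; the actual configurations live in the linked baseline repository.

```python
# Minimal sketch of a local-only fingerprint baseline (illustrative; the published
# baseline code in the linked repository may use different features and settings).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import GradientBoostingRegressor


def morgan_features(smiles_list, radius=2, n_bits=2048):
    """Convert SMILES to binary Morgan fingerprints, skipping unparseable structures."""
    rows, kept = [], []
    for i, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        rows.append(np.asarray(list(fp), dtype=np.float32))
        kept.append(i)
    return np.vstack(rows), kept


def fit_fingerprint_baseline(train_smiles, train_values):
    """Fit a single-endpoint regressor on log10-transformed assay values."""
    X, kept = morgan_features(train_smiles)
    y = np.log10(np.asarray(train_values, dtype=float)[kept])
    model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
    model.fit(X, y)
    return model
```

The descriptor flavor would follow the same pattern, swapping the fingerprint vector for a set of computed physicochemical descriptors.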


Takeaway 1: Bringing in additional ADMET data led to the best model performance in the Polaris-ASAP competition, but massive non-task-specific pretraining had limited payoff

The results of the ADMET competition highlight the value of bringing in additional ADMET data outside the program of interest for model training. Eight of the top ten entrants made use of additional ADMET data, either directly training their model with this data or using pre-trained ADMET models as features. In contrast, only four of the 11th-20th place entrants appeared to use additional ADMET data.

Compared to Inductive’s winning submission, which used additional ADMET training data, the best pretrained deep learning model without additional ADMET data had 23% higher error (5th place) and the best model using traditional ML and no additional data (or pretrained deep learning features) had 41% higher error (12th place). Our descriptor and fingerprint baselines had 53% and 60% higher error, respectively, in line with 20th and 24th place in the competition.

Comparison of selected models on the ADMET competition track. Reported error is an average of log MAE across the five ADMET tasks (HLM, MLM, LogD, KSol @ pH 7.5, and MDR1-MDCKII Papp), which was the primary metric evaluated in the competition.
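For reference, here is a minimal sketch of how we compute this summary metric in the rest of this post, under the assumption that linear-scale endpoints get a standard log10 transform while LogD is used as-is (the competition’s official scoring later moved to a log1p transform; see footnote 2).

```python
# Sketch of the primary ADMET metric as used in this post: MAE on log-transformed
# values, averaged across the five tasks. Treating LogD (already log-scale) without
# a further transform is our assumption, not a detail taken from the competition docs.
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))))

def log_mae(y_true, y_pred):
    """MAE after a standard log10 transform (see footnote 2 on log vs. log1p)."""
    return mae(np.log10(y_true), np.log10(y_pred))

def average_admet_error(per_task_errors):
    """Competition-style summary: the mean of the five per-task errors."""
    return float(np.mean(per_task_errors))

# e.g. average_admet_error([log_mae(hlm_true, hlm_pred), log_mae(mlm_true, mlm_pred),
#                           mae(logd_true, logd_pred), log_mae(ksol_true, ksol_pred),
#                           log_mae(papp_true, papp_pred)])
```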

Several submissions made use of large-scale non-ADMET pretraining. The 5th place submission used MolMCL, an approach based on self-supervised learning from millions of chemical structures with no experimental data. In addition, two submissions used MolE and MolGPS, large models that involve massive supervised training on non-ADMET chemistry data, including experimental data from sources like ChEMBL and PubChem, and computed quantum mechanical properties. A fourth submission used a finetuned LLM (Gemini 1.5 Flash), essentially pretraining on the entire internet. The recent history of machine learning in other fields (e.g., language and vision) suggests that this kind of massive training will ultimately lead to breakthroughs in model performance, allowing the model to learn sophisticated internal representations that can be rapidly adapted to the predictive task at hand. So are we there yet for chemistry?

The results in this competition suggest that we aren’t quite there yet. Large pre-trained models are starting to show benefit, but not in a clear and consistent way. After MolMCL, the top-performing model using large non-ADMET pretraining was MolE, which finished in 10th place, ahead of the best local ML model but with 37% higher error than our winning submission. MolGPS yielded higher error still, despite that submission also making use of external ADMET data. The finetuned LLM submission came in a statistical tie with the best local ML model, and it also made use of external ADMET data.

In the potency competition, MolE and MolGPS came in first and second place, but this seems to reflect the lack of diverse, target-specific external data that could be brought to bear for potency, rather than these models being better suited to potency than to ADMET. The gap in error between the winning MolE model and the best submitted local model was 7.6%, and the gap between MolE and our fingerprint baseline was only 1.7%. With the caveat that the fingerprint baseline was not submitted as a truly blind entrant to the competition, this suggests that simple local models can still come very close to the performance of a massively pre-trained, non-task-specific model.

Comparison of selected models on the ADMET and Potency competition tracks, highlighting models that used large-scale non-ADMET pretraining.

Overall, this set of results gives a mixed view of the state of general-purpose pretrained chemistry models. Interestingly, the unsupervised MolMCL approach performed better than the supervised MolE and MolGPS approaches, despite analyses in both of those papers showing the benefits of supervision. These results indicate that MolMCL is worthy of additional attention, but it’s hard to know whether its performance reflects a general benefit of unsupervised-only pretraining, the MolMCL approach specifically, or something unique to the dataset in this particular competition. As we’ll show below, model performance comparisons can vary meaningfully from program to program. Ultimately, we believe that massive pretraining will bear fruit for chemistry as it has for other fields, but the keys will be having both the right pretraining tasks and a broad, rigorous set of evaluations to determine which approaches consistently work well.


Takeaway 2: No single program can tell the whole story of performance across models

The results of the Polaris-ASAP competition give us a rare opportunity to see how different methods stack up in a rigorous, blinded, and program-specific manner. But ultimately, what we can learn from this one competition reflects the results of just one program. Chemical space is vast, and what holds in one program might not hold in another. We wanted to get a better intuitive understanding of how much the choice of program matters, and of how different the results might look for future program-based competitions.

To do this, we explored the performance of our two baseline models across a set of anonymized programs from Inductive Bio’s ADMET consortium. We focused on three key properties—human liver microsomal stability (HLM), kinetic solubility in PBS @ pH 7.4 (KSOL), and MDR1-MDCK apparent permeability[1]—and selected several programs for each assay with 150 to 650 compounds available.

For each program, we split the compounds chronologically, using the first 75% of the data for training and the final 25% as a test set, approximating the split used for the Polaris-ASAP competition. We then evaluated both baseline models on log MAE[2], the primary metric used in the competition, as well as Spearman r as a supplementary metric, since it captures a qualitatively different aspect of model performance.
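A minimal sketch of this per-program protocol looks roughly like the following; the "date" column name is a placeholder rather than the actual consortium schema.

```python
# Sketch of the cross-program evaluation protocol: a chronological 75/25 split per
# program, then log MAE and Spearman r on the held-out compounds.
import numpy as np
from scipy.stats import spearmanr

def chronological_split(df, date_col="date", train_frac=0.75):
    """Earliest 75% of compounds train the model; the most recent 25% are held out."""
    df = df.sort_values(date_col).reset_index(drop=True)
    cutoff = int(len(df) * train_frac)
    return df.iloc[:cutoff], df.iloc[cutoff:]

def evaluate(y_true, y_pred):
    """Primary metric (log MAE) plus the rank-based supplementary metric (Spearman r)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "log_mae": float(np.mean(np.abs(np.log10(y_true) - np.log10(y_pred)))),
        "spearman_r": float(spearmanr(y_true, y_pred).correlation),
    }
```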

The results of this exercise are shown below, starting with HLM. The descriptor and fingerprint model performance in terms of MAE and Spearman r is shown for each of the 10 programs, with the programs ordered by descriptor MAE for legibility. The dotted horizontal lines indicate the MAE and Spearman r achieved by each model on the ASAP dataset.

Performance of descriptor and fingerprint baselines to predict HLM across several programs. Models were evaluated by MAE (log scale) and Spearman r. Error bars show a bootstrapped standard error. Statistically significant differences between the two baselines are indicated by an asterisk above the program number.

Our first finding is that the performance differential between models depends a lot on the program. In HLM, the fingerprint-based model achieves lower MAE than the descriptor-based model in 8 out of 10 programs, though the performance gap varies from as low as 4% to as large as 24%. There are also 2 programs where the descriptor model performs better, with statistically significant differences at the 0.05 level occurring in both directions (we’ve omitted adjustment for multiple comparisons in order to replicate the tests that would be performed within a single-program analysis). This means that questions about how much better the best method is than the alternatives, and about which method is truly best, can be answered most accurately by comparing across a range of programs.
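For concreteness, one way such a per-program comparison could be run is a paired bootstrap over compounds. The sketch below is our assumption of a reasonable procedure, not necessarily the exact test behind the asterisks in the figures.

```python
# Paired bootstrap comparison of two models on one program (illustrative; the
# significance tests shown in the figures may be implemented differently).
import numpy as np

def paired_bootstrap_mae(abs_err_a, abs_err_b, n_boot=10_000, seed=0):
    """Resample compounds with replacement and test whether the MAE difference
    between model A and model B reliably stays on one side of zero."""
    rng = np.random.default_rng(seed)
    a = np.asarray(abs_err_a, dtype=float)
    b = np.asarray(abs_err_b, dtype=float)
    n = len(a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs[i] = a[idx].mean() - b[idx].mean()
    observed = a.mean() - b.mean()
    # Two-sided bootstrap p-value: how often the resampled difference crosses zero.
    p_value = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return observed, min(p_value, 1.0)
```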

Examining the MDR1-MDCK and KSOL results highlights a second takeaway: the overall performance of all methods can vary meaningfully across programs, and this variation can point to nuances specific to a given program. In HLM, we saw that the MAE and Spearman r achieved by our baseline models on the Polaris-ASAP competition are fairly typical of those achieved on the ten consortium programs. By contrast, in MDR1-MDCK and KSOL the Polaris-ASAP results are more extreme on some metrics, with the ASAP dataset yielding close to the lowest MAE in KSOL and by far the highest Spearman r in MDR1-MDCK.

Performance of descriptor and fingerprint baselines to predict MDR1-MDCK Papp and KSol across several programs. Models were evaluated by MAE (log scale) and Spearman r. Error bars show a bootstrapped standard error. Statistically significant differences between the two baselines are indicated by an asterisk above the program number.

Looking more closely at the ASAP test set can clarify these patterns. The test set is composed of two distinct series, and for MDR1-MDCKII, these series behave quite differently. The larger spirocyclic series has permeabilities ranging widely from <0.1 to 30 µcm/s, while the smaller series has permeabilities mostly clustered around 25-30 µcm/s. Since both series appear in the training data as well, this makes it relatively easy for the model to rank compounds by permeability, and may be driving the unusually high Spearman r.

For KSOL, a different explanation is at play. Here, most of the data from both series is clustered near 400 µM, which is the ceiling of the assay. This makes it easy for a model to make predictions with low average error, but hard to rank compounds, explaining the low MAE paired with low Spearman r.
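A toy example with entirely synthetic data (not the ASAP measurements) shows how an assay ceiling produces exactly this pattern:

```python
# Synthetic illustration of the ceiling effect: when most measurements sit at the
# assay ceiling, a near-constant prediction gets a low log MAE yet carries almost
# no rank information (Spearman r near zero). All values here are made up.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 200
at_ceiling = rng.random(n) < 0.85
true_ksol = np.where(at_ceiling, 400.0, rng.uniform(5, 350, n))   # ~85% at the 400 uM ceiling
pred_ksol = 380.0 + rng.normal(0, 10, n)                          # near-constant predictions

log_mae = np.mean(np.abs(np.log10(true_ksol) - np.log10(pred_ksol)))
rho = spearmanr(true_ksol, pred_ksol).correlation
print(f"log MAE = {log_mae:.2f}, Spearman r = {rho:.2f}")   # low error, near-zero correlation
```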

Finally, we can compare these to HLM, which has a more even spread of measured values across both subseries, in line with its more typical performance numbers.


Distribution of measured values of MDR1-MDCKII Papp, HLM, and KSol in the competition test set, showing compounds in series 1 (red), series 2 (orange), and other (blue). The distribution of data differs markedly across properties and between scaffolds.


The nuances observed in the ASAP data are a strength of the dataset, since these kinds of nuances arise in every drug program. This is one of the reasons we believe there are major advantages to a program-based benchmarking and model-comparison approach. But since the nuances differ with every program, continuing to evaluate methods on additional programs is critical.


Conclusions

It’s been a lot of fun participating in the first Polaris ADMET competition and poring over the results. Kudos to Polaris and ASAP for putting the competition together. It has highlighted the value of bringing in external task-specific data to ADMET modeling, the work remaining to improve large-scale pretrained models for chemistry, and the fact that every program-based benchmark will come with nuances specific to that program.

Given that every program is unique, we’d like to keep seeing more competitions like this. Polaris and ASAP have done a huge service to the community by choosing to disclose this data via a competition format and by putting in the work to manage the competition. We’d love to see a series of ADMET, potency prediction, and pose prediction challenges, much like the ongoing CACHE series of hit-discovery challenges. By comparing modeling approaches in a blinded manner across diverse programs, the field can continue to refine its understanding of what works in practice.

[1] These programs contain a mix of MDCK cell lines.

[2] After the initial interim results, the Polaris-ASAP competition switched to calculating MAE with a log1p transform rather than a standard log transform. While this makes it easier to deal with low-value cutoffs in the context of a competition, it distorts and underestimates the log MAE of programs where many compounds have values below 1, so we’ve used standard log MAE for this cross-program comparison. Polaris-ASAP competition baseline results are presented in standard log MAE units in this section as well.
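As a quick numeric illustration of this point, the toy comparison below uses base-10 logs for both transforms; the competition’s exact log1p implementation may differ.

```python
# Numeric illustration of footnote 2: for values below 1, a log1p-style transform
# compresses errors far more than a standard log transform, so the resulting MAE
# understates the error. Base-10 is used for both transforms in this toy example.
import numpy as np

y_true = np.array([0.2, 0.5, 0.8])   # e.g. low-value measurements below 1
y_pred = 2.0 * y_true                # every prediction is off by 2x

standard_log_mae = np.mean(np.abs(np.log10(y_pred) - np.log10(y_true)))          # ~0.30
log1p_style_mae = np.mean(np.abs(np.log10(1 + y_pred) - np.log10(1 + y_true)))   # ~0.12
print(standard_log_mae, log1p_style_mae)
```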