Reference paper

What I think: these models work great if you fine-tune them for specific tasks, but in real biological research that often isn’t applicable. Most of the data used in real-world research is unlabeled and exploratory, with the goal of finding something novel, not classifying something known.

The 2025 Kedzierska et al. paper found that popular single-cell foundation models (scGPT, Geneformer) struggle with two main tasks in zero-shot settings:

  • they have trouble separating different cell types
  • they fail at batch integration, which in my opinion pretty much renders them useless in their current state : /
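To make the first failure concrete, here is a toy sketch (my own illustration, not from the paper) of what "separating cell types" means for a zero-shot embedding: if the model is useful without fine-tuning, cells of different types should land far apart relative to cells of the same type. The 2-D points and the between/within distance ratio are made-up stand-ins for real embeddings and real benchmark metrics:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two toy "cell types" in a 2-D embedding space (hypothetical data).
type_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
type_b = rng.normal(loc=[3.0, 0.0], scale=0.5, size=(50, 2))

def mean_pairwise_dist(x, y):
    """Average Euclidean distance between all pairs drawn from x and y."""
    diffs = x[:, None, :] - y[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).mean())

within = 0.5 * (mean_pairwise_dist(type_a, type_a) + mean_pairwise_dist(type_b, type_b))
between = mean_pairwise_dist(type_a, type_b)

# Ratio >> 1 means the types are well separated; ratios near 1 mean the
# embedding has collapsed the types together, which is the failure mode here.
print(f"separation ratio (between/within): {between / within:.1f}")
```

Real evaluations use richer metrics (silhouette scores, clustering agreement), but the underlying question is the same distance comparison.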

Hallucinations show up too: when predictions couldn’t be made, the models often made up values or fell back on averages. This resulted in only modest correlation between reconstructed gene expression patterns and the ground truth.
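Worth noting why "average values" still produce a modest correlation: per-gene means vary a lot across genes, so a model that just regurgitates the dataset-average profile correlates nontrivially with every individual cell without having learned anything cell-specific. A quick synthetic demonstration (all numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 100, 200

# Simulate counts: each gene has its own expression level, each cell adds Poisson noise.
base_rates = rng.gamma(shape=2.0, scale=2.0, size=n_genes)
counts = rng.poisson(base_rates, size=(n_cells, n_genes)).astype(float)

# The "hallucinated" prediction: just the average expression profile.
gene_means = counts.mean(axis=0)

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Correlate the average profile against every individual cell.
rs = [pearson(gene_means, cell) for cell in counts]
print(f"mean r of average-profile 'prediction' vs each cell: {np.mean(rs):.2f}")
```

The correlation comes out well above zero, which is why "modest correlation with ground truth" on its own is a weak defense of a model's reconstructions.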

They hold promise, but the vision for a universal model is still quite far off. The next project I work on (Xu lab) may involve working on solutions, which I’m looking forward to.

Some solutions I thought about:

  • For batch integration failures:

    • baseline methods like scVI inherently incorporate batch labels during training. Foundation models don’t; they try to learn batch correction indirectly through exposure to diverse data. Maybe adding explicit batch awareness could help.
    • Train on more curated datasets - scGPT, which trained on 33 million cells, performed worse than versions trained on smaller, tissue-specific datasets.
    • a new architecture entirely lol
    • transfer knowledge from proteins, as proteins have more established AI models
    • improve pretraining. The authors themselves suggested that masked language modeling (MLM) may not be the right fit.
  • Broader solutions:

    • I think we have to shift the focus to zero-shot performance from the start, rather than optimizing fine-tuning performance and hoping zero-shot works.
    • Create standardized benchmarks reserved exclusively for model evaluation that should never be used for pretraining. No training/test splits, just an honest assessment of zero-shot performance.
    • Better ways to verify learning. The paper discussed how the models couldn’t even perform their own pretraining task well, which is definitely a problem.
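The scVI-style "explicit batch awareness" idea from the first bullet is mechanically simple: instead of hoping the model infers batch effects, hand it the batch label as an input covariate. A minimal, hypothetical sketch of the data-side half of that idea (the names and shapes are mine, not scVI's API):

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, n_genes, n_batches = 6, 4, 2

# Toy expression matrix and a batch label for each cell (made-up data).
expression = rng.poisson(5.0, size=(n_cells, n_genes)).astype(float)
batch_ids = np.array([0, 0, 0, 1, 1, 1])

# One-hot encode the batch label and append it to the expression features.
# This is the core of the scVI approach: the batch becomes an explicit input
# the encoder can condition on, instead of a confounder it must infer.
one_hot = np.eye(n_batches)[batch_ids]
conditioned_input = np.concatenate([expression, one_hot], axis=1)

print(conditioned_input.shape)  # (6, 6): 4 gene columns + 2 batch indicator columns
```

For a foundation model the analogous move would be a batch token or embedding added during pretraining, which is a bigger change but the same principle.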