Seeing the Cell: How Microscopy Images Are Reshaping Drug Discovery at Scale

The dominant logic of early-stage drug discovery rests on a seductive assumption: that molecules resembling a known drug will behave like that drug. This principle, formalized as quantitative structure-activity relationship (QSAR) modeling, has guided compound library selection for generations. It is also, in a growing number of cases, demonstrably wrong. Two molecules can be nearly identical on paper and produce entirely different effects inside a cell. Two molecules can look nothing alike and converge on the same biological outcome through completely different mechanisms. A new study from researchers at Institut Curie and ENS Paris-PSL confronts this tension head-on, deploying cellular morphology as a biological compass across the largest publicly available compound imaging dataset ever assembled.

Fig. 1. Building robust phenotypic profiles from JUMP-CP. Panel (b) benchmarks over 5,000 pipeline combinations on two axes: biological signal preservation (x-axis) and batch effect removal (y-axis). The selected pipeline, using DINOv2-Giant encoding with whitening normalization, sits in the optimal bottom-right quadrant. UMAP plots in panel (d) show data from 9 laboratories before normalization (left, clustered by lab) and after (right, integrated by biology).

The Limits of Chemical Intuition

Phenotypic drug discovery, the practice of selecting compounds based on what they do to cells rather than what they look like chemically, has a long and productive history. Evidence suggests it is more successful than target-based approaches at delivering first-in-class medicines, precisely because it makes no assumptions about which molecular target matters. The challenge has always been scale: how do you systematically compare the cellular effects of hundreds of thousands of compounds when those experiments were run in a dozen different laboratories, on different microscopes, in different batches?

The JUMP-CP dataset, assembled by a consortium of 12 laboratories and released in 2023, offered a potential answer. It contains Cell Painting images for 112,480 unique chemical compounds, each profiled in human U2OS osteosarcoma cells using five fluorescent channels that label nuclei, endoplasmic reticulum, nucleoli, actin and Golgi apparatus, and mitochondria. The scale is genuinely unprecedented. The problem is that the dataset's very breadth is also its greatest liability: non-biological variability from different imaging systems and experimental batches, what the field calls batch effects, can swamp the biological signal you are trying to measure.

Taming 112,480 Compounds Across 12 Laboratories

The first and most technically demanding contribution of this work is a pipeline that makes the entire JUMP-CP dataset usable for pairwise compound comparison. The authors swept more than 5,000 combinations of feature extraction methods, normalization strategies, and replicate aggregation approaches, evaluating each on two criteria simultaneously: how well it preserved biological signal (measured by mean average precision, or mAP, for eight positive control compounds present on every plate) and how effectively it removed batch effects (measured by mAP on plate labels, where a low score means profiles from the same plate are no more similar to each other than to profiles from different plates).
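To make the two evaluation axes concrete, here is a minimal sketch of mean average precision over cosine-ranked profiles. The function names and the toy data are illustrative assumptions, and the consortium's exact mAP variant may differ in detail. The same routine is scored twice: once with compound labels (higher is better) and once with plate labels (lower is better).

```python
from statistics import mean

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def average_precision(ranked_labels, query_label):
    """AP for one query; ranked_labels are ordered most similar first."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return mean(precisions) if precisions else 0.0

def mean_average_precision(profiles, labels):
    """Average AP over every profile that has at least one same-label partner."""
    aps = []
    for i, query in enumerate(profiles):
        others = [j for j in range(len(profiles)) if j != i]
        if not any(labels[j] == labels[i] for j in others):
            continue
        others.sort(key=lambda j: cosine(query, profiles[j]), reverse=True)
        aps.append(average_precision([labels[j] for j in others], labels[i]))
    return mean(aps)

# Toy example: four wells, two compounds, two plates (hypothetical values).
profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
bio_map = mean_average_precision(profiles, ["cpdA", "cpdA", "cpdB", "cpdB"])
plate_map = mean_average_precision(profiles, ["p1", "p2", "p1", "p2"])
```

In this toy case replicates of the same compound retrieve each other perfectly while plate membership is poorly retrieved, which is the pattern a good pipeline should produce.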

Key Pipeline
Raw Cell Painting images are encoded using DINOv2-Giant, a self-supervised Vision Transformer with 1.3 billion parameters originally trained on natural images. Per-well profiles are then whitened using a matrix fit to DMSO negative control wells on each plate, followed by an inverse normal transform (INT) to standardize feature distributions. Replicate wells are aggregated into a single phenotypic profile per compound. Pairwise comparison uses cosine similarity.
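The normalization steps can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code. The ZCA form of whitening, the `eps` regularizer, the per-plate rank-based INT, and mean aggregation of replicates are all choices made here for the example, and every function name is hypothetical. The encoder step is omitted, so `wells` stands in for already-extracted embeddings.

```python
import numpy as np
from statistics import NormalDist

def fit_whitening(dmso, eps=1e-6):
    """Fit a ZCA-style whitening transform on DMSO control wells (rows = wells)."""
    mu = dmso.mean(axis=0)
    cov = np.cov(dmso - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # covariance is symmetric
    # eps keeps the inverse square root finite when cov is rank-deficient
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mu, W

def inverse_normal_transform(x):
    """Rank-based INT applied independently to each feature column."""
    n = x.shape[0]
    ranks = x.argsort(axis=0).argsort(axis=0) + 1  # 1..n, ties broken arbitrarily
    quantiles = (ranks - 0.5) / n                  # strictly inside (0, 1)
    return np.vectorize(NormalDist().inv_cdf)(quantiles)

def normalize_plate(wells, is_dmso):
    """Whiten all wells of one plate against its DMSO controls, then apply INT."""
    mu, W = fit_whitening(wells[is_dmso])
    return inverse_normal_transform((wells - mu) @ W)

def aggregate_replicates(profiles, compound_ids):
    """Average replicate wells into one profile per compound (assumed: mean)."""
    ids = np.asarray(compound_ids)
    return {c: profiles[ids == c].mean(axis=0) for c in dict.fromkeys(compound_ids)}
```

Fitting the transform on controls only is the point of the design: the DMSO wells define what "no effect" looks like on that plate, so plate-specific technical structure is divided out before compounds are compared.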

The winning combination was not the obvious one. DINOv2-Giant, a model that has never seen a Cell Painting image during training, outperformed both OpenPhenom (a masked autoencoder trained explicitly on Cell Painting data) and ChAda-ViT (a channel-adaptive transformer designed for microscopy). It also outperformed handcrafted features from CellProfiler. The normalization step proved equally critical: before applying the whitening transform, profiles from the same laboratory clustered tightly together in UMAP space, reflecting batch structure rather than biology. After normalization, those clusters dissolved, and positive control replicates from different laboratories converged, confirming that the biological signal had been preserved while the technical noise was suppressed.

Pipeline at a glance
5,000+ combinations tested. DINOv2-Giant + whitening normalization + inverse normal transform selected as optimal. Validated on ~600 plates spanning all 12 JUMP-CP laboratories.

Doubling Hit Rates Across 65 Real Screens

With a robust representation in hand, the compound selection method itself is conceptually simple. Given a positive control compound with a known biological effect, rank all 112,480 JUMP-CP compounds by cosine similarity to that control's phenotypic profile. Select the top 5%. The hypothesis is that compounds producing a similar cellular morphology are likely to share similar biological activity, even if they look nothing alike chemically.
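As a rough illustration of the selection step, the sketch below ranks a matrix of per-compound profiles by cosine similarity to one control and keeps the top fraction. The function name is hypothetical, and excluding the control from its own ranking and rounding the cutoff up are assumptions made for the example. Profiles are assumed to be non-zero vectors.

```python
import numpy as np

def top_phenotypic_neighbors(profiles, control_index, fraction=0.05):
    """Return indices and similarities of the compounds most similar to the control."""
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sims = unit @ unit[control_index]      # cosine similarity to the control
    sims[control_index] = -np.inf          # never select the control itself
    k = max(1, int(np.ceil(fraction * (len(profiles) - 1))))
    order = np.argsort(-sims)[:k]          # most similar first
    return order, sims[order]

# Toy library: row 0 is the positive control (hypothetical values).
library = np.array([[1.0, 0.0], [2.0, 0.1], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
selected, scores = top_phenotypic_neighbors(library, control_index=0, fraction=0.5)
```

With the article's numbers, a 5% cutoff over 112,480 compounds corresponds to roughly 5,600 selected candidates per positive control.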

Testing that hypothesis required real data. The authors assembled 65 high-throughput screening datasets: 16 from the Institut Curie's BioPhoenix platform (all cell-based, with imaging or fluorescence readouts) and 49 from ChEMBL (covering both in vitro and in cellulo systems). They also evaluated 5 targets from the Lit-PCBA benchmark, which is specifically designed to resist false positives. For each screen, they computed the normalized enrichment factor (nEF), defined as the enrichment factor at 5% selection divided by the theoretical maximum enrichment factor. A score of 100% means every active compound was captured in the top 5%.
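The nEF definition can be written out directly. The sketch below assumes the standard enrichment factor (hit rate in the selection divided by hit rate in the whole library) and assumes the theoretical maximum is reached when the selection holds as many actives as it possibly can. The paper's exact formula may differ, and the function name is illustrative.

```python
def normalized_enrichment_factor(selected, actives, library_size):
    """nEF = EF / EF_max for one selection of compound identifiers."""
    selected, actives = set(selected), set(actives)
    base_rate = len(actives) / library_size
    hits = len(selected & actives)
    ef = (hits / len(selected)) / base_rate
    # Best case: the selection is filled with actives, up to the number that exist.
    best_hits = min(len(selected), len(actives))
    ef_max = (best_hits / len(selected)) / base_rate
    return ef / ef_max

# Toy screen: 1,000 compounds, 20 actives, 5% selection containing 10 of them.
actives = range(20)
selection = list(range(10)) + list(range(100, 140))
nef = normalized_enrichment_factor(selection, actives, library_size=1000)
```

Under these assumptions the ratio simplifies to the number of actives found divided by the most that could have been found. A score of 100% therefore means every active was captured only when the actives number no more than the selection size, and a random 5% pick would score about 5% in this toy case.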

Fig. 2. Phenotypic selection boosts hit rates across 65 diverse screens. Green bars show mean normalized enrichment factor (nEF) across all positive controls; blue bars show the maximum nEF when the best positive control is used. Purple bars show the baseline hit rate of the original screen selection. In nearly every case, the phenotypic method substantially outperforms the baseline.

The results are consistent and striking. Under a pessimistic scenario, where the nEF is averaged across all identified hits (including weaker or off-target ones used as positive controls), the phenotypic selection matched or exceeded the original screen's hit rate in most cases. Under a more realistic scenario, where a well-characterized tool compound with a strong phenotype serves as the positive control, the method consistently at least doubled the proportion of active compounds recovered compared to the original selection. The Lit-PCBA results are particularly telling: because that benchmark is designed to penalize methods that capture off-target or non-specific signals, strong performance there confirms that the enrichment reflects genuine target-relevant bioactivity, not a generic stress response or cytotoxicity artifact.

The practical implication is direct. If you are planning a phenotypic screen and you have a single well-characterized positive control, you can use this approach to pre-select a subset of the JUMP-CP library that is substantially enriched for active compounds before you run a single assay.

The Structural Diversity Dividend

One of the more consequential findings concerns what the selected compounds look like chemically. When the authors measured the Tanimoto similarity (a standard metric for comparing molecular fingerprints, where 1.0 means identical and 0.0 means no shared features) between a positive control and its top 5% phenotypic neighbors, the values were consistently low and statistically indistinguishable from the similarity between the positive control and the entire screened library. In other words, phenotypic selection does not preferentially retrieve structurally similar compounds. It retrieves functionally similar ones, wherever they happen to sit in chemical space.
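For readers unfamiliar with the metric, Tanimoto similarity is the size of the intersection of two fingerprints divided by the size of their union. The sketch below represents a fingerprint as a set of on-bit indices, which is an assumption for illustration. The article does not say which fingerprint type the authors used, and the helper names and toy bit sets are hypothetical.

```python
from statistics import mean

def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Convention chosen here: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0

def mean_similarity_to_control(control_bits, fingerprints):
    """Average Tanimoto similarity between the control and a group of compounds."""
    return mean(tanimoto(control_bits, fp) for fp in fingerprints)

# Toy fingerprints (hypothetical bit indices).
control = {1, 2, 3, 4}
phenotypic_neighbors = [{1, 9, 10, 11}, {20, 21, 22}, {4, 30, 31, 32}]
structural_neighbors = [{1, 2, 3, 5}, {1, 2, 3, 4, 6}, {2, 3, 4}]
pheno_mean = mean_similarity_to_control(control, phenotypic_neighbors)
struct_mean = mean_similarity_to_control(control, structural_neighbors)
```

The comparison the article describes amounts to computing this mean for the phenotypic top 5%, for the structural top 5%, and for the whole library, and observing that the first is close to the third.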

Fig. 3. Phenotypic selection yields structural diversity. Panel (a) shows that the Tanimoto similarity between a positive control and its top 5% phenotypic neighbors (green) is indistinguishable from the library average (blue), and far lower than the top 5% structural neighbors (red). UMAP plots in panel (b) confirm that phenotypically selected compounds (blue) occupy different regions of chemical space from the positive control (red).

This matters for reasons beyond scientific curiosity. Structurally diverse hits are more likely to have distinct intellectual property profiles, reducing freedom-to-operate concerns. They may also have different ADME-toxicity characteristics, giving medicinal chemists more starting points for optimization. A structure-based selection, by contrast, tends to cluster hits around the positive control's scaffold, narrowing the chemical series available for follow-up.

The flip side of this finding is equally important. Among the compounds the method identifies as phenotypically similar to a positive control, some are structurally distant. Among compounds that are structurally close to each other, some produce radically different phenotypes. That second observation is the entry point to the paper's most technically rich contribution.

Activity Cliffs: When One Atom Changes Everything

An activity cliff, in the traditional QSAR sense, is a pair of structurally similar compounds with a large difference in potency against a specific target. The concept is well established but notoriously difficult to predict computationally, because the structural features that drive the cliff are often subtle and context-dependent. This paper extends the concept to phenotypic space, asking whether compounds sharing a common chemical scaffold can produce detectably different cellular morphologies.

The analysis grouped all 112,480 JUMP-CP compounds by their Bemis-Murcko scaffold (the molecular core, stripped of peripheral substituents), yielding 532 distinct chemical series with six or more members. Within each series, compounds were split into two phenotypic clusters using hierarchical clustering of their cosine similarity profiles. The key question was whether the intra-cluster phenotypic similarity was substantially higher than the inter-cluster similarity, while the structural similarity remained high in both cases. That pattern, high phenotypic divergence with low structural divergence, is the signature of an activity cliff.
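The per-scaffold test can be sketched as follows, given a matrix of pairwise cosine similarities for one chemical series. Average linkage is an assumption, since the article says only "hierarchical clustering", and the 0.2 gap is the threshold the article quotes for calling a cliff. The sketch covers only the phenotypic half of the criterion; the check that structural similarity stays high across clusters is not included. Function names are illustrative.

```python
from statistics import mean

def two_clusters(sim):
    """Agglomerative average-linkage clustering, stopped at two clusters."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 2:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = mean(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link > best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the most similar pair
        del clusters[b]
    return clusters

def intra_inter(sim, clusters):
    """Mean pairwise similarity within clusters and between the two clusters."""
    intra = [sim[i][j] for c in clusters for i in c for j in c if i < j]
    inter = [sim[i][j] for i in clusters[0] for j in clusters[1]]
    return mean(intra), mean(inter)

def is_phenotypic_cliff(sim, min_gap=0.2):
    """Flag a series whose two phenotypic clusters are coherent but distinct."""
    intra, inter = intra_inter(sim, two_clusters(sim))
    return intra - inter >= min_gap

# Toy series of six compounds forming two phenotypic groups (hypothetical values).
block = [0, 0, 0, 1, 1, 1]
toy_sim = [[1.0 if i == j else (0.9 if block[i] == block[j] else 0.1)
            for j in range(6)] for i in range(6)]
```

The nested-loop clustering is quadratic per merge, which is acceptable for a series of a few dozen compounds but would be replaced by a library routine at larger scale.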

Fig. 4. Systematic identification of activity cliffs. Panel (c) plots intra-cluster versus inter-cluster phenotypic similarity for 532 scaffold groups. The 81 groups highlighted in red sit well below the diagonal, meaning their phenotypic clusters are internally coherent but mutually distinct. Panel (d) shows that for these same 81 groups, structural similarity is high both within and between clusters, confirming that small chemical changes are driving large phenotypic shifts.

Eighty-one scaffold groups, encompassing 2,277 compounds in total, met this criterion. A worked example makes the finding concrete. Within one chemical series, three compounds (C1, C2, C3) produced a strikingly similar phenotype: cells appeared dead or dying, clearly distinct from the DMSO negative control. These three compounds share three specific structural features: a bromine atom, a methyl group, and an sp3 carbon chain linked to a nitrogen atom. The remaining eight compounds in the series (C4 through C11) produced different phenotypes, both from the toxic cluster and from each other. Compound C9, the bare Murcko scaffold with no substituents, produced a phenotype essentially indistinguishable from DMSO. Adding different chemical groups to that scaffold produced subtle but measurable phenotypic shifts, each distinguishable by the profiling pipeline.

Fig. 5. A worked example of a phenotypic activity cliff. The heatmap (a) shows a clear split between three toxic compounds (C1-C3, top-left block) and the rest of the series. Panel (b) highlights the shared structural features of the toxic cluster (bromine, methyl group, aliphatic chain, shown in green). Panel (c) shows the non-toxic compounds, where different substitutions on the protonated nitrogen (purple highlights) produce distinct sub-phenotypes.

Activity cliffs at scale
81 scaffold groups (2,277 compounds) show high intra-cluster phenotypic similarity with low inter-cluster similarity, while structural similarity remains high across clusters. The threshold for calling a cliff: intra-cluster cosine similarity exceeds inter-cluster by at least 0.2, corresponding to more than 3 standard deviations from the dataset mean.

The significance of this finding for medicinal chemistry is direct. QSAR models, by design, predict activity from structure. When a small structural change produces a large biological effect, those models fail, because they have no way to represent the discontinuity. Phenotypic profiling captures the discontinuity directly, in the morphology of the cell. The authors show that these cliffs encode atom-level determinants of bioactivity that structure-based models cannot access. That is not a modest claim, and the data across 81 scaffold groups support it.

Pathways, Convergence, and Opposition

The final major finding moves from individual compounds to biological systems. The question is whether compounds targeting different proteins within the same signaling pathway tend to produce related phenotypes. The intuition is reasonable: if two compounds both perturb the same cellular process, even through different molecular mechanisms, the downstream morphological consequences might be similar. But the reality turns out to be more complex, and more interesting.

The authors cross-referenced 30 biological pathways from the MSigDB Hallmarks gene set collection with compound-target interaction data from BindingDB, retaining only pathways where at least 10 distinct genes were targeted by at least 50 JUMP-CP compounds. For each pathway, they computed pairwise cosine similarities among all pathway-targeting compounds and compared the distribution to pairwise similarities among randomly selected non-targeting compounds.

Fig. 6. Functional convergence in the EGFR pathway. Panel (a) shows the UMAP of phenotypic profiles from a screen for EGFR pathway inhibitors. The positive control (Dezmapimod, red) and phenotypically selected compounds (blue) are shown. Panel (c) maps the targets of three selected compounds onto the EGFR signaling network: AG1478 inhibits EGFR, Sb-202190 inhibits BRD4, and SP600125 inhibits MAPK8. All three produce similar phenotypes despite acting at different nodes.

The result, consistent across all 30 pathways with statistical significance at p < 0.00001 by Fisher's exact test, is that pathway-targeting compound pairs are substantially more likely to show extreme phenotypic similarity values, both very high and very low, than random pairs. The distribution is bimodal. Compounds in the same pathway are not just more similar to each other; they are also more likely to be opposite to each other.
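One way to frame such a test is as a 2x2 contingency table: pathway pairs versus random pairs, and extreme versus non-extreme similarity. The sketch below is an assumption-laden illustration. The cutoff of 0.5 for "extreme", the one-sided alternative, and the function names are all chosen here, and the article does not specify how the authors set up their table.

```python
from math import comb

def count_extreme(similarities, cutoff=0.5):
    """Split pairwise similarities into (extreme, non-extreme) counts by |value|."""
    extreme = sum(1 for s in similarities if abs(s) >= cutoff)
    return extreme, len(similarities) - extreme

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher p-value for [[a, b], [c, d]]: P(X >= a), hypergeometric."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / total

# Toy similarities (hypothetical values): pathway pairs pile up at both extremes.
pathway_pairs = [0.8, 0.7, -0.6, -0.7, 0.9, 0.1, -0.8, 0.75]
random_pairs = [0.1, -0.2, 0.05, 0.3, -0.1, 0.6, 0.0, -0.15]
a, b = count_extreme(pathway_pairs)
c, d = count_extreme(random_pairs)
p_value = fisher_exact_greater(a, b, c, d)
```

Counting by absolute value is what lets a single test capture both tails at once, which matches the article's point that same-pathway compounds are enriched for both highly similar and strongly opposed profiles.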

The G2M checkpoint pathway illustrates why. Reversine, which inhibits AURKB (itself an inhibitor of BIRC5), produces a pro-proliferative outcome. NSC-625987, which inhibits CDK4 directly, blocks proliferation. Both compounds target the G2M pathway. Their cosine similarity is -0.51, meaning their phenotypic profiles point in largely opposite directions (an angle of roughly 120 degrees, where -1 would be an exact mirror image). The network logic explains the observation: one compound removes a brake on proliferation, the other applies one. Same pathway, opposite effects, opposing morphologies.

Fig. 7. Pathway membership predicts phenotypic relationships. Panel (a) shows the bimodal distribution of phenotypic similarities among G2M pathway-targeting compounds (green) versus random pairs (red). Panel (c) quantifies the relationships among four specific compounds: Reversine and NSC-625987 have a cosine similarity of -0.51, reflecting their opposing roles in the network shown in panel (d).

The EGFR pathway example shows the convergent side of the same phenomenon. Using Dezmapimod, a non-specific MAPK inhibitor, as a positive control in a screen for EGFR pathway inhibitors, the phenotypic selection method retrieved three compounds with known activity in the pathway: AG1478 (targeting EGFR), Sb-202190 (targeting BRD4), and SP600125 (targeting MAPK8). These three compounds are structurally unrelated. They act at different nodes of the pathway. Yet they produce phenotypes similar enough to Dezmapimod that the cosine similarity ranking surfaced them from a pool of 112,480 candidates. This is functional convergence made visible in cellular morphology.

What the Data Cannot Yet Tell Us

The single-cell-line constraint deserves more attention than it receives in the paper. All JUMP-CP data comes from U2OS osteosarcoma cells. Compounds targeting genes not expressed or not functionally relevant in U2OS will produce flat phenotypic profiles, and the method will miss them entirely. The authors acknowledge this, but the practical scope of the blind spot is worth quantifying. A recent companion study mapped which pathways and gene families are detectable in U2OS morphology, and the answer is not all of them. For any drug discovery project targeting a tissue-specific pathway, the U2OS limitation is not a footnote; it is a design constraint that should drive the decision of whether to use this approach at all.

The restriction to 112,480 compounds is a related concern. That number sounds large, but it represents a vanishingly small fraction of drug-like chemical space, estimated at roughly 10^33 compounds. The method can only retrieve compounds already in the JUMP-CP library. A cross-modal model that predicts phenotypic profiles from chemical structure could extend the approach to novel compounds, and the authors note they have trained such a model based on the CLIP architecture, but it has not yet been evaluated as a compound selection tool. That gap matters for practical deployment.

One methodological question worth raising: the choice of DINOv2-Giant as the optimal encoder is well-supported by the benchmark data, but the model has 1.3 billion parameters and was trained on natural images. The computational cost of running inference across 112,480 compounds is non-trivial, and the paper does not report inference time or hardware requirements. For a laboratory considering whether to build this pipeline internally, that information would be useful. The Phenoseeker web tool sidesteps the issue for retrieval tasks, but researchers who want to profile novel compounds and compare them to the JUMP-CP library will need to run the encoder themselves.

A New Lens on Chemical Space

The cumulative weight of this work is that phenotypic profiling, applied at the scale of 112,480 compounds across 12 laboratories, is not merely a complement to structure-based drug discovery. It accesses information that structure-based methods cannot, by definition, reach. Activity cliffs are invisible to QSAR models because those models predict from structure. Functional convergence across structurally unrelated compounds is invisible to similarity-based selection because those methods never look beyond the chemical scaffold. Cellular morphology captures both, simultaneously, without requiring any knowledge of the target.

The Phenoseeker web tool (phenoseeker.bio.ens.psl.eu) makes the core capability accessible without requiring any local infrastructure. A researcher with a positive control compound can retrieve the top phenotypic neighbors from the full JUMP-CP library in a single query. For academic groups running phenotypic screens with limited compound libraries, that is a meaningful practical resource.

"Phenotypic profiling emerges as a powerful strategy for prioritizing compounds, illuminating activity cliffs, and accelerating the identification of therapeutically relevant candidates across diverse biological contexts."
Sanchez et al., Communications Biology, 2026

The longer-term trajectory points toward generative models conditioned on target phenotypic profiles, systems that could design new chemical structures predicted to produce a desired cellular morphology. The activity cliff data generated here would be a challenging training set for such a model, precisely because it encodes the discontinuities that make structure-to-phenotype prediction hard. That difficulty is also, from a scientific standpoint, the most interesting thing about it.