Seeing the Cell, Not the Molecule: How Microscopy Images Are Reshaping Drug Discovery

The dominant logic of pre-screen compound selection is structural: find molecules that look like your positive control, and you are more likely to find molecules that act like it. This logic is intuitive, computationally tractable, and wrong often enough to matter. A research team spanning Institut Curie and the École Normale Supérieure has now built a rigorous case for a different approach, one that asks not what a compound looks like, but what it does to a living cell.

Their method, validated across 65 high-throughput assays covering both cell-based and biochemical readouts, consistently at least doubles the hit rate of conventional selection strategies when anchored on a well-chosen positive control. More than a screening tool, it functions as a lens onto the deep structure of chemical biology, revealing that structurally unrelated compounds can converge on the same cellular outcome, and that nearly identical molecules can produce radically different ones.

The Limits of Chemical Intuition

Phenotypic drug discovery, the practice of selecting compounds based on observable cellular effects rather than predicted binding to a single protein target, has a long and productive history. Growing evidence suggests it is more successful than target-based approaches at delivering first-in-class drugs, partly because it remains agnostic about mechanism. A compound that fixes a broken cellular process is valuable whether or not you know exactly how it does so.

The Cell Painting assay, introduced in 2013 and standardized in 2016, operationalizes this idea at scale. Cells are stained with fluorescent dyes that label nuclei, endoplasmic reticulum, nucleoli, actin, Golgi apparatus, plasma membrane, and mitochondria. High-content imaging then captures the morphological fingerprint of each compound's effect. The resulting images encode a holistic view of cell state, one that reflects functional outcomes regardless of the structural identity of the perturbation that caused them.

The JUMP-CP dataset, assembled by a consortium of 12 laboratories, brought this approach to an unprecedented scale: over 136,000 chemical and genetic perturbations, more than 5 million images, and 112,480 unique compounds. The dataset is publicly available. It is also, as the authors discovered, extremely difficult to use in its raw form.

What is Cell Painting? A multiplexed fluorescence imaging assay that uses six fluorescent dyes to stain the major cellular compartments simultaneously, producing a high-dimensional morphological fingerprint for each compound tested. The JUMP-CP dataset contains over 5 million such images from 112,480 compounds, acquired across 12 laboratories worldwide.

Taming 112,480 Compounds Across Twelve Laboratories

When data arrives from 12 different labs, acquired on different microscopes and processed in different batches, the non-biological variation can swamp the biological signal. This is the batch effect problem, and it is not trivial. Before any compound comparison is meaningful, you need to know that a difference in phenotypic profile reflects a difference in biology, not a difference in which microscope was used on which Tuesday.

The team ran a systematic sweep of more than 5,000 combinations of feature extraction methods, normalization strategies, and replicate aggregation approaches. Two metrics guided the search: mean average precision (mAP) for retrieving replicates of the same compound across different plates (biological signal preservation), and mAP for retrieving wells from the same experimental plate (batch effect, where a low score is desirable).
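Both metrics are the same computation run with different labels. A minimal sketch, assuming wells are ranked by cosine similarity to each query well and ignoring the cross-plate restriction for brevity; the function name and the toy profiles are illustrative, not the authors' code:

```python
import numpy as np

def mean_average_precision(profiles, labels):
    """Mean AP of retrieving same-label rows, ranking by cosine similarity."""
    profiles = np.asarray(profiles, dtype=float)
    labels = np.asarray(labels)
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sim = unit @ unit.T
    aps = []
    for i in range(len(labels)):
        others = np.arange(len(labels)) != i           # exclude the query itself
        order = np.argsort(-sim[i, others], kind="stable")
        relevant = (labels[others] == labels[i])[order]
        if not relevant.any():
            continue                                   # nothing to retrieve
        hits = np.cumsum(relevant)
        ranks = np.arange(1, relevant.size + 1)
        aps.append((hits[relevant] / ranks[relevant]).mean())
    return float(np.mean(aps))

# Toy data: two compounds, each measured once on two plates.
profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
compound = ["A", "A", "B", "B"]    # replicate identity: high mAP is good
plate = ["p1", "p2", "p1", "p2"]   # plate identity: low mAP is good

signal_map = mean_average_precision(profiles, compound)
batch_map = mean_average_precision(profiles, plate)
print(f"signal mAP = {signal_map:.2f}, batch mAP = {batch_map:.2f}")
```

A pipeline that scores high on the first number and low on the second has kept the biology and shed the plate.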

Optimal Pipeline: Feature extraction with DINOv2-Giant (a 1.3-billion-parameter self-supervised vision transformer pretrained on natural images), followed by plate-wise whitening fitted to DMSO negative control wells, then an inverse normal transform applied feature-wise. This combination outperformed domain-specific encoders including OpenPhenom and ChAda-ViT, as well as handcrafted CellProfiler features, across both evaluation metrics.
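The two normalization steps can be sketched in a few lines, starting from embeddings that have already been extracted. This is a rough illustration under stated assumptions, not the authors' implementation: the whitening here is ZCA-style with a small `eps` regularizer, the inverse normal transform is rank-based with a 0.5 offset and is applied across all wells after whitening, and every function name is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def whiten_plate(features, is_dmso, eps=1e-6):
    """Whiten one plate using the mean and covariance of its DMSO wells."""
    controls = features[is_dmso]
    mean = controls.mean(axis=0)
    cov = np.cov(controls, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA-style whitening matrix; eps guards against near-zero eigenvalues.
    scale = 1.0 / np.sqrt(np.maximum(eigvals, 0.0) + eps)
    w = eigvecs @ np.diag(scale) @ eigvecs.T
    return (features - mean) @ w

def inverse_normal_transform(features):
    """Replace each feature column by normal scores computed from its ranks."""
    n = features.shape[0]
    ranks = features.argsort(axis=0).argsort(axis=0) + 1   # 1..n per column
    quantiles = (ranks - 0.5) / n                          # strictly inside (0, 1)
    to_z = np.vectorize(NormalDist().inv_cdf)
    return to_z(quantiles)

def normalize(features, plate_ids, is_dmso):
    """Whiten plate by plate, then apply the feature-wise transform."""
    features = np.asarray(features, dtype=float)
    out = np.empty_like(features)
    for plate in np.unique(plate_ids):
        rows = plate_ids == plate
        out[rows] = whiten_plate(features[rows], is_dmso[rows])
    return inverse_normal_transform(out)
```

Fitting the whitening on DMSO wells only means each plate is rescaled so that "no effect" looks the same everywhere, which is what lets profiles from different labs be compared at all.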

The winning combination was not the obvious one. DINOv2-Giant, a model trained on natural photographs and never exposed to microscopy data during pretraining, outperformed encoders specifically designed and trained on cell images. This is a finding worth sitting with. The model's 1.3 billion parameters appear to have learned representations general enough to transfer to a domain it was never designed for, and to do so better than purpose-built alternatives.

After normalization, UMAP projections of control wells from 9 laboratories show a clear transformation: before processing, wells cluster tightly by laboratory of origin, a textbook batch effect. After the whitening and inverse normal transform pipeline, those clusters dissolve, while positive control replicates from different labs converge. The biological signal survives; the technical noise does not.

Fig. 1. Building robust phenotypic profiles from JUMP-CP. Panel (b) shows the results of sweeping more than 5,000 pipeline combinations, plotting biological signal preservation (x-axis) against batch effect removal (y-axis). The selected pipeline using DINOv2-Giant sits in the optimal bottom-right quadrant. Panel (d) shows UMAP projections of control wells before and after normalization: laboratory-specific clustering dissolves, while positive control replicates from different labs converge.

Sixty-Five Screens, One Consistent Answer

With a robust phenotypic representation in hand, the selection method itself is conceptually simple. Given a positive control compound with a known desired bioactivity, rank all 112,480 JUMP-CP compounds by cosine similarity to that control's phenotypic profile. Select the top 5%. Screen those.
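The selection step is short enough to write out. A minimal sketch, assuming one normalized profile vector per compound; the function name is hypothetical, and excluding the control from its own ranking is an assumption rather than something the description above specifies:

```python
import numpy as np

def select_by_phenotype(profiles, control_index, fraction=0.05):
    """Return indices and similarities of the compounds closest to the control."""
    profiles = np.asarray(profiles, dtype=float)
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sims = unit @ unit[control_index]          # cosine similarity to the control
    sims[control_index] = -np.inf              # never select the control itself
    n_select = max(1, int(round(fraction * len(profiles))))
    top = np.argsort(-sims)[:n_select]         # most similar first
    return top, sims[top]
```

Called on the full library with `fraction=0.05`, this returns the shortlist that would go forward to screening.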

To test whether this works, the team applied it retrospectively to 65 existing screens, 49 from ChEMBL and 16 from the Institut Curie BioPhoenix platform, plus 5 targets from the Lit-PCBA benchmark, which is specifically designed to assess target-dependent effects rather than promiscuous or off-target activity. For each screen, they computed the normalized enrichment factor (nEF), defined as the enrichment factor achieved divided by the theoretical maximum possible enrichment factor. A score of 100% means the selection is as enriched as it could possibly be: every active compound was selected or, where actives outnumber the selection, every selected compound was active.
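A worked example helps pin the nEF down. The sketch below assumes the standard definition of the enrichment factor (hit rate in the selection divided by hit rate in the whole library); the function name and the numbers are made up for illustration.

```python
def normalized_enrichment_factor(is_active, is_selected):
    """Return (EF, nEF) for boolean activity and selection flags."""
    n_total = len(is_active)
    n_selected = sum(is_selected)
    hits_total = sum(is_active)
    hits_selected = sum(a and s for a, s in zip(is_active, is_selected))
    if n_selected == 0 or hits_total == 0:
        raise ValueError("need at least one selected compound and one active")
    base_rate = hits_total / n_total
    ef = (hits_selected / n_selected) / base_rate
    # Best case: the selection holds as many actives as it possibly can.
    ef_max = (min(hits_total, n_selected) / n_selected) / base_rate
    return ef, ef / ef_max

# Hypothetical screen: 1,000 compounds, 20 actives, 50 selected, 10 of them active.
is_active = [i < 20 for i in range(1000)]
is_selected = [i < 10 or 20 <= i < 60 for i in range(1000)]
ef, nef = normalized_enrichment_factor(is_active, is_selected)
print(f"EF = {ef:.1f}, nEF = {nef:.0%}")
```

In this toy case the selection is ten times richer in actives than the library, which is half of the best achievable enrichment.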

Two scenarios were evaluated. In the pessimistic scenario, the nEF was averaged across all known hits used as positive controls, including weaker or less specific ones. In the optimistic scenario, the best-performing positive control was used, reflecting a realistic experimental setup where a well-characterized tool compound anchors the selection. Under the pessimistic scenario, phenotypic selection matched or exceeded the hit rate of the original screen in most cases. Under the optimistic scenario, the method consistently recovered at least twice the proportion of active compounds found by the original screen, and in several cases approached the theoretical maximum.

The Lit-PCBA results are particularly telling. That benchmark was constructed specifically to be resistant to false positives and off-target enrichment. Strong performance there confirms that the method is not simply capturing compounds with broad cytotoxic activity or other non-specific effects, a genuine risk in phenotypic approaches.

Fig. 2. Phenotypic selection boosts hit rates across 65 screens. Green bars show mean normalized enrichment factor (nEF) across all positive controls; blue bars show the maximum nEF achieved with the best positive control. Purple bars show the baseline hit rate of the original screen. In nearly every case across ChEMBL (49 screens), Lit-PCBA (5 targets), and Institut Curie (16 screens), the phenotypic method substantially outperforms the original selection.

Finding What Structure-Based Methods Cannot See

The enrichment results alone would justify the method. What makes this paper more interesting is what the large-scale analysis reveals about the relationship between chemical structure and biological function.

When the team measured the structural similarity (using Tanimoto similarity on Morgan fingerprints, a standard cheminformatics measure) between positive controls and their phenotypically selected neighbors, the values were consistently low. Across 16 Institut Curie screens, 49 ChEMBL screens, and a random sample of 10,000 JUMP-CP compounds used as queries, the structural similarity between phenotypic neighbors was statistically indistinguishable from the structural similarity between a compound and the library at large. Phenotypic selection does not preferentially retrieve structural analogs. It retrieves functional analogs, which are a different and often more valuable thing.
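For readers unfamiliar with the measure, Tanimoto similarity is simply shared bits over total bits. The sketch below uses hand-written sets of "on" bit positions as stand-ins for fingerprints; in practice Morgan fingerprints would come from a cheminformatics toolkit, and all the bit sets and names here are invented for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def mean_similarity(query_bits, others):
    """Average Tanimoto similarity of one fingerprint to a list of others."""
    return sum(tanimoto(query_bits, o) for o in others) / len(others)

# Toy fingerprints (hypothetical): a control, its phenotypic neighbors,
# and two close structural analogs.
control = {1, 4, 9, 16, 25}
phenotypic_neighbors = [{2, 4, 30, 41}, {7, 9, 52}, {3, 11, 60, 71}]
structural_analogs = [{1, 4, 9, 16, 26}, {1, 4, 9, 25}]

print(mean_similarity(control, phenotypic_neighbors))
print(mean_similarity(control, structural_analogs))
```

The pattern reported above corresponds to the first number sitting near the library-wide average while only the second is high.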

This matters for drug discovery in practical terms. Structurally diverse hits are more likely to have distinct intellectual property profiles, distinct ADME-toxicity characteristics, and distinct opportunities for optimization. A screening campaign that returns only close analogs of the positive control has, in a real sense, told you nothing you did not already know.

Fig. 3. Phenotypic selection recovers structurally diverse compounds. Panel (a) shows that the structural similarity between phenotypically selected compounds (green) and the positive control is no higher than the library average (blue), and far lower than structurally selected compounds (red). Panel (b) visualizes this in UMAP space: compounds close to the positive control in phenotypic space (blue) are scattered across chemical space. Panels (c) and (d) show specific examples of structurally diverse compounds with similar phenotypes, and structurally similar compounds with divergent phenotypes.

Why chemical diversity matters: Structurally diverse hits are more likely to have distinct intellectual property profiles and distinct ADME-toxicity characteristics. A campaign that returns only close analogs of the positive control has, in a real sense, told you nothing new.

Activity Cliffs: When One Atom Changes Everything

The observation that structurally similar compounds can produce very different phenotypes is not merely a curiosity. It points to a systematic phenomenon that the large scale of JUMP-CP makes tractable to study for the first time.

The team grouped all JUMP-CP compounds by their Bemis-Murcko scaffold, the molecular core that defines a chemical series in medicinal chemistry. This produced 532 distinct series with six or more members. Within each series, compounds were split into phenotypic subclusters using hierarchical clustering, and the intra-cluster versus inter-cluster phenotypic similarity was compared against the corresponding structural similarity.

Eighty-one series, comprising 2,277 compounds in total, showed substantially higher phenotypic similarity within clusters than between them, while their structural similarity was essentially the same within and between clusters. These are activity cliffs: cases where minor chemical changes produce profound phenotypic shifts. The threshold used was a cosine similarity difference of at least 0.2, corresponding to more than three standard deviations from the mean pairwise similarity across the entire JUMP-CP dataset.
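The cliff criterion reduces to comparing two averages per chemical series. A minimal sketch, assuming the phenotypic subclusters have already been assigned (the hierarchical clustering step is not shown) and that a pairwise structural similarity matrix is available; the helper names and toy values are hypothetical.

```python
import numpy as np

PHENOTYPIC_GAP_THRESHOLD = 0.2   # cosine-similarity difference quoted in the text

def cosine_matrix(profiles):
    """Pairwise cosine similarity between row vectors."""
    profiles = np.asarray(profiles, dtype=float)
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    return unit @ unit.T

def intra_inter_gap(sim, labels):
    """Mean within-cluster similarity minus mean between-cluster similarity."""
    sim = np.asarray(sim, dtype=float)
    labels = np.asarray(labels)
    upper = np.triu(np.ones(sim.shape, dtype=bool), k=1)   # each pair once
    same = labels[:, None] == labels[None, :]
    intra, inter = sim[upper & same], sim[upper & ~same]
    if intra.size == 0 or inter.size == 0:
        raise ValueError("need at least one intra- and one inter-cluster pair")
    return float(intra.mean() - inter.mean())

# Toy series of six compounds in two phenotypic subclusters.
profiles = [[1, 0.1], [1, -0.1], [1, 0], [-1, 0.1], [-1, 0], [-1, -0.1]]
clusters = [0, 0, 0, 1, 1, 1]
structural_sim = np.full((6, 6), 0.8)   # uniformly similar structures
np.fill_diagonal(structural_sim, 1.0)

phenotypic_gap = intra_inter_gap(cosine_matrix(profiles), clusters)
structural_gap = intra_inter_gap(structural_sim, clusters)
is_cliff_candidate = phenotypic_gap >= PHENOTYPIC_GAP_THRESHOLD
print(phenotypic_gap, structural_gap, is_cliff_candidate)
```

A large phenotypic gap alongside a structural gap near zero is the signature described above: the compounds look alike on paper and behave differently in cells.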

A worked example makes the phenomenon concrete. Within one chemical series, three compounds (C1, C2, C3) kill cells. They share three specific structural features: a bromine atom, a methyl group, and an sp3 carbon chain linked to a nitrogen atom. Eight other compounds in the same series, differing in the protonation state of that nitrogen and in various substituents, produce a range of non-toxic phenotypes, some subtly distinct from the DMSO negative control, some indistinguishable from it. The scaffold itself, stripped of all substituents, produces no detectable phenotypic effect. The bioactivity lives in the decoration, not the core.

This is precisely the kind of information that quantitative structure-activity relationship (QSAR) models struggle to capture. Activity cliffs represent discontinuities in the structure-activity landscape, points where the smooth interpolation that machine learning models assume breaks down. Phenotypic profiling does not interpolate. It measures.

Fig. 4. Systematic identification of activity cliffs across JUMP-CP. Panel (c) shows intra- versus inter-cluster phenotypic similarity for 532 scaffold-based compound groups. The 81 groups highlighted in red show substantially higher intra-cluster than inter-cluster phenotypic similarity. Panel (d) shows that for these same 81 groups, structural similarity is high both within and between clusters, confirming that small structural changes are driving large phenotypic shifts.

Fig. 5. A detailed activity cliff example. The heatmap (a) shows a clear split into two phenotypic clusters within a single chemical series. The three toxic compounds in cluster 1 (b) share a bromine atom, methyl group, and aliphatic chain (highlighted green). The non-toxic compounds in cluster 2 (c) differ in nitrogen protonation state and substituents (purple highlights), producing a range of distinct, non-lethal phenotypes.

Pathways Written in Phenotype

Beyond individual compound pairs, the dataset is large enough to ask a more ambitious question: do compounds that target the same biological pathway produce related phenotypes, even when they act on different proteins within that pathway?

To test this, the team cross-referenced 30 biological pathways from the MSigDB Hallmarks gene sets against compound-target interaction data from BindingDB, retaining only pathways where at least 10 distinct genes were targeted by compounds present in JUMP-CP. For each pathway, they computed pairwise cosine similarities between all compounds known to target pathway genes, then compared that distribution against pairwise similarities between randomly selected non-targeting compounds.

The result was consistent across all 30 pathways, with statistical significance at p < 0.00001 in every case. Compounds targeting the same pathway show a bimodal distribution of phenotypic similarity: they are more likely than random pairs to be either highly similar (cosine similarity above 0.2, more than three standard deviations above the dataset mean) or highly dissimilar (cosine similarity below -0.2, more than three standard deviations below the mean). Random compound pairs cluster near zero.
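The bimodality can be summarized as the share of compound pairs falling in each tail. The sketch below uses synthetic profiles, not JUMP-CP data: a made-up "pathway" set in which half the compounds push a shared direction one way and half the other, compared against random vectors. The significance test used in the study is not reproduced here.

```python
import numpy as np

def pairwise_cosine(profiles):
    """Cosine similarity for every unordered pair of rows."""
    profiles = np.asarray(profiles, dtype=float)
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sim = unit @ unit.T
    return sim[np.triu_indices(len(profiles), k=1)]

def tail_fractions(sims, threshold=0.2):
    """Fraction of pairs above +threshold and below -threshold."""
    return float((sims > threshold).mean()), float((sims < -threshold).mean())

rng = np.random.default_rng(2)
dim = 256
direction = rng.normal(size=dim)
signs = np.repeat([1.0, -1.0], 15)     # 15 compounds each way (synthetic)
pathway = signs[:, None] * direction + 0.5 * rng.normal(size=(30, dim))
random_set = rng.normal(size=(30, dim))

pathway_tails = tail_fractions(pairwise_cosine(pathway))
random_tails = tail_fractions(pairwise_cosine(random_set))
print("pathway (similar, opposite):", pathway_tails)
print("random  (similar, opposite):", random_tails)
```

For the synthetic pathway set nearly every pair lands in one tail or the other, while the random set stays close to zero, which is the shape the violin plots in the study describe.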

The EGFR pathway example illustrates the convergent case. Using Dezmapimod, a non-specific MAPK inhibitor, as a positive control for a screen of EGFR pathway inhibitors, the top 5% of phenotypically similar JUMP-CP compounds included AG1478 (an EGFR inhibitor), SB-202190 (a BRD4 modulator), and SP600125 (a MAPK8 inhibitor). Three structurally unrelated compounds, acting on three different proteins, producing a similar enough cellular phenotype to cluster together. The pathway is the organizing principle, not the target.

The G2M checkpoint pathway illustrates the divergent case. Reversine, which inhibits AURKB (itself an inhibitor of the pro-survival protein BIRC5), produces a pro-proliferative outcome. NSC-625987, which inhibits CDK4 directly, blocks proliferation. Both target the G2M pathway. Their phenotypic profiles have a cosine similarity of -0.51, nearly opposite. The network logic explains the phenotypic logic: one compound removes a brake on proliferation, the other applies one.

Fig. 7. Pathway membership predicts phenotypic relationships. Panel (a) shows the bimodal distribution of phenotypic similarity for 530 compounds targeting G2M pathway genes (green violin) compared to random compound pairs (red violin). Compounds targeting the same pathway are far more likely to produce either very similar or very opposite phenotypes. Panel (c) quantifies this for four specific G2M-targeting compounds: Reversine and NSC-625987 have a cosine similarity of -0.51, reflecting their opposing roles in the pathway network shown in panel (d).

What the Method Cannot Do

The limitations here are real and worth naming directly. The entire analysis rests on U2OS cells, a human osteosarcoma line. Compounds targeting genes not expressed in U2OS will not produce informative phenotypic profiles, and the selection will miss them. This is not a flaw in the method so much as a constraint of the dataset, but it means that the approach is not universally applicable without additional cell line coverage.

The compound library is also bounded. 112,480 compounds is a large number by academic standards, but it represents a small fraction of drug-like chemical space, estimated at roughly 10^33 compounds. The method can only select from what is in JUMP-CP. A cross-modal model that predicts phenotypic profiles from chemical structure could extend the reach, and the authors note they have trained such a model, though it has not yet been evaluated as a selection tool.

The batch effect correction, while clearly effective by the metrics used, was validated on control wells. Whether the normalization preserves subtle biological signals in sample wells with the same fidelity is a harder question to answer, and one the paper does not fully resolve. Having worked through batch correction challenges in large-scale flow cytometry cohorts, I would want to see the normalization stress-tested on compounds with known weak phenotypes before trusting it for edge cases.

A Tool for the Community

The practical output of this work is Phenoseeker, a web tool available at phenoseeker.bio.ens.psl.eu. A researcher with a positive control compound can submit its name and retrieve, in a single query, a ranked list of JUMP-CP compounds most likely to share its bioactivity. No machine learning training required. No database of known actives required. One compound in, a prioritized library out.

The deeper contribution is conceptual. Phenotypic profiling has long been positioned as a complement to structure-based approaches, useful when you do not know your target, less precise when you do. This work repositions it as a primary framework for navigating chemical space, one that captures functional relationships that structural similarity cannot. The activity cliffs alone, 2,277 compounds across 81 chemical series where minor structural changes produce profound phenotypic shifts, represent a resource for understanding the atomic determinants of bioactivity that QSAR models are structurally unable to access.

The field has spent decades building ever more sophisticated models of the structure-activity relationship. This paper is a reminder that the activity is ultimately a cellular phenomenon, and that measuring it directly, at scale, in living cells, may be the most reliable guide we have.