NeurIPS 2023 Roundup: Drug-Target Interaction (DTI) Prediction
Data on drugs, drug targets, genetic regulatory networks, phenotypes, pathways, and more are proliferating. Pooling everything together to power DTI prediction tools is becoming more common.
Intro
Welcome to Part Three of Dimension’s NeurIPS recap series. I hope you’ll enjoy reading about a few papers on Drug-Target Interaction (DTI) prediction networks. If you missed them, you can also find the other two posts on our blog.
Part Three—Drug-Target Interaction (DTI) Prediction
Overview
Drug development is extremely difficult, and drugs fail for many reasons, including toxicity and lack of efficacy. Drug repurposing is a burgeoning technique for sidestepping toxicity failures: take already FDA-approved drugs that we know are safe and redirect them at diseases without treatments. Any efficacy in these scenarios is a net benefit.
More than 30% of FDA-approved drugs gain at least one more disease indication following approval. Some have as many as ten indications. Drug repurposing isn’t new, yet it’s benefitting from fundamental advances in drug-target interaction (DTI) prediction networks—specialized models based on graph neural networks (GNNs). Here are some of my main takeaways from the talks:
Inductive models are apt for DTI prediction. Most diseases don’t have therapies or well-known disease mechanisms. Inductive DTI prediction models are trained to generalize to these cases, which mirrors how they’d be used in the real world.
Train-test splitting and subsampling are complicated for DTI graphs. Preventing training data from leaking into the test set is difficult for graphs as splits can warp the overall graph structure. Meanwhile, there is a huge mismatch between positive (interaction) and negative (no interaction) data in medical knowledge graphs. Evening these up requires subsampling the negative data, but doing so isn’t straightforward.
Usability determines use. Though off-label prescriptions are increasingly common, clinicians don’t automatically trust AI black boxes to tell them which drugs to prescribe for which patients. Building intuitive and explainable infrastructure around a model is critical.
Let’s get to it.
Towards a More Inductive World
Jesus de la Fuente’s presentation on the hidden pitfalls of DTI graph neural networks (GNNs) was the first I heard on this subject at NeurIPS 2023. I appreciated this knowledge in my back pocket when grokking the other DTI network talks later that day. My biggest take-home messages revolved around the nuances of train-test splitting for graph data and how incorporating molecular information into the negative subsampling process could improve model performance.
The opening assertion was that comparing GNNs for DTI prediction is challenging given the diversity of model architectures. These models generally fall into two categories: transductive and inductive GNNs. De la Fuente made a compelling case for the latter, though I’ll explain how these categories differ before parsing his team’s rationale for inductive architectures.
Transductive GNNs are aware of the entire graph’s structure during training—all the nodes and edges are represented. Some labels are hidden during training. These form the model’s test set.
Inductive GNNs are trained on a subset of the graph. These models focus on learning a function that helps them generalize to unseen nodes or entire graphs.
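To make the distinction concrete, here’s a toy sketch in Python (invented drug and protein names, no real data) of how the two setups would split the same DTI link-prediction task:

```python
# Toy drug-target interaction edges (invented names for illustration).
edges = [("drugA", "prot1"), ("drugA", "prot2"),
         ("drugB", "prot1"), ("drugC", "prot2")]

# Transductive split: the model sees the whole graph (all nodes) during
# training; we only mask the labels of some edges and ask it to fill them in.
all_nodes = {n for e in edges for n in e}  # every node stays visible
train_edges, test_edges = edges[:3], edges[3:]
# -> drugC is still a node in the training graph; only its edge label is hidden.

# Inductive split: whole nodes are held out, so test edges touch
# drugs/targets the model has never seen. It must generalize from node
# features (e.g., SMILES strings or protein sequences) instead.
held_out = {"drugC"}
train_edges = [e for e in edges if not set(e) & held_out]
test_edges = [e for e in edges if set(e) & held_out]
```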
Intuitively, inductive models seem apt for DTI prediction because the universe of drugs and drug targets is inexorably expanding. We want our models to learn generalizable rules of why certain proteins and ligands interact to produce beneficial (or detrimental) physiological effects.
In their paper, the authors suggest that transductive DTI prediction models may overstate their performance due to issues with data leakage from improper train-test splitting. I mentioned in my previous NeurIPS blog how models trained on data closely resembling validation data are prone to memorization—exaggerating how they’ll perform on novel prediction tasks.
How exactly do you ‘split’ training and testing data when it’s arranged in a highly interconnected graph structure? It turns out this process is both science and art.
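One quick sanity check, my own illustration rather than anything from the talk: measure what fraction of your test edges connect nodes the model already saw during training. On a dense interaction graph, a naive random edge split tends to score alarmingly high here.

```python
def leakage_fraction(train_edges, test_edges):
    """Fraction of test edges whose endpoints BOTH appear in training edges.
    High values suggest the 'test' task is mostly pattern-completion on
    nodes the model has already memorized."""
    seen = {n for e in train_edges for n in e}
    leaky = sum(1 for u, v in test_edges if u in seen and v in seen)
    return leaky / max(len(test_edges), 1)
```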
De la Fuente referenced a paper on hierarchical graph representation learning using differential pooling that outlines an advanced technique for graph-aware training. It’s elegant and fascinating, so I’ll try not to butcher it. Essentially, you start with the input graph—one that’s high resolution and filled with all nodes and edges. You run a GNN to obtain learned embeddings and then cluster those together into a new, pooled graph. Each new layer is slightly more coarse-grained than the one below it, like a multi-tier wedding cake that gets pixelated on the way up.
If I understand correctly, this pooling method (DIFFPOOL) implicitly addresses some challenges with train-test splitting on GNNs. Via hierarchical clustering, DIFFPOOL helps the model learn useful representations that capture the underlying structure of the graph at various levels of granularity. This helps inductive models generalize to new, unseen classification tasks.
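For the curious, the core pooling step is compact enough to write out. Here’s a numpy sketch of the published DIFFPOOL equations, with random stand-ins for the learned GNN outputs; it shows the shapes and the coarsening, not the authors’ actual implementation:

```python
import numpy as np

def diffpool_layer(A, Z, S_logits):
    """One DIFFPOOL coarsening step.
    A: (n, n) adjacency, Z: (n, d) node embeddings from a GNN,
    S_logits: (n, c) cluster-assignment scores from a second GNN."""
    S = np.exp(S_logits) / np.exp(S_logits).sum(axis=1, keepdims=True)  # soft cluster assignments
    X_coarse = S.T @ Z       # (c, d): features of the pooled 'super-nodes'
    A_coarse = S.T @ A @ S   # (c, c): connectivity between super-nodes
    return X_coarse, A_coarse

# Fake inputs just to show the shapes; in the real model, Z and S_logits
# each come from their own GNN pass over the input graph.
n, d, c = 8, 16, 3
A = np.random.rand(n, n); A = (A + A.T) / 2          # symmetric "adjacency"
Z, S_logits = np.random.randn(n, d), np.random.randn(n, c)
X2, A2 = diffpool_layer(A, Z, S_logits)              # the next, coarser tier of the cake
```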
A second issue raised by de la Fuente et al. is biologically-informed subsampling. Subsampling is a technique wherein you train a model only on a subset of the total available training data. Though seemingly counterintuitive, there are several reasons you’d do this in the context of a DTI model:
Training can be expensive. Winnowing training data points can be more compute efficient.
If you have an imbalance in the training, rebalancing it via subsampling can even the see-saw—preventing the model from overfitting towards one answer.
The second point is the main issue with DTI networks. Training data on drug-target interactions is sparse and imbalanced: there are relatively few positive (interaction) edges because most drugs don’t interact with most targets, leaving a huge surplus of negative pairs. The implied next problem becomes—how do you choose which negative examples to toss out?
De la Fuente asserted that many papers randomly discard negative examples, which is an oversimplification of the DTI prediction task. Instead, his group proposes a novel method that uses structural similarity (via RMSD) to pair each positive drug-target interaction edge with a negative example. The method looks like this:
Find a positive DTI edge.
Calculate the RMSD between the positive drug-target structure and all other drug-protein structures in the network.
Using RMSD, bin each pairwise relationship into one of three buckets.
<2.5 Å = discard, since the protein may be too simple or small to align specifically
2.5-5 Å = hold these out to create a challenging validation set
5 Å up to a 6-20 Å cap = train on these, as they’re similar enough to create difficult training data but different enough to ensure true negativity. The authors experimented with capping the training set anywhere from 6 to 20 angstroms (the selection loop is sketched in code below).
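Here’s roughly how that selection loop might look in code. The rmsd function is a placeholder for a structural alignment call, and the bucket thresholds are taken from the talk; the group’s real pipeline (see GraphGuest below) surely differs in the details.

```python
def select_negatives(positive_edges, candidate_negatives, rmsd, train_cap=10.0):
    """Pair positives with structurally informed negatives by RMSD bucket.
    rmsd(a, b) is a placeholder for a structural alignment score in angstroms;
    train_cap stands in for the 6-20 A range the authors swept."""
    train_neg, val_neg = [], []
    for pos in positive_edges:
        for cand in candidate_negatives:
            r = rmsd(pos, cand)
            if r < 2.5:
                continue                 # discard: alignment may be trivially good
            elif r < 5.0:
                val_neg.append(cand)     # hard, held-out validation negatives
            elif r <= train_cap:
                train_neg.append(cand)   # hard-but-true training negatives
    return train_neg, val_neg
```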
The authors retrained a popular inductive model (MolTrans) using their RMSD method, finding that it increased AUC versus random negative subsampling when tested on two datasets (BindingDB and BioSNAP). Additionally, the group tested two novel associations in vitro, demonstrating that the revised model picked up on legitimate DTIs (see below). Finally, I’ll mention that the authors published a Python package called GraphGuest that supports DTI graph ablation studies, negative edge selection methods, and access to the augmented DTI datasets.
Zero-Shot Repurposing w/ Clinician-Centric Design
Marinka Zitnik spoke on her group’s work developing TxGNN, a geometric deep learning model for zero-shot drug repurposing. Importantly, TxGNN is designed to perform well on diseases with few or no approved medicines. The model is housed in an interpretable framework that makes its predictions explainable to clinicians, a key differentiator between a model and a model that gets used in a prospective setting.
In this context, zero-shot means that TxGNN is trained to generalize to new drugs and diseases without additional fine-tuning on labeled data. The authors noted that of >17,000 known diseases, only 8% have indicated therapies—underscoring that most clinical use cases will involve zero-shot prediction. As an inductive model, TxGNN aligns with de la Fuente’s central thesis from earlier. Unlike his group’s retrained MolTrans model, however, TxGNN is built on a medical knowledge graph containing information on drugs, diseases, gene pathways, and more.
TxGNN has two main components: (a) a predictor module that returns the probability (0-1) that an arbitrary drug-disease pair will interact and (b) an explainer module that visualizes how predictions were made to encourage clinical usage. I’ll start by breaking down each step of the predictor module (below) and explaining what I think is going on.
Signature Computation—Once given a query disease, TxGNN computes a signature vector that describes latent qualities of the disease and includes information from neighboring nodes.
Similarity Profiling—The model profiles nearby diseases by comparing their signature vectors, generating similarity scores (s_1, s_2, …, s_n) for each pairwise relationship. Similar diseases may share underlying genetic factors or phenotypic traits, for example.
Disease-Disease Recombination—TxGNN retrieves information (embeddings) from diseases similar to the query disease. These are combined into a single, larger embedding (h_i^sim). Conceptually, this is like borrowing information from related diseases to aggregate intelligence about the query disease.
Degree Gating—In parallel, the model generates an embedding (h_i) for the original query disease. Degree gating is the process by which the model decides how much information to pool from other, similar diseases. For example, well-known query diseases might include less information from the disease-disease recombination vector than sparsely known diseases. TxGNN might also include less information from ‘far-away’ diseases compared to proximal ones.
Prediction—Finally, the model combines the embeddings of the query disease and its recombined neighbors into a single embedding (ĥ_i). TxGNN also generates drug embeddings (h_j). The drug and disease embeddings are used to calculate the pair’s probability of interacting (see the sketch after this list).
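Putting the five steps together, here’s a toy numpy rendering of my reading of the predictor. The similarity weighting and degree gate in TxGNN are learned networks; the fixed formulas below are stand-ins that just show how the pieces connect:

```python
import numpy as np

def predict_interaction(h_query, h_neighbors, sims, degree, h_drug):
    """h_query: (d,) query-disease embedding; h_neighbors: (k, d) embeddings
    of the k most similar diseases; sims: (k,) similarity scores;
    degree: query disease's node degree; h_drug: (d,) drug embedding."""
    w = np.exp(sims) / np.exp(sims).sum()             # similarity profiling -> weights
    h_sim = w @ h_neighbors                           # disease-disease recombination
    alpha = degree / (degree + 1.0)                   # degree gate: well-connected
    h_final = alpha * h_query + (1 - alpha) * h_sim   # diseases borrow less
    return 1 / (1 + np.exp(-h_final @ h_drug))        # probability of interaction

# Toy usage with random embeddings:
d, k = 32, 5
p = predict_interaction(np.random.randn(d), np.random.randn(k, d),
                        np.random.randn(k), degree=3, h_drug=np.random.randn(d))
```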
The authors evaluated TxGNN by first comparing its ability to predict drug indications and contraindications for diseases with known treatments. In this setting, TxGNN performs in line with or marginally better than the next strongest baseline (e.g., BioBERT). However, zero-shot prediction is what TxGNN was trained to excel at, and the improvement there is much more drastic: TxGNN boosts performance (AUPRC) by 19% and 24% for indication and contraindication prediction, respectively.
Now let’s get into the explainer module, which I’ve continued to play around with all week (access it here). Performant models that go unused aren’t useful. Anticipating clinician skepticism, the authors engineered the TxGNN Explorer interface to succinctly present how drugs and diseases are connected. Clinicians can visually ‘hop’ their way from a disease to a drug in a fashion similar to what they’re used to.
I enjoyed hearing how the researchers prospectively evaluated TxGNN. To test the model’s novel drug-disease indication predictions, Zitnik et al. used off-label prescription data from >1 million adults in Mount Sinai Health System’s databases. Their logic was that recurring off-label drug-disease pairs might represent real indications. They found that TxGNN’s top (novel) drug-disease predictions were enriched 107% compared to the bottom 50 predictions (contraindications) in Mount Sinai’s data, as shown below. Importantly, clinicians also reported relative confidence in TxGNN’s predictions.
Thanks for reading! Subscribe below if you don’t want to miss Part 4 of our NeurIPS roundup. Feel free to reach out if you’re working in this space!