1. The data challenge in pharmaceutical research
Drug discovery is becoming slower and more expensive. Eroom's Law (Moore's Law in reverse) observes that the inflation-adjusted cost of Research and Development (R&D) per new approved drug has risen exponentially over the last 60 years. One of the early steps in drug discovery is predicting the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of candidate drug molecules. ADMET prediction describes how a drug is absorbed, distributed, metabolized, and eliminated within the body, and is therefore crucial for determining the pharmacokinetics and safety of potential drug candidates.
Machine learning has become indispensable for predicting these properties. However, the databases containing the chemical structures of compounds and their corresponding ADMET properties are typically proprietary and closely guarded. Federated machine learning has shown promising results because it addresses challenges related to privacy, data decentralization, and collaboration across multiple institutions, enabling more accurate and generalizable models.
AbbVie is one of the top biopharmaceutical companies in the world, known for its R&D of innovative drugs in immunology, oncology, neuroscience, and virology. AbbVie and Intel explored Federated Graph Neural Networks as a privacy-preserving and IP-protected way to train Graph NN models over distributed datasets, each with its own privacy constraints.
2. Federated Learning: an effective and scalable tool
2.1 What is Federated Learning
Federated learning, a term originally coined by Google in 2016, aims to solve the challenge of training an Artificial Intelligence (AI) model on data that cannot be moved from private infrastructure to a central server. In federated learning, the model is sent to where the private data resides, and a shared model is built by aggregating the locally computed updates. On a brain tumor segmentation task, this approach achieved up to 99% of the accuracy of a model trained on centralized data. The benefit of federated learning over centralized learning is that the private data never leaves the custody of its owner.
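To make the aggregation step concrete, below is a minimal FedAvg-style sketch (not OpenFL's actual implementation): each locally trained PyTorch model contributes to the shared weights in proportion to the size of its local dataset.

```python
import torch

def federated_average(state_dicts, num_samples):
    """Sketch of FedAvg-style aggregation of locally trained model weights.

    state_dicts: list of model.state_dict() objects, one per collaborator
    num_samples: number of local training examples at each collaborator
    Assumes all models share the same architecture and floating-point parameters.
    """
    total = float(sum(num_samples))
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = sum(sd[key].float() * (n / total)
                            for sd, n in zip(state_dicts, num_samples))
    return averaged

# Usage: global_model.load_state_dict(federated_average(local_states, local_sizes))
```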
2.2 Open Federated Learning
Open Federated Learning (OpenFL) is a distributed Python application that follows a client-server architecture. The server is an aggregator: it keeps the latest model weights and coordinates the tasks of all the clients that connect to it. Each data owner, also known as a collaborator, runs the client application, which trains the global model on local, private data and sends the updates to the remote aggregator, which combines the individual models into a global model. OpenFL was designed for deployment in real-world healthcare settings that require high security and privacy. In OpenFL, all parties agree to a federated learning plan before training. This plan is a YAML file that specifies the model to be run, the training and validation tasks that will run on that model, and other information such as the number of rounds to complete and the network configuration. The model, training code, and plan are distributed out of band by the model owner to all parties before the experiment is launched. This restricts the information exchanged between aggregator and collaborators at runtime to the floating-point weights of the deep learning model, significantly limiting the potential for remote code execution. All traffic between aggregator and collaborators is encrypted over the network through an mTLS connection. Read more about the detailed architecture here.
To prevent more advanced attacks, OpenFL can be run inside a Trusted Execution Environment (TEE), such as Intel Software Guard Extensions (SGX). TEEs protect the confidentiality and integrity of code and data while in use through hardware support and provide the ability to attest that the workload is running in a valid TEE and that the code was exactly what all parties agreed upon.
OpenFL was used in the Federated Tumor Segmentation (FeTS) initiative, the world's largest real-world federation, comprising 71 institutions spread across 6 continents. The resulting federated model yielded a 33% improvement in Dice scores over an initial public model trained centrally.
However, OpenFL had not previously been used for Graph Neural Network (GNN) training. This blog post shares our research in this space, conducted together with AbbVie.
2.3 Graph Neural Networks (GNNs) over Federated Learning: exploring Intra-graph and Inter-graph Learning
The scientific community has recently started investigating how to apply Federated Learning to graph models. Some authors have tried to systematically categorize the challenges encountered in Federated Graph Learning (FGL) to clarify a new field in which even a shared common language is still evolving. In "Federated Graph Learning - A Position Paper" by Zhang et al. (arXiv:2105.11099), four types of FGL are proposed:
- Inter-graph FGL
- Intra-graph FGL (horizontal)
- Intra-graph FGL (vertical)
- Graph-structured FGL
Inter-graph FGL is the most natural derivation of FL: the global model performs a graph-level task, and each contributor brings graphs characterized by the same set of features.
The AbbVie-Intel collaboration used inter-graph FGL to study the ADME properties of molecules represented as graphs.
Our goal was to establish the feasibility of GNN training in inter-graph settings using a simulated Federated Learning environment in which the data was partitioned across collaborators.
OpenFL has a special interface, the workflow API, that allows users to simulate a Federated Learning scenario on a single machine with a single dataset. It supports simulating N collaborators driven by one aggregator, with each collaborator assigned a subset of the total dataset. We used this simulated distributed environment to compare the results of training GNNs in inter-graph settings in centralized, stand-alone mode versus in FL mode.
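For illustration, here is a condensed sketch of what such a simulation can look like with the workflow API. The class and decorator names follow OpenFL's experimental workflow tutorials, but exact signatures may differ across versions, and train_one_epoch and average_models are hypothetical helpers standing in for the PyG training and aggregation logic.

```python
from openfl.experimental.interface import FLSpec, Aggregator, Collaborator
from openfl.experimental.runtime import LocalRuntime
from openfl.experimental.placement import aggregator, collaborator

class FedGNNFlow(FLSpec):
    @aggregator
    def start(self):
        # Send the current global model to every simulated collaborator
        self.collaborators = self.runtime.collaborators
        self.next(self.local_train, foreach='collaborators')

    @collaborator
    def local_train(self):
        # Train on this collaborator's private shard of the graph dataset
        train_one_epoch(self.model, self.train_loader)  # hypothetical helper
        self.next(self.join)

    @aggregator
    def join(self, inputs):
        # Combine the locally updated weights into a new global model
        self.model = average_models([c.model for c in inputs])  # hypothetical helper
        self.next(self.end)

    @aggregator
    def end(self):
        pass

# Simulate N collaborators driven by one aggregator on a single machine;
# each collaborator is assigned its own subset of the total dataset.
runtime = LocalRuntime(aggregator=Aggregator(),
                       collaborators=[Collaborator(name=f'col_{i}') for i in range(4)])
flow = FedGNNFlow()
flow.runtime = runtime
flow.run()
```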
3. Using OpenFL with Pytorch_Geometric (PyG)
3.1 Introduction to PyG and Intel Optimization for PyG
PyTorch Geometric (PyG) is an extension library for PyTorch that simplifies the implementation of deep learning models for graphs and irregularly structured data. PyG bundles state-of-the-art graph representation learning by implementing layers, architectures, and recent research findings and recommendations. It is a Graph Neural Network framework suited for both academia and industry.
PyG conceptually comprises several components:
- A set of graph-based neural network building blocks for constructing customized models, useful for research and business purposes.
- A set of graph transformations and augmentations for preparing the input data.
- A collection of tutorials and examples.
Intel contributes to the development of PyG in many ways, including optimizations for data loading, data sampling, algorithmic optimizations, and XPU support. Most notably, Intel recently introduced hierarchical neighborhood sampling, which accelerates PyG by reducing the computation necessary at each layer.
3.2 Using OpenFL with PyG
We used the OpenFL workflow API to test our ideas and validate the application of Federated Learning to our pharmaceutical research problem. OpenFL allows independent entities to collaborate on AI research while preserving the privacy of each entity's data and protecting the model IP (Intellectual Property).
Considerable research in the pharmaceutical field revolves around molecules, which are often represented as graphs. This leads naturally to the idea of testing how well the training of Graph Neural Networks works in a Federated Learning scenario.
Together with AbbVie, we simulated the training of simple GNNs over a Federated Learning scenario, using OpenFL to implement the Federated Learning environment and PyG as a Graph Neural Network framework to drive the training of the GNN models.
Our experiments aimed to show computational consistency between stand-alone/centralized training of our GNN models and training on the same data in a Federated Learning context simulated via the OpenFL workflow API. AbbVie scientists suggested that a good first indicator would be to investigate the training profiles of some simple GNN models on two small datasets used in pharmaceutical research.
For the first experiment, we used a PyG example with a GIN model and the MUTAG dataset from the TUDatasets collection.
This dataset is very small and describes a binary graph classification problem: it contains 188 nitroaromatic compounds with 7 discrete node labels, and the goal is to predict their mutagenicity in Salmonella typhimurium. In the figure below, we show the accuracy and loss profiles for the centralized (also called stand-alone) and federated settings with a batch size of 100.
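For reference, below is a minimal PyG sketch of this first setup, modeled on PyG's public GIN example; the two-layer depth and hidden size are illustrative assumptions rather than the exact configuration used.

```python
import torch
from torch.nn import Linear, ReLU, Sequential
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GINConv, global_add_pool

# MUTAG: 188 molecular graphs, 7 discrete node labels, binary target
dataset = TUDataset(root='data/TUDataset', name='MUTAG')
loader = DataLoader(dataset, batch_size=100, shuffle=True)  # batch size used in the experiment

class GIN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        self.conv1 = GINConv(Sequential(Linear(in_channels, hidden_channels), ReLU(),
                                        Linear(hidden_channels, hidden_channels)))
        self.conv2 = GINConv(Sequential(Linear(hidden_channels, hidden_channels), ReLU(),
                                        Linear(hidden_channels, hidden_channels)))
        self.lin = Linear(hidden_channels, num_classes)

    def forward(self, x, edge_index, batch):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        x = global_add_pool(x, batch)  # aggregate node embeddings into a graph embedding
        return self.lin(x)             # graph-level class logits

model = GIN(dataset.num_node_features, 32, dataset.num_classes)
```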
The second experiment was based on a custom model prepared by AbbVie and used the HIV dataset from MoleculeNet.
Below is an illustrative code sketch of the kind of GNN model we focused on in this second experiment (the exact AbbVie model is not reproduced here).
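This is a minimal sketch, assuming a small GCN-based binary classifier over the PyG MoleculeNet HIV dataset; the layer sizes and pooling choice are assumptions for illustration only.

```python
import torch
from torch_geometric.datasets import MoleculeNet
from torch_geometric.nn import GCNConv, global_mean_pool

# HIV dataset from MoleculeNet: molecular graphs with a binary activity label
dataset = MoleculeNet(root='data/MoleculeNet', name='HIV')

class HIVNet(torch.nn.Module):
    def __init__(self, hidden_channels=64):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.lin = torch.nn.Linear(hidden_channels, 1)  # single logit: active vs. inactive

    def forward(self, x, edge_index, batch):
        x = self.conv1(x.float(), edge_index).relu()  # node features are stored as integers
        x = self.conv2(x, edge_index).relu()
        x = global_mean_pool(x, batch)                # graph-level embedding
        return self.lin(x)

# Given the ~27:1 class imbalance, pair the logit output with a weighted
# torch.nn.BCEWithLogitsLoss (e.g., via its pos_weight argument) during training.
model = HIVNet()
```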
The dataset is highly unbalanced, containing around 45,000 molecules classified by their activity against HIV or, more precisely, by their experimentally measured ability to inhibit HIV replication. The ratio of inactive (0) to active (1) molecules is around 27 to 1.
As shown in the figures, the training profiles of the cases studied look similar, with comparable initial and final loss values; the same can be said of the other quantities we tracked during training.
This points to computational consistency across the two training setups and indicates that GNN training for the inter-graph case is well supported in federated settings; together, PyG and OpenFL hold promise for addressing more complex cases.
3.3 Conclusions
Our experimental results were very promising, indicating that Federated Learning is appropriate for training GNNs, especially in inter-graph settings. As stated above, the work was conducted in a simulated Federated Learning environment with small sample datasets. We invite others to build on our work with larger, real-world data. Please contact us if you would like to collaborate.
Authors:
Andrea Zanetti; Mattson Thieme; Abhishek Pandey; Patrick Foley; Prashant Shah; Malini Bhandaru; Marek Strachacki