Multi-Modal Federated Learning on Non-IID Data
Existing work in federated learning has focused on unimodal tasks, where training generally involves a single modality, such as images or text. As a result, the global model is unimodal: it contains a modality-specific neural network structure and takes samples from that specific modality as its input for training. It is intriguing to explore how federated learning can be extended effectively to multimodal tasks, where models are trained with datasets from multiple modalities, such as images and text. Though several existing studies have proposed to train multimodal models with federated learning, the specific challenges imposed by non-IID data distributions in the context of multimodal federated learning, referred to as multimodal FL, have not been explored.
When non-IID data in federated learning is referenced, this typically refers to differences between the local data distributions of different clients. The most common form is distribution-based label non-IID data, in which the class distribution among clients follows the Dirichlet distribution.
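For concreteness, below is a minimal sketch (in Python with NumPy) of how such a Dirichlet-based label partition is typically generated; the function name and its parameters are illustrative, not taken from the paper:

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha=0.5, seed=42):
    """Split sample indices across clients so that, for each class,
    the per-client shares follow a Dirichlet(alpha) distribution.
    A smaller alpha yields a more skewed (more non-IID) partition."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Draw this class's per-client proportions from Dirichlet(alpha).
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn the proportions into split points over this class's samples.
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, splits)):
            client_indices[k].extend(part.tolist())
    return client_indices

# Example: partition 1,000 samples over 5 classes among 10 clients.
labels = np.random.randint(0, 5, size=1000)
partition = dirichlet_partition(labels, num_clients=10, alpha=0.5)
```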
The original objective of this project is to foray into this uncharted territory by focusing on the effects of non-IID data distributions in multimodal FL. Intuitively, as more information is available from complementary modalities, multimodal FL should outperform the corresponding unimodal FL. However, we make the counter-intuitive observation from our experimental results that multimodal FL not only requires more communication rounds to reach convergence, but also lags far behind conventional unimodal FL with respect to the accuracy of the trained models.
Figure: A performance comparison of the Federated Averaging (FedAvg) algorithm with unimodal FL, called FedAvg-RGB, and with multimodal FL, called FedAvg-ARGB, performed on the Kinetics dataset. With FedAvg-RGB, images are used as the sole input for training the global model; with FedAvg-ARGB, both images and audio are used as input for training. Each client computes the validation accuracy by evaluating the updated local model on its local validation dataset. Boxplots show the divergence among local models at participating clients.
We support this claim by training global models for the action recognition task using distribution-based label non-IID data, where the class distribution among clients follows the Dirichlet distribution (with a concentration parameter of 0.5). The classical FL algorithm, Federated Averaging (FedAvg), is applied to train the unimodal and multimodal global models, abbreviated as FedAvg-RGB and FedAvg-ARGB, respectively. Two empirical observations can be drawn from the figure above. First, the overall validation accuracy of FedAvg-RGB is consistently higher than that of FedAvg-ARGB during the training process; as the number of communication rounds increases, the performance of the multimodal model lags behind that of the unimodal model by an increasing margin. Second, compared with FedAvg-RGB, FedAvg-ARGB shows a higher degree of divergence across participating clients with respect to the validation accuracy of their local models. This, in general, causes a lower convergence speed and a lower final accuracy after convergence is achieved.
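For reference, the FedAvg aggregation rule used here is simply a dataset-size-weighted average of the clients' local weights; below is a minimal sketch, with model weights represented as dictionaries of NumPy arrays purely for illustration:

```python
import numpy as np

def fedavg_aggregate(local_weights, num_samples):
    """FedAvg server-side aggregation: average local model weights,
    with each client weighted by the size of its local dataset."""
    total = float(sum(num_samples))
    return {
        name: sum((n / total) * w[name]
                  for w, n in zip(local_weights, num_samples))
        for name in local_weights[0]
    }

# Example with two clients holding 100 and 300 samples, respectively.
client_a = {"layer": np.array([1.0, 2.0])}
client_b = {"layer": np.array([3.0, 4.0])}
global_weights = fedavg_aggregate([client_a, client_b], [100, 300])
```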
To address this challenge, we reveal how non-IID multimodal data affects the federated training process by presenting a thorough theoretical analysis. Our results show that multimodal FL's lower performance can be attributed to weight divergence, which quantifies the difference between weights updated on non-IID multimodal data and those updated on centralized data. Specifically, due to the complexity of non-IID multimodal data, both the local data distribution across clients and the distribution between modalities can be extremely heterogeneous. As a result, multimodal models are often prone to overfitting, as they are trained on unbalanced and small local datasets. To make matters worse, overfitting and inconsistent generalization rates appear in the modality sub-networks and the local models simultaneously. With unimodal FL, existing mechanisms have been proposed to address the problems of overfitting and heterogeneity among clients, such as local adaptation and weighted global aggregation. However, our experiments show that these mechanisms fail to provide an effective solution in the context of multimodal FL.
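As a rough illustration, weight divergence is commonly measured as the relative distance between the weights learned under federated training and those learned on centralized data; the sketch below uses that common definition, which may differ in detail from the exact metric in our analysis:

```python
import numpy as np

def weight_divergence(w_fed, w_central):
    """Relative weight divergence ||w_fed - w_central|| / ||w_central||,
    with both models flattened into parameter vectors."""
    w_fed = np.asarray(w_fed, dtype=float)
    w_central = np.asarray(w_central, dtype=float)
    return np.linalg.norm(w_fed - w_central) / np.linalg.norm(w_central)
```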
Building upon these theoretical insights, the crux of this work is the design of a new mechanism, referred to as hierarchical gradient blending (HGB), that adaptively computes an optimal blending of modalities and reweights updates from the clients according to their overfitting and generalization behaviour. Intuitively, HGB corresponds to an optimization problem with overfitting-to-generalization ratio (OGR) minimization as the objective function:
$$\min_{\left\{z_m\right\}_{m=1}^M,\,\left\{p_k\right\}_{k=1}^K} \left(\frac{\left[L^T(\mathbf{w}_{t_{p-1}}) - L^T(\mathbf{w}_{t_p})\right] - \left[L^*(\mathbf{w}_{t_{p-1}}) - L^*(\mathbf{w}_{t_p})\right]}{L^*(\mathbf{w}_{t_{p-1}}) - L^*(\mathbf{w}_{t_p})}\right)^2$$
where $L^T$ is the training loss, and the generalization error $L^*$ measures how accurately the model can predict outcome values for previously unseen data. In practice, $L^*$ can be computed by applying the model to the validation dataset. The adjacent global weights $\mathbf{w}_{t_{p-1}}$ and $\mathbf{w}_{t_p}$ are obtained by aggregating local updates from participating clients, where $t_{p-1}$ and $t_p$ denote two adjacent communication rounds.
The obtained optimal modality weights $\left\{z_m\right\}_{m=1}^M$ and aggregation weights $\left\{p_k\right\}_{k=1}^K$ can reduce generalization errors while mitigating the divergence between local updates when training the global model. Our new mechanism does not introduce any trainable parameters, making it computationally friendly and easy to use. From a theoretical perspective, we show that HGB guarantees convergence in multimodal FL with non-IID data.
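To make the objective concrete, the sketch below computes the per-round quantity inside the square, along with a hypothetical inverse-OGR normalization of the weights; HGB derives the optimal $\left\{z_m\right\}$ and $\left\{p_k\right\}$ from the objective itself, so the normalization shown here is an assumption for illustration rather than the actual HGB solution:

```python
import numpy as np

def ogr_squared(train_prev, train_curr, val_prev, val_curr, eps=1e-8):
    """Squared overfitting-to-generalization ratio between two adjacent
    communication rounds t_{p-1} and t_p.

    delta_gen is the drop in generalization error L*; delta_overfit is
    how much faster the training loss L^T drops than L* does, i.e. the
    growth in overfitting between the two rounds."""
    delta_train = train_prev - train_curr
    delta_gen = val_prev - val_curr
    delta_overfit = delta_train - delta_gen
    return (delta_overfit / (delta_gen + eps)) ** 2

def generalization_aware_weights(ogr_values, eps=1e-8):
    """Hypothetical reweighting: modality sub-networks or clients with
    a lower OGR (better generalization) get a larger normalized weight."""
    inv = 1.0 / (np.asarray(ogr_values, dtype=float) + eps)
    return inv / inv.sum()

# Example: three clients; the first generalizes best and gets the largest p_k.
p = generalization_aware_weights([0.1, 0.5, 2.0])
```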
We have evaluated the performance of our proposed HGB training algorithm in multimodal FL using Plato, an open-source research framework that facilitates scalable federated learning research. Based on the Kinetics dataset, which contains three modalities for action recognition tasks, we wish to evaluate the following:

- whether the multimodal global model trained by FedHGB can outperform its centrally trained unimodal counterparts;
- whether FedHGB is able to achieve higher performance than the state-of-the-art heterogeneous federated optimization method, FedNova, in terms of both accuracy and convergence speed.
After performing extensive experiments on 50 clients with multimodal non-IID data, the accuracy of FedHGB is observed to be 2.06% higher than that of the centrally trained RGB models, while achieving an improvement of 7.85% over FedNova, with a reduction of at least 100 communication rounds. Our experimental results have provided convincing evidence that FedHGB tends to assign higher weights to those clients with low generalization errors, making it a generalization-aware algorithm, which is consistent with our theoretical design.
Authored by Sijia Chen, this work has been published in the Proceedings of IEEE INFOCOM 2022.