#### A summary of dataset distribution techniques for Federated Learning on the CIFAR benchmark dataset

**Federated Learning** (FL) is a method to train Machine Learning (ML) models in a distributed setting [1]. The idea is that clients (for example hospitals) want to cooperate without sharing their private and sensitive data. In FL, each client keeps its private data locally and trains an ML model on it. A central server then collects and aggregates the model parameters, building a global model that draws on information from all clients’ data. Ideally, this serves as **privacy protection by design**.

A long line of research has studied FL’s efficiency, privacy, and fairness. Here we will focus on the benchmark datasets used to evaluate *horizontal FL methods*, where the clients share the same task and data type but each holds its own data samples.

If you want to know more about Federated Learning and what I work on, visit our research lab website!

There are three types of datasets in the literature:

**Real FL scenario**: an application where FL is genuinely needed. It has natural distributions and sensitive data. However, given the nature of FL, if you want to keep the data local you won’t publish the dataset online for benchmarking, so datasets of this kind are hard to find. OpenMined, the community behind PySyft, tries to organize an FL community of universities and research labs to host data in a more realistic scenario. Additionally, there are applications where privacy awareness has risen only recently, so data may be publicly available even though the demand for FL exists. One such application is smart electricity meters [2].

**FL benchmark datasets**: these datasets are designed to serve as FL benchmarks. The distribution is realistic, but the sensitivity of the data is questionable, as they are built from publicly available sources. One example is building an FL dataset from Reddit posts, treating the users as clients so that each user’s posts form one partition. The LEAF project proposed more datasets like this [3].

**Distributing standard datasets**: a few well-known datasets, such as CIFAR and ImageNet for images, are used as benchmarks in many Machine Learning works. Here FL researchers define a distribution according to their research questions. This method makes sense if the topic is well studied in a standard ML scenario and one wants to compare an FL algorithm to the centralized state of the art (SOTA). However, such an artificial distribution doesn’t reveal every problem caused by distribution skew, for example when the clients collect images with very different cameras or in different lighting conditions.

Because datasets in the last category are not distributed by design, past research works have split them in several ways. In the rest of this story, I will summarise distribution techniques used for the CIFAR dataset in a federated scenario.

#### CIFAR dataset

The CIFAR-10 and CIFAR-100 datasets contain 32×32 color images labeled with mutually exclusive classes [4]. CIFAR-10 has 10 classes of 6000 images each, and CIFAR-100 has 100 classes of 600 images each. They are used in many image classification tasks, and one can access dozens of models evaluated on them, even browsing them on a PapersWithCode leaderboard.

### Data partitioning in Federated Learning

#### Uniform distribution

This is considered to be **independent and identically distributed** (IID) data. Data points are randomly allocated to clients.
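As a minimal sketch of the uniform split (the function name `iid_partition`, the 10 clients and the fixed seed are my own illustrative choices, not taken from any of the cited works), one can shuffle the sample indices and cut them into near-equal parts:

```python
import numpy as np

def iid_partition(num_samples, num_clients, seed=0):
    """Shuffle sample indices and split them into near-equal client shares."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_samples)
    # array_split tolerates sizes that do not divide evenly
    return np.array_split(indices, num_clients)

# CIFAR-10 has 50,000 training images; 10 clients is an arbitrary choice.
client_indices = iid_partition(num_samples=50_000, num_clients=10)
print([len(idx) for idx in client_indices])
```

Each client then sees roughly the global class balance, because every sample is equally likely to land on every client.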

#### Single (n-) class clients

Data points allocated to a specific client come from the same class or a small set of classes. This can be seen as an extreme non-IID setting. Examples of this distribution are in [1,5–8]. The work that first named the setting Federated Learning [1] uses 200 single-class sets and gives two sets to each client, making them 2-class clients. [5–7] also use 2-class clients.
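The shard idea can be sketched as follows. This is an illustration under my own assumptions (synthetic, perfectly balanced labels standing in for CIFAR-10, and a hypothetical helper `shard_partition`), not the exact code of [1]: sort the indices by label, cut them into equally sized shards, and hand each client a fixed number of shards.

```python
import numpy as np

def shard_partition(labels, num_clients, shards_per_client=2, seed=0):
    """Give each client `shards_per_client` label-sorted shards."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Sorting by label makes every contiguous shard (nearly) single-class
    sorted_idx = np.argsort(labels, kind="stable")
    shards = np.array_split(sorted_idx, num_clients * shards_per_client)
    shard_order = rng.permutation(len(shards))
    clients = []
    for c in range(num_clients):
        chosen = shard_order[c * shards_per_client:(c + 1) * shards_per_client]
        clients.append(np.concatenate([shards[s] for s in chosen]))
    return clients

# Synthetic stand-in for the CIFAR-10 training labels: 10 classes x 5000 images
labels = np.repeat(np.arange(10), 5000)
clients = shard_partition(labels, num_clients=100)  # 200 shards in total
print(sorted({len(np.unique(labels[idx])) for idx in clients}))
```

With 100 clients and two shards each this yields the 200 sets mentioned above. A client can end up with a single class if both of its shards come from the same class.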

[9] builds on the hierarchical classes in CIFAR-100: clients have data points from one subclass in each superclass. This way, in the superclass classification task every client has samples from each (super)class, yet a distribution skew is simulated because the data points come from different subclasses. For example, one client has access to lion images while another has tiger images, and the superclass task is to categorize both as large carnivores.

#### Dominant class clients

[5] also uses a mixture of uniform and 2-class clients, which means half of the data points come from the 2 dominant classes, and the rest are uniformly selected from the other classes. [10] uses an 80%-20% partition: 80% of a client’s data points are selected from a single dominant class, and the rest are uniformly selected from the other classes.
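Here is a hedged sketch of the 80%-20% variant. The helper `dominant_class_partition`, the rule that client *c* gets class *c* (modulo the number of classes) as its dominant class, and the synthetic labels are all assumptions of mine for illustration, not details taken from [10]:

```python
import numpy as np

def dominant_class_partition(labels, num_clients, samples_per_client,
                             dominant_fraction=0.8, seed=0):
    """Disjoint client index sets with one dominant class per client."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    available = np.ones(len(labels), dtype=bool)  # samples not yet assigned
    n_dominant = int(round(dominant_fraction * samples_per_client))
    clients = []
    for c in range(num_clients):
        dominant = classes[c % len(classes)]  # assumed assignment rule
        dom_pool = np.flatnonzero(available & (labels == dominant))
        dom_idx = rng.choice(dom_pool, size=n_dominant, replace=False)
        available[dom_idx] = False
        # The remainder is drawn uniformly from all other-class samples left
        rest_pool = np.flatnonzero(available & (labels != dominant))
        rest_idx = rng.choice(rest_pool, size=samples_per_client - n_dominant,
                              replace=False)
        available[rest_idx] = False
        clients.append(np.concatenate([dom_idx, rest_idx]))
    return clients

labels = np.repeat(np.arange(10), 5000)  # synthetic stand-in for CIFAR-10
clients = dominant_class_partition(labels, num_clients=10, samples_per_client=500)
print(np.bincount(labels[clients[0]], minlength=10))
```

`rng.choice` raises a `ValueError` if a pool runs out of samples, so the number of clients and samples per client have to fit the dataset size.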

#### Dirichlet distribution

To understand the Dirichlet distribution, I follow the example of this blog post. Let’s say one wants to produce dice with probabilities θ=(1/6,1/6,1/6,1/6,1/6,1/6) for the numbers 1–6. In reality, nothing is perfect, so each die will be a bit skewed: 4 might be a bit more likely and 3 a bit less likely, for example. The Dirichlet distribution describes this variety with a parameter vector **α=(α₁,α₂,…,α₆)**. A larger αᵢ strengthens the weight of that number, and a larger overall sum of the αᵢ values makes the sampled probabilities (dice) more similar to each other. Turning back to the dice example, to get dice that are fair on average every αᵢ should be equal, and the larger the α value, the better manufactured the dice are. As the Dirichlet distribution is a multivariate generalization of the beta distribution, let’s display some examples of the beta distribution (a Dirichlet distribution with two outcomes, i.e. a coin instead of a die):

I reproduced the visualization in [11], using the same α value for each αᵢ. This is called a **symmetric Dirichlet distribution**. We can see that as the α value decreases, unbalanced dice become more likely. The figures below show the Dirichlet distribution for different α values. Here each row represents a class, each column is a client, and the area of the circles is proportional to the probabilities.
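The effect of α on the dice can also be checked numerically. The sketch below uses NumPy’s `Generator.dirichlet`; the helper name `sample_dice`, the three α values and the number of dice are arbitrary choices of mine. It measures how far the sampled dice are, on average, from a perfectly fair die:

```python
import numpy as np

def sample_dice(alpha, num_dice=1000, seed=0):
    """Sample `num_dice` six-sided dice from a symmetric Dirichlet(alpha)."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.full(6, alpha), size=num_dice)

fair = np.full(6, 1 / 6)
for alpha in (0.1, 1.0, 100.0):
    dice = sample_dice(alpha)
    # Mean absolute deviation of the face probabilities from 1/6
    spread = np.abs(dice - fair).mean()
    print(f"alpha={alpha:>5}: mean |p - 1/6| = {spread:.3f}")
```

The deviation shrinks as α grows, which matches the picture of better manufactured dice for large α and heavily loaded dice for small α.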

**Distribution over classes**: The samples for each client are drawn independently with class distribution following the Dirichlet method. [11, 16] use this version of the Dirichlet distribution.

Each client has a predetermined number of samples, but the classes are chosen randomly, so the final total class representation will be unbalanced. For the clients’ class distributions, α→∞ recovers the prior (uniform) distribution, while α→0 yields single-class clients.
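A minimal sketch of this variant, assuming a symmetric Dirichlet with the same α for every class (the exact parameterization differs between papers, so treat this as one possible reading). The hypothetical helper `dirichlet_class_counts` only returns how many samples of each class a client receives; mapping those counts to concrete images is left out:

```python
import numpy as np

def dirichlet_class_counts(num_clients, num_classes, samples_per_client,
                           alpha, seed=0):
    """Per-client class counts, shape (num_clients, num_classes)."""
    rng = np.random.default_rng(seed)
    # One class-probability vector per client
    class_probs = rng.dirichlet(np.full(num_classes, alpha), size=num_clients)
    # Fixed client size, random class composition
    return np.stack([rng.multinomial(samples_per_client, p) for p in class_probs])

for alpha in (0.1, 100.0):
    counts = dirichlet_class_counts(num_clients=5, num_classes=10,
                                    samples_per_client=500, alpha=alpha)
    print(f"alpha={alpha}")
    print(counts)
```

Every row sums to the same client size, while the column sums (the global class totals) are not controlled, which is the imbalance described above.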

**Distribution over clients**: if we know the total number of samples in a class and the number of clients, we can distribute the samples to the clients class by class. This will result in clients having a different number of samples (which is very typical in FL), while the global class distribution is balanced. [12] uses this variation of the Dirichlet distribution.
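The class-by-class split can be sketched like this. It is an illustration with a hypothetical helper `dirichlet_client_partition` and synthetic balanced labels, not the exact code of [12]: for each class, a Dirichlet sample over the clients decides which fraction of that class goes to which client.

```python
import numpy as np

def dirichlet_client_partition(labels, num_clients, alpha, seed=0):
    """Split every class across clients with Dirichlet(alpha) proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    clients = [[] for _ in range(num_clients)]
    for k in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == k))
        shares = rng.dirichlet(np.full(num_clients, alpha))
        # Turn the shares into cut points inside this class's index list
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, chunk in zip(clients, np.split(idx, cuts)):
            client.extend(chunk.tolist())
    return [np.array(c, dtype=int) for c in clients]

labels = np.repeat(np.arange(10), 5000)  # synthetic stand-in for CIFAR-10
clients = dirichlet_client_partition(labels, num_clients=10, alpha=0.5)
print([len(idx) for idx in clients])
```

Every sample is assigned exactly once, so the global class balance of the dataset is preserved while the client sizes differ.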

While works like [11–16] follow and cite each other in using the Dirichlet distribution, they use the two different methods. Furthermore, the different experiments use different α values, which can result in very different performance. [11,12] use α=0.1 and [13–15] use α=0.5, while [16] gives an overview of different α values. These design choices undermine the original purpose of using the same benchmark dataset to evaluate algorithms.

**Asymmetric Dirichlet distribution**: one can use different αᵢ values to simulate more resourceful clients. For example, the figure below is produced using αᵢ = 1/*i* for the *i*th client. To my knowledge this is not represented in the literature; instead, the Zipf distribution is used in [17].
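As a sketch of the 1/*i* example (the helper name `asymmetric_client_shares` and the choice to draw one share vector per class are my assumptions), the only change from the symmetric case is the parameter vector passed to the sampler:

```python
import numpy as np

def asymmetric_client_shares(num_clients, num_classes, seed=0):
    """Client shares per class from Dirichlet(1, 1/2, ..., 1/num_clients)."""
    rng = np.random.default_rng(seed)
    alpha = 1.0 / np.arange(1, num_clients + 1)  # alpha_i = 1/i
    # Row k holds how class k is split across the clients
    return rng.dirichlet(alpha, size=num_classes)

shares = asymmetric_client_shares(num_clients=10, num_classes=10)
print(shares.mean(axis=0).round(3))  # average share per client
```

The expected share of client *i* is αᵢ divided by the sum of all α values, so clients with a low index tend to receive more data.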

#### Zipf distribution

[17] uses a combination of Zipf and Dirichlet distributions. It uses the Zipf distribution to determine the number of samples at each client and then selects the class distribution using the Dirichlet.

In the Zipf (zeta) distribution the frequency of an item is inversely proportional to its rank in a frequency table. Zipf’s law can be observed in many real-world datasets, for example regarding the word frequency in language corpora [18].
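A simple way to turn this into client sizes is to make the size of the client with rank *r* proportional to 1/*r*ˢ. This is a deterministic sketch under my own assumptions (the helper `zipf_client_sizes` and the exponent parameter are hypothetical), and not necessarily the procedure of [17]:

```python
import numpy as np

def zipf_client_sizes(total_samples, num_clients, exponent=1.0):
    """Client sizes proportional to 1 / rank**exponent, summing to the total."""
    ranks = np.arange(1, num_clients + 1)
    weights = 1.0 / ranks ** exponent
    weights /= weights.sum()
    sizes = np.floor(weights * total_samples).astype(int)
    # Hand the samples lost to rounding to the largest clients first
    sizes[: total_samples - sizes.sum()] += 1
    return sizes

sizes = zipf_client_sizes(total_samples=50_000, num_clients=10)
print(sizes, sizes.sum())
```

The class composition of each client can then be chosen separately, for example with a Dirichlet sample per client as described above for [17].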

### Conclusion

Benchmarking federated learning methods is a challenging task. Ideally, one uses predefined real federated datasets. However, if a certain scenario has to be simulated without a good existing dataset to cover it, one can use data distribution techniques. Proper documentation for reproducibility and a motivation for the design choice are important. Here I summarized the most common methods already in use for FL algorithm evaluation. Visit this Colab notebook for the code used for this story!

### References

[1] McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017, April). Communication-efficient learning of deep networks from decentralized data. In *Artificial intelligence and statistics* (pp. 1273–1282). PMLR.

[2] Savi, M., & Olivadese, F. (2021). Short-term energy consumption forecasting at the edge: A federated learning approach. *IEEE Access*, *9*, 95949–95969.

[3] Caldas, S., Duddu, S. M. K., Wu, P., Li, T., Konečný, J., McMahan, H. B., … & Talwalkar, A. (2019). LEAF: A benchmark for federated settings. *Workshop on Federated Learning for Data Privacy and Confidentiality*.

[4] Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images. *Master’s thesis, University of Toronto*.

[5] Liu, W., Chen, L., Chen, Y., & Zhang, W. (2020). Accelerating federated learning via momentum gradient descent. *IEEE Transactions on Parallel and Distributed Systems*, *31*(8), 1754–1766.

[6] Zhang, L., Luo, Y., Bai, Y., Du, B., & Duan, L. Y. (2021). Federated learning for non-iid data via unified feature learning and optimization objective alignment. In *Proceedings of the IEEE/CVF international conference on computer vision* (pp. 4420–4428).

[7] Zhang, J., Guo, S., Ma, X., Wang, H., Xu, W., & Wu, F. (2021). Parameterized knowledge transfer for personalized federated learning. *Advances in Neural Information Processing Systems*, *34*, 10092–10104.

[8] Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., & Chandra, V. (2018). Federated learning with non-iid data. *arXiv preprint arXiv:1806.00582*.

[9] Li, D., & Wang, J. (2019). Fedmd: Heterogenous federated learning via model distillation. *arXiv preprint arXiv:1910.03581*.

[10] Wang, H., Kaplan, Z., Niu, D., & Li, B. (2020, July). Optimizing federated learning on non-iid data with reinforcement learning. In *IEEE INFOCOM 2020-IEEE Conference on Computer Communications* (pp. 1698–1707). IEEE.

[11] Lin, T., Kong, L., Stich, S. U., & Jaggi, M. (2020). Ensemble distillation for robust model fusion in federated learning. *Advances in Neural Information Processing Systems*, *33*, 2351–2363.

[12] Luo, M., Chen, F., Hu, D., Zhang, Y., Liang, J., & Feng, J. (2021). No fear of heterogeneity: Classifier calibration for federated learning with non-iid data. *Advances in Neural Information Processing Systems*, *34*, 5972–5984.

[13] Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., & Khazaeni, Y. (2019, May). Bayesian nonparametric federated learning of neural networks. In *International conference on machine learning* (pp. 7252–7261). PMLR.

[14] Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., & Khazaeni, Y. (2020). Federated learning with matched averaging. In *International Conference on Learning Representations*.

[15] Li, Q., He, B., & Song, D. (2021). Model-contrastive federated learning. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition* (pp. 10713–10722).

[16] Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. *arXiv preprint arXiv:1909.06335*.

[17] Wadu, M. M., Samarakoon, S., & Bennis, M. (2021). Joint client scheduling and resource allocation under channel uncertainty in federated learning. *IEEE Transactions on Communications*, *69*(9), 5962–5974.

[18] Fagan, S., & Gençay, R. (2010). An introduction to textual econometrics. In A. Ullah & D. E. A. Giles (Eds.), *Handbook of Empirical Economics and Finance* (pp. 133–153). CRC Press.

From Centralized to Federated Learning was originally published in Towards Data Science on Medium.