====== Software projects developed at DCL ======
  
DCL has a github page where most new software projects are published: [[https://github.com/LPD-EPFL/]]

===== Garfield =====

Designed for the DSN '21 paper: "//Garfield: System Support for Byzantine Machine Learning//."

Garfield is a library that transparently makes machine learning (ML) applications, initially built with popular (but fragile) frameworks such as TensorFlow and PyTorch, Byzantine-resilient. Garfield relies on a novel object-oriented design that reduces the coding effort and addresses the vulnerability of the shared-graph architecture used by classical ML frameworks. It encompasses various communication patterns and supports computation on both CPUs and GPUs, making it possible to study the general question of the practical cost of Byzantine resilience in ML applications. Garfield has been thoroughly evaluated with three main ML architectures: (1) a single server with multiple workers, (2) several servers and workers, and (3) peer-to-peer settings. Using Garfield, we highlight interesting facts about the cost of Byzantine resilience. In particular: (1) Byzantine resilience, unlike crash resilience, induces an accuracy loss; (2) the throughput overhead comes more from communication than from robust aggregation; and (3) tolerating Byzantine servers costs more than tolerating Byzantine workers.
The source code of Garfield was evaluated by experts from C4DT@EPFL, and the open-source version was also used in other projects.
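
To give a flavor of what the library automates, here is a minimal PyTorch sketch of the kind of robust server-side aggregation step Garfield performs in the single-server, multi-worker setup, using a coordinate-wise median as the aggregation rule. This is not Garfield's actual API; all names below are hypothetical.

<code python>
# Hypothetical sketch (not Garfield's actual API): a server-side robust
# aggregation step in PyTorch, in the spirit of Garfield's
# single-server / multi-worker deployment.
import torch

def coordinate_wise_median(gradients):
    """Aggregate a list of flattened worker gradients with the
    coordinate-wise median, a classic Byzantine-resilient rule."""
    stacked = torch.stack(gradients)      # shape: (num_workers, num_params)
    return stacked.median(dim=0).values   # robust to a minority of outliers

def server_step(model, worker_gradients, lr=0.1):
    """Apply one robustly aggregated gradient step to the server model."""
    aggregated = coordinate_wise_median(worker_gradients)
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p -= lr * aggregated[offset:offset + n].view_as(p)
            offset += n

# Usage: gradients would normally arrive over the network from workers,
# some of which may be Byzantine (i.e., arbitrarily corrupted).
model = torch.nn.Linear(10, 1)
num_params = sum(p.numel() for p in model.parameters())
worker_gradients = [torch.randn(num_params) for _ in range(5)]
worker_gradients.append(torch.full((num_params,), 1e6))  # a Byzantine gradient
server_step(model, worker_gradients)
</code>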

[[https://github.com/LPD-EPFL/Garfield|Code]]

===== FeGAN =====

Designed for the Middleware '20 paper: "//FeGAN: Scaling Distributed GANs//."

FeGAN is a system for training generative adversarial networks (GANs) in the federated learning setup. FeGAN has a scalable design while also being robust to non-iid data (i.e., it tolerates skewed distributions of data across devices). FeGAN makes three important design choices to achieve its goals: (1) co-locating the discriminator and generator networks on all devices, (2) balanced sampling, and (3) KL-weighting. The first decision promotes the scalability of FeGAN and reduces the probability of running into the vanishing-gradients problem. Balanced sampling keeps FeGAN from falling into the mode-collapse problem, while KL-weighting is designed to resist learning divergence.
Unlike existing distributed GAN approaches, FeGAN scales to hundreds of devices. Moreover, FeGAN achieves a 5x throughput gain while using 1.5x less bandwidth compared to its state-of-the-art competitor, MD-GAN. It also speeds up training by 2.6x compared to the celebrated Federated Averaging (FedAvg) algorithm.
The source code of FeGAN was evaluated by experts and was awarded ACM artifact badges for being functional and reusable.
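
As a rough illustration of the KL-weighting idea (not FeGAN's actual code; the exact formulas in the paper may differ), the sketch below weighs each device by how close its local label distribution is to the global one:

<code python>
# Hypothetical illustration of KL-weighting, in the spirit of FeGAN:
# devices whose label distribution is close to the global distribution
# receive a larger weight. Not FeGAN's actual implementation.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete label distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_weights(device_label_counts):
    """Compute one normalized weight per device from its label counts."""
    counts = np.asarray(device_label_counts, dtype=float)
    global_dist = counts.sum(axis=0) / counts.sum()
    divergences = np.array([kl_divergence(c, global_dist) for c in counts])
    scores = np.exp(-divergences)        # smaller divergence -> larger weight
    return scores / scores.sum()

# Usage: three devices with skewed (non-iid) label counts over 4 classes.
counts = [[100, 5, 5, 5], [5, 100, 5, 5], [30, 30, 30, 30]]
print(kl_weights(counts))  # the balanced device gets the largest weight
</code>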

[[https://github.com/LPD-EPFL/FeGAN|Code]]

===== AggregaThor =====

Designed for the MLSys '19 paper: "//AggregaThor: Byzantine Machine Learning via Robust Gradient Aggregation//."

AggregaThor is the first scalable Byzantine-resilient framework for distributed machine learning applications. AggregaThor is built on top of TensorFlow while achieving transparency: applications built with TensorFlow do not need to change their interfaces to be made Byzantine-resilient. AggregaThor uses the parameter server architecture and adds two main layers to vanilla TensorFlow: (1) an aggregation layer and (2) a communication layer. The former uses a statistically robust gradient aggregation rule, called Multi-Krum, to aggregate workers' gradients, ensuring convergence of training even in the presence of malicious workers. The communication layer lets users experiment with an unreliable transport layer (i.e., UDP), which achieves better performance than vanilla TensorFlow in highly saturated networks. The source code of AggregaThor was evaluated by experts and was awarded ACM artifact badges for being functional and reusable.
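
For illustration, here is a simplified NumPy sketch of the Multi-Krum rule. AggregaThor's own implementation is an optimized TensorFlow operator, so this helper is only a didactic approximation.

<code python>
# Simplified sketch of the Multi-Krum aggregation rule (didactic only;
# not AggregaThor's optimized TensorFlow implementation).
import numpy as np

def multi_krum(gradients, f, m):
    """Select the m gradients with the best Krum scores and average them.

    gradients: array of shape (n, d), one flattened gradient per worker
    f: assumed maximum number of Byzantine workers (requires n >= 2f + 3)
    m: number of gradients to keep (m = 1 recovers plain Krum)
    """
    grads = np.asarray(gradients, dtype=float)
    n = len(grads)
    assert n >= 2 * f + 3, "Multi-Krum needs n >= 2f + 3 workers"
    # Pairwise squared Euclidean distances between worker gradients.
    dists = np.sum((grads[:, None, :] - grads[None, :, :]) ** 2, axis=-1)
    scores = []
    for i in range(n):
        others = np.delete(dists[i], i)
        closest = np.sort(others)[: n - f - 2]   # n - f - 2 nearest neighbours
        scores.append(closest.sum())
    selected = np.argsort(scores)[:m]
    return grads[selected].mean(axis=0)

# Usage: 7 honest gradients clustered around 1.0 plus 2 malicious ones (f = 2).
rng = np.random.default_rng(0)
honest = rng.normal(0.0, 0.1, size=(7, 4)) + 1.0
byzantine = np.full((2, 4), 50.0)
aggregated = multi_krum(np.vstack([honest, byzantine]), f=2, m=3)
print(aggregated)  # close to the honest mean (~1.0 per coordinate)
</code>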

[[https://github.com/LPD-EPFL/AggregaThor|Code]]
  
===== MVTIL =====