(Contact: Chirag Shetty - [email protected] )
Deep Neural Networks can be challenging or impossible to train when devices have limited memory or the models are large. Splitting the model graph across multiple devices is, today, largely heuristic-based and manual. We present the Baechi system, in which we adopt an algorithmic approach to the placement problem for running machine learning training graphs on a small cluster of memory-constrained devices. Baechi, built over both PyTorch and TensorFlow, automatically and optimally splits the model, given a number of GPU devices and their memory capacities.
This document outlines the design and how to use Baechi-PyTorch.
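At a high level, the workflow is: hand Baechi a model, the available GPU devices, and their memory capacities, and then train the returned (placed) model as usual. The snippet below is only a hedged sketch of that idea; `baechi.place_model` and its argument names are hypothetical placeholders, not the actual API, and the real entry points are the training scripts under main_training, described below.

```python
# Hypothetical usage sketch -- NOT the real Baechi-PyTorch API.
# It only illustrates the inputs and outputs described above: a model, the
# available GPUs, and per-device memory caps go in; a placed model comes out.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
devices = ["cuda:0", "cuda:1"]
memory_caps = {d: 8 * 1024**3 for d in devices}   # e.g. 8 GiB usable per GPU

# placed_model = baechi.place_model(model, devices=devices,      # hypothetical
#                                   memory_caps=memory_caps)     # call and names
# Training then proceeds as usual: the placer has already assigned each
# operator to a device and the required cross-device transfers are inserted.
```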
Contents:
BaechiPyTorch
|
|___ main_training : training scripts for your provided model
|       |__ calibrations : scripts to measure GPU-to-GPU communication speed and GPU compute speed (a minimal measurement sketch follows this tree)
|
|___ model_library : models to be used with Baechi must be placed here
|
|___ baechi_core : Baechi implementation; the placer library was borrowed from baechi-tensorflow and modified
|
|___ experiment_scripts : bash scripts to recreate the experiments reported in the paper
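The calibrations directory above holds scripts that measure GPU-to-GPU communication speed and GPU compute speed. As a rough standalone illustration of the first measurement (not the repository's own code), inter-GPU copy bandwidth can be timed like this:

```python
# Standalone sketch: timing GPU-to-GPU copy bandwidth (illustrative only; the
# repository's real calibration scripts live under main_training/calibrations).
import time
import torch

def measure_copy_bandwidth(src="cuda:0", dst="cuda:1", size_mb=256, iters=20):
    """Return the measured device-to-device copy bandwidth in GB/s."""
    x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device=src)
    for _ in range(3):                   # warm-up: exclude one-time setup costs
        x.to(dst)
    torch.cuda.synchronize(torch.device(src))
    torch.cuda.synchronize(torch.device(dst))

    start = time.perf_counter()
    for _ in range(iters):
        x.to(dst)
    torch.cuda.synchronize(torch.device(src))
    torch.cuda.synchronize(torch.device(dst))
    elapsed = time.perf_counter() - start

    return (iters * size_mb / 1024.0) / elapsed   # GB moved per second

if __name__ == "__main__":
    if torch.cuda.device_count() >= 2:
        print(f"cuda:0 -> cuda:1 bandwidth: {measure_copy_bandwidth():.2f} GB/s")
```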
Why is Model Parallelism hard in PyTorch?
Baechi_pytorch_system_design.pdf (PDF here: link)
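As a brief illustration of the question above: vanilla PyTorch has no notion of a placement plan. To split a model across GPUs by hand, the user must assign every submodule to a device and insert the matching tensor transfers in forward(), and must redo this whenever the model or the memory budget changes; a missed transfer fails with a device-mismatch error. Baechi automates exactly this assignment. The example below is illustrative hand-written model parallelism, not Baechi code, and assumes at least two visible GPUs.

```python
# Hand-written model parallelism in vanilla PyTorch (illustrative, not Baechi code).
# Every device assignment and every cross-GPU transfer must be written manually.
import torch
import torch.nn as nn

class ManuallySplitNet(nn.Module):
    def __init__(self):
        super().__init__()
        # The user decides by hand which part of the network lives on which GPU.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Forgetting this transfer raises a device-mismatch RuntimeError.
        x = x.to("cuda:1")
        return self.part2(x)

model = ManuallySplitNet()
out = model(torch.randn(32, 1024))   # output lives on cuda:1
loss = out.sum()
loss.backward()                      # autograd handles the cross-device backward
```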