(Contact: Chirag Shetty - [email protected] )
Deep Neural Networks can be challenging or impossible to train when devices have limited memory or the models are large. Splitting the model graph across multiple devices is, today, largely heuristic-based and manual. We present the Baechi system, in which we adopt an algorithmic approach to the placement problem for running machine learning training graphs on a small cluster of memory-constrained devices. Baechi, built over both PyTorch and TensorFlow, automatically and optimally splits the model, given a number of GPU devices and their memory capacities.
This document outlines the design and how to use Baechi-PyTorch.
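At a high level, the workflow is: hand Baechi a model, the available GPU devices, and their memory capacities, and then train the returned (placed) model as usual. The snippet below is only a hedged sketch of that idea; `baechi.place_model` and its argument names are hypothetical placeholders, not the actual API, and the real entry points are the training scripts under main_training, described below.

```python
# Hypothetical usage sketch -- NOT the real Baechi-PyTorch API.
# It only illustrates the inputs and outputs described above: a model, the
# available GPUs, and per-device memory caps go in; a placed model comes out.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
devices = ["cuda:0", "cuda:1"]
memory_caps = {d: 8 * 1024**3 for d in devices}   # e.g. 8 GiB usable per GPU

# placed_model = baechi.place_model(model, devices=devices,      # hypothetical
#                                   memory_caps=memory_caps)     # call and names
# Training then proceeds as usual: the placer has already assigned each
# operator to a device and the required cross-device transfers are inserted.
```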
Contents:
BaechiPyTorch
|
|___ main_training : training scripts for your provided model
|       |__ calibrations : scripts to measure GPU-to-GPU communication speed and GPU compute speed (a minimal measurement sketch follows this tree)
|
|___ model_library : models to be used with Baechi must be placed here
|
|___ baechi_core : Baechi implementation; the placer library was borrowed from baechi-tensorflow and modified
|
|___ experiment_scripts : bash scripts to recreate the experiments reported in the paper
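The calibrations directory above holds scripts that measure GPU-to-GPU communication speed and GPU compute speed. As a rough standalone illustration of the first measurement (not the repository's own code), inter-GPU copy bandwidth can be timed like this:

```python
# Standalone sketch: timing GPU-to-GPU copy bandwidth (illustrative only; the
# repository's real calibration scripts live under main_training/calibrations).
import time
import torch

def measure_copy_bandwidth(src="cuda:0", dst="cuda:1", size_mb=256, iters=20):
    """Return the measured device-to-device copy bandwidth in GB/s."""
    x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device=src)
    for _ in range(3):                   # warm-up: exclude one-time setup costs
        x.to(dst)
    torch.cuda.synchronize(torch.device(src))
    torch.cuda.synchronize(torch.device(dst))

    start = time.perf_counter()
    for _ in range(iters):
        x.to(dst)
    torch.cuda.synchronize(torch.device(src))
    torch.cuda.synchronize(torch.device(dst))
    elapsed = time.perf_counter() - start

    return (iters * size_mb / 1024.0) / elapsed   # GB moved per second

if __name__ == "__main__":
    if torch.cuda.device_count() >= 2:
        print(f"cuda:0 -> cuda:1 bandwidth: {measure_copy_bandwidth():.2f} GB/s")
```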
Why is Model Parallelism hard in PyTorch?
Baechi_pytorch_system_design.pdf (PDF here: link)
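As a brief illustration of the question above: vanilla PyTorch has no notion of a placement plan. To split a model across GPUs by hand, the user must assign every submodule to a device and insert the matching tensor transfers in forward(), and must redo this whenever the model or the memory budget changes; a missed transfer fails with a device-mismatch error. Baechi automates exactly this assignment. The example below is illustrative hand-written model parallelism, not Baechi code, and assumes at least two visible GPUs.

```python
# Hand-written model parallelism in vanilla PyTorch (illustrative, not Baechi code).
# Every device assignment and every cross-GPU transfer must be written manually.
import torch
import torch.nn as nn

class ManuallySplitNet(nn.Module):
    def __init__(self):
        super().__init__()
        # The user decides by hand which part of the network lives on which GPU.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Forgetting this transfer raises a device-mismatch RuntimeError.
        x = x.to("cuda:1")
        return self.part2(x)

model = ManuallySplitNet()
out = model(torch.randn(32, 1024))   # output lives on cuda:1
loss = out.sum()
loss.backward()                      # autograd handles the cross-device backward
```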