The need for faster processing of large amounts of data, in some cases with a notion of locality, drives the design of distributed approaches to machine learning. Two common considerations are: i) which parts of a computation to distribute and reduce/synchronize, and ii) how to minimize communication latencies.
These considerations pose an optimization problem in the “infrastructure vs. model architecture” search space.
To effectively traverse this search space, it is necessary to:
- iterate across infrastructures and model architectures, and
- measure the “fit” between an infrastructure and a model architecture.
Towards this goal, as an Insight Fellow, I developed:
- a one-step provisioning of an on-premises capable k8s infrastructure to streamline infrastructure iteration, and
- a set of proof-of-concept modules for constructing objective functions based on latency tracing
see further instructions in infrastructure-as-opt/infrastructure.
see further instructions in infrastructure-as-opt/{cpu, gpu}-build-jobs.