PyTorch/XLA 2.0 release
Cloud TPUs now support the PyTorch 2.0 release via the PyTorch/XLA integration. On top of the underlying improvements and bug fixes in the PyTorch 2.0 release, this release introduces several new features and PyTorch/XLA-specific bug fixes.
Beta Features
PJRT runtime
- Check out our newest document; PJRT is the default runtime in 2.0 (a minimal usage sketch follows this list).
- New implementation of xm.rendezvous using XLA collective communication, which scales better (#4181)
- New PJRT TPU backend through the C-API (#4077)
- Default to PJRT if no runtime is configured (#4599)
- Experimental support for torch.distributed and DDP on TPU v2 and v3 (#4520)
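As a minimal sketch of the above (not taken from the release itself), the snippet below runs a small DDP step on TPU under PJRT; the `PJRT_DEVICE` setting and the `xla://` init method follow the PJRT/DDP guides, and the exact values are assumptions for your environment.

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' process group backend
import torch_xla.distributed.xla_multiprocessing as xmp

os.environ.setdefault("PJRT_DEVICE", "TPU")  # PJRT is the default runtime in 2.0


def _mp_fn(index):
    dist.init_process_group("xla", init_method="xla://")
    device = xm.xla_device()
    model = DDP(nn.Linear(128, 10).to(device), gradient_as_bucket_view=True)
    x = torch.randn(32, 128, device=device)
    model(x).sum().backward()
    xm.mark_step()  # materialize the lazily traced graph


if __name__ == "__main__":
    xmp.spawn(_mp_fn)
```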
FSDP
- Add auto_wrap_policy into XLA FSDP for automatic wrapping (#4318)
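A hedged sketch of what auto-wrapping looks like with XLA FSDP; the `size_based_auto_wrap_policy` import path and the 1M-parameter threshold are assumptions to check against the FSDP docs.

```python
from functools import partial

import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP
from torch_xla.distributed.fsdp.wrap import size_based_auto_wrap_policy

device = xm.xla_device()
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)

# Automatically wrap submodules above the parameter threshold as nested FSDP
# units, instead of wrapping each child module by hand.
policy = partial(size_based_auto_wrap_policy, min_num_params=10**6)
model = FSDP(model, auto_wrap_policy=policy)
```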
Stable Features
Lazy Tensor Core Migration
- Migration is complete; check out this dev discussion for more details.
- Naively inherits LazyTensor (#4271)
- Adopt even more LazyTensor interfaces (#4317)
- Introduce XLAGraphExecutor (#4270)
- Inherits LazyGraphExecutor (#4296)
- Adopt more LazyGraphExecutor virtual interfaces (#4314)
- Roll back to using xla::Shape instead of torch::lazy::Shape (#4111)
- Use TORCH_LAZY_COUNTER/METRIC (#4208)
Improvements & Additions
- Add an option to increase the worker thread efficiency for data loading (#4727)
- Improve numerical stability of torch.sigmoid (#4311)
- Add an API to clear counters and metrics (#4109)
- Add met.short_metrics_report to display a more concise metrics report (#4148; see the sketch after this list)
- Document environment variables (#4273)
- Op Lowering
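For the metrics items above, here is a short sketch of the debugging helpers; the `met.*` function names are as we understand them in `torch_xla.debug.metrics` and should be verified against the linked documentation.

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
y = torch.randn(4, 4, device=device) @ torch.randn(4, 4, device=device)
xm.mark_step()  # force compilation/execution so counters are populated

print(met.short_metrics_report())  # concise report instead of the full dump
met.clear_counters()               # reset counters between experiments
met.clear_metrics()                # reset timing metrics as well
```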
Experimental Features
TorchDynamo (torch.compile) support
- Check out our newest doc (a minimal torch.compile sketch follows this list).
- Dynamo bridge Python binding (#4119)
- Dynamo bridge backend implementation (#4523)
- Training optimization: make execution async (#4425)
- Training optimization: reduce graph execution per step (#4523)
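A minimal torch.compile sketch for the Dynamo bridge; the backend string `torchxla_trace_once` is the inference backend name used around this release (an assumption to verify against the linked doc, since later releases rename it).

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(64, 64).to(device).eval()

# Compile the forward function through the Dynamo -> XLA bridge.
compiled = torch.compile(lambda x: model(x), backend="torchxla_trace_once")

x = torch.randn(8, 64, device=device)
with torch.no_grad():
    out = compiled(x)
```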
PyTorch/XLA GSPMD on a single host
- Preserve parameter sharding with sharded data placeholder (#4721)
- Transfer shards from server to host (#4508)
- Store the sharding annotation within XLATensor (#4390)
- Use d2d replication for more efficient input sharding (#4336)
- Mesh to support custom device order (#4162)
- Introduce virtual SPMD device to avoid unpartitioned data transfer (#4091)
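Below is a sketch of single-host GSPMD sharding under this experimental API; the `torch_xla.experimental.xla_sharding` module path, the `XLA_USE_SPMD` flag, and the 2x4 mesh are assumptions to verify against the SPMD documentation.

```python
import os
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs
from torch_xla.experimental.xla_sharding import Mesh

os.environ["XLA_USE_SPMD"] = "1"  # route tensors through the virtual SPMD device

num_devices = 8  # e.g. a single TPU v3-8 host
# Mesh accepts a custom device order; the default ordering is used here.
mesh = Mesh(np.arange(num_devices), (2, 4), ("x", "y"))

t = torch.randn(16, 128, device=xm.xla_device())
# Shard dim 0 across mesh axis 0 and dim 1 across mesh axis 1.
xs.mark_sharding(t, mesh, (0, 1))
```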
Ongoing development
Ongoing Dynamic Shape implementation
- Implement missing XLASymNodeImpl::Sub (#4551)
- Make empty_symint support dynamism (#4550)
- Add dynamic shape support to SigmoidBackward (#4322)
- Add a forward pass NN model with dynamism test (#4256)