Releases · ml-explore/mlx
v0.10.0
Highlights
- Improvements for LLM generation
  - Reshapeless quant matmul/matvec
  - `mx.async_eval` (sketch below)
  - Async command encoding
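A minimal sketch of how `mx.async_eval` can be used to overlap work during token generation; the `next_token` function below is a hypothetical stand-in for a real decoding step.

```python
import mlx.core as mx

# Hypothetical stand-in for a real decoding step.
def next_token(y):
    return (y + 1) % 32000

y = next_token(mx.array(0))
mx.async_eval(y)              # start evaluating without blocking
for _ in range(8):
    y_next = next_token(y)
    mx.async_eval(y_next)     # queue the next step before consuming y
    print(y.item())           # blocks only until y itself is ready
    y = y_next
```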
Core
- Slightly faster reshapeless quantized gemms
- Option for precise softmax
- `mx.metal.start_capture` and `mx.metal.stop_capture` for GPU debug/profile (sketch below)
- `mx.expm1`
- `mx.std`
- `mx.meshgrid`
- `mx.random.multivariate_normal` (CPU only)
- `mx.cumsum` (and other scans) for `bfloat16`
- Async command encoder with explicit barriers / dependency management
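A sketch of capturing a GPU trace around a workload; the trace path is an arbitrary example, and capture generally requires running the process with `METAL_CAPTURE_ENABLED=1`.

```python
import mlx.core as mx

a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))
mx.eval(a, b)

# "mlx_trace.gputrace" is an arbitrary example output path.
mx.metal.start_capture("mlx_trace.gputrace")
mx.eval(a @ b)
mx.metal.stop_capture()
```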
NN
- `nn.Upsample` supports bicubic interpolation
Misc
- Updated MLX Extension to work with nanobind
Bugfixes
- Fix buffer donation in softmax and fast ops
- Bug in layer norm vjp
- Bug initializing from lists with scalar
- Bug in indexing
- CPU compilation bug
- Multi-output compilation bug
- Fix stack overflow issues in eval and array destruction
v0.9.0
Highlights:
- Fast partial RoPE (used by Phi-2)
- Fast gradients for RoPE, RMSNorm, and LayerNorm
  - Up to 7x faster (benchmarks)
Core
- More overhead reductions
- Partial fast RoPE (fast Phi-2)
- Better buffer donation for copy
- Type hierarchy and issubdtype
- Fast VJPs for RoPE, RMSNorm, and LayerNorm
NN
- `Module.set_dtype`
- Chaining in `nn.Module` (e.g. `model.freeze().update(…)`, sketch below)
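A small sketch of `Module.set_dtype` and method chaining; the tiny `nn.Sequential` model is just an example.

```python
import mlx.core as mx
import mlx.nn as nn

# Example model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

# Cast all parameters, e.g. for half-precision inference.
model.set_dtype(mx.float16)

# Module methods return the module, so calls chain: freeze everything,
# then unfreeze just the biases for fine-tuning.
model.freeze().unfreeze(keys="bias")
```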
Bugfixes
- Fix set item bugs
- Fix scatter vjp
- Check shape integer overflow on array construction
- Fix bug with module attributes
- Fix two bugs for odd shaped QMV
- Fix GPU sort for large sizes
- Fix bug in negative padding for convolutions
- Fix bug in multi-stream race condition for graph evaluation
- Fix random normal generation for half precision
v0.8.0
Highlights
- More perf!
  - `mx.fast.rms_norm` and `mx.fast.layer_norm`
  - Switch to Nanobind substantially reduces overhead
  - Up to 4x faster `__setitem__` (e.g. `a[...] = b`)
Core
- `mx.inverse`, CPU only
- vmap over `mx.matmul` and `mx.addmm`
- Switch to nanobind from pybind11
- Faster setitem indexing
- `mx.fast.rms_norm`, token generation benchmark (sketch below)
- `mx.fast.layer_norm`, token generation benchmark
- vmap for inverse and svd
- Faster non-overlapping pooling
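Sketch of the fused normalization ops; the shapes and `eps` value are arbitrary examples.

```python
import mlx.core as mx

x = mx.random.normal((4, 128, 512))
weight = mx.ones((512,))
bias = mx.zeros((512,))

# Fused RMS norm and layer norm over the last axis.
y_rms = mx.fast.rms_norm(x, weight, eps=1e-5)
y_ln = mx.fast.layer_norm(x, weight, bias, eps=1e-5)
mx.eval(y_rms, y_ln)
```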
Optimizers
- Set minimum value in cosine decay scheduler
Bugfixes
- Fix bug in multi-dimensional reduction
v0.7.0
Highlights
- Perf improvements for attention ops:
- No copy broadcast matmul (benchmarks)
- Fewer copies in reshape
Core
- Faster broadcast + gemm
- `mx.linalg.svd` (CPU only)
- Fewer copies in reshape
- Faster small reductions
NN
- `nn.RNN`, `nn.LSTM`, `nn.GRU` (sketch below)
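A quick sketch of the new recurrent layers; the sizes are arbitrary and the input is batched as (batch, time, features).

```python
import mlx.core as mx
import mlx.nn as nn

x = mx.random.normal((2, 10, 32))   # (batch, time, features)

rnn = nn.RNN(input_size=32, hidden_size=64)
lstm = nn.LSTM(input_size=32, hidden_size=64)
gru = nn.GRU(input_size=32, hidden_size=64)

h = rnn(x)                # hidden state at every time step
h_lstm, c_lstm = lstm(x)  # LSTM also returns the cell states
h_gru = gru(x)
mx.eval(h, h_lstm, c_lstm, h_gru)
```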
Bugfixes
- Fix bug in depth traversal ordering
- Fix two edge case bugs in compilation
- Fix bug with modules with dictionaries of weights
- Fix bug with scatter which broke MOE training
- Fix bug with compilation kernel collision
v0.6.0
Highlights:
- Faster quantized matrix-vector multiplies
- `mx.fast.scaled_dot_product_attention` fused op
Core
- Memory allocation API improvements
- Faster GPU reductions for smaller sizes (between 2 and 7x)
- `mx.fast.scaled_dot_product_attention` fused op (sketch below)
- Faster quantized matrix-vector multiplications
- Pickle support for `mx.array`
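Sketch of the fused attention op; the batch, head, and sequence sizes are arbitrary examples.

```python
import math
import mlx.core as mx

B, H, L, D = 1, 8, 128, 64
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))

# Single fused kernel instead of separate matmul/softmax/matmul ops.
out = mx.fast.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(D))
mx.eval(out)
```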
NN
- Dilation on convolution layers
Bugfixes
- Fix `mx.topk`
- Fix reshape for zero sizes
v0.5.0
Highlights:
- Faster convolutions.
- Up to 14x faster for some common sizes.
- See benchmarks
Core
- `mx.where` properly handles `inf`
- Faster and more general convolutions
  - Input and kernel dilation
  - Asymmetric padding
  - Support for cross-correlation and convolution
- `atleast_{1,2,3}d` accept any number of arrays
NN
- `nn.Upsample` layer (sketch below)
  - Supports nearest neighbor and linear interpolation
  - Any number of dimensions
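Sketch of the new layer on a channels-last image batch; the shapes and scale factor are arbitrary examples.

```python
import mlx.core as mx
import mlx.nn as nn

x = mx.random.normal((1, 8, 8, 3))   # (batch, height, width, channels)

up_nearest = nn.Upsample(scale_factor=2, mode="nearest")
up_linear = nn.Upsample(scale_factor=2, mode="linear")

print(up_nearest(x).shape)  # (1, 16, 16, 3)
print(up_linear(x).shape)   # (1, 16, 16, 3)
```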
Optimizers
- Linear schedule and schedule joiner
  - Use for e.g. linear warmup + cosine decay (sketch below)
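Sketch of joining a linear warmup with cosine decay; the step counts and learning rates are arbitrary example values.

```python
import mlx.optimizers as optim

warmup = optim.linear_schedule(0.0, 1e-3, steps=100)
decay = optim.cosine_decay(1e-3, decay_steps=1000)

# Switch from warmup to decay at step 100.
lr_schedule = optim.join_schedules([warmup, decay], boundaries=[100])
optimizer = optim.Adam(learning_rate=lr_schedule)
```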
Bugfixes
- `arange` throws on `inf` inputs
- Fix CMake build with MLX
- Fix `logsumexp` `inf` edge case
- Fix grad of power w.r.t. the exponent edge case
- Fix compile with `inf` constants
- Fix temporary bug in convolution
v0.4.0
Highlights:
- Partial shapeless compilation
  - Default shapeless compilation for all activations
  - Can be more than 5x faster than uncompiled versions
- CPU kernel fusion
  - Some functions can be up to 10x faster
Core
- CPU compilation
- Shapeless compilation for some cases: `mx.compile(function, shapeless=True)` (sketch below)
- Up to 10x faster scatter: benchmarks
- `mx.atleast_1d`, `mx.atleast_2d`, `mx.atleast_3d`
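Sketch of shapeless compilation with an elementwise function; `gelu` here is just an example.

```python
import math
import mlx.core as mx

def gelu(x):
    return x * (1 + mx.erf(x / math.sqrt(2))) / 2

# shapeless=True reuses the compiled graph across input shapes
# instead of retracing for every new shape.
cgelu = mx.compile(gelu, shapeless=True)

print(cgelu(mx.random.normal((4, 16))).shape)   # (4, 16)
print(cgelu(mx.random.normal((8, 32))).shape)   # (8, 32), no recompile
```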
Bugfixes
- Bug with `tolist` with `bfloat16` and `float16`
- Bug with `argmax` on M3
v0.3.0
Highlights:
- `mx.fast` subpackage
- Custom `mx.fast.rope` up to 20x faster
Core
- Support metadata with `safetensors`
- Up to 5x faster scatter and 30% faster gather
- 40% faster `bfloat16` quantized matrix-vector multiplies
- `mx.fast` subpackage with a fast RoPE
- Context manager `mx.stream` to set the default device (sketch below)
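Sketch of the `mx.stream` context manager; the matmul is an arbitrary example workload.

```python
import mlx.core as mx

a = mx.random.normal((256, 256))
b = mx.random.normal((256, 256))

# Ops inside the context run on the CPU stream by default.
with mx.stream(mx.cpu):
    c = a @ b

d = a @ b   # back on the default device
mx.eval(c, d)
```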
NN
- Average and Max pooling layers for 1D and 2D inputs
Optimizers
- Support schedulers for e.g. learning rates (sketch below)
- A few basic schedulers:
  - `optimizers.step_decay`
  - `optimizers.cosine_decay`
  - `optimizers.exponential_decay`
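Sketch of attaching a schedule to an optimizer; the decay parameters are arbitrary example values.

```python
import mlx.optimizers as optim

lr_schedule = optim.step_decay(1e-3, decay_rate=0.9, step_size=100)
optimizer = optim.SGD(learning_rate=lr_schedule)

print(lr_schedule(0))     # 1e-3 at step 0
print(lr_schedule(500))   # decayed after 500 steps
```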
Bugfixes
- Fix bug in remainder with negative numerators and integers
- Fix bug with slicing into softmax
- Fix quantized matmuls for sizes that are not multiples of 32
v0.2.0
Highlights:
- `mx.compile` makes stuff go fast
  - Some functions are up to 10x faster (benchmarks)
  - Training models anywhere from 10% to twice as fast (benchmarks)
  - Simple syntax for compiling full training steps
Core
- `mx.compile` function transformation (sketch below)
- Find devices properly for iOS
- Up to 10x faster GPU gather
- `__abs__` overload for `abs` on arrays
- `loc` and `scale` parameters for `mx.random.normal`
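Minimal sketch of `mx.compile`; the loss function is just an example.

```python
import mlx.core as mx

def loss(x, y):
    return mx.mean((x - y) ** 2)

# The first call traces and compiles the graph; later calls reuse it.
closs = mx.compile(loss)

x = mx.random.normal((1024,))
y = mx.random.normal((1024,))
print(closs(x, y))
```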
NN
- Margin ranking loss
- BCE loss with weights
Bugfixes
- Fix for broken eval during function transformations
- Fix `mx.var` to give `inf` with `ddof >= nelem`
- Fix loading empty modules in `nn.Sequential`
v0.1.0
Highlights
- Memory use improvements:
  - Gradient checkpointing for training with `mx.checkpoint`
  - Better graph execution order
  - Buffer donation
Core
- Gradient checkpointing with `mx.checkpoint` (sketch below)
- CPU only QR factorization `mx.linalg.qr`
- Release Python GIL during `mx.eval`
- Depth-based graph execution order
- Lazy loading arrays from files
- Buffer donation for reduced memory use
- `mx.diag`, `mx.diagonal`
- Breaking: `array.shape` is a Python tuple
- GPU support for `int64` and `uint64` reductions
- vmap over reductions and arg reduction: `sum`, `prod`, `max`, `min`, `all`, `any`, `argmax`, `argmin`
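Sketch of gradient checkpointing with `mx.checkpoint`; `block` is an arbitrary example function.

```python
import mlx.core as mx

def block(x):
    return mx.tanh(x @ x.T)

def loss(x):
    # Activations inside the checkpointed block are recomputed in the
    # backward pass instead of being stored, reducing memory use.
    return mx.sum(mx.checkpoint(block)(x))

x = mx.random.normal((64, 64))
grads = mx.grad(loss)(x)
mx.eval(grads)
```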
NN
- Softshrink activation
Bugfixes
- Comparisons with `inf` work, and fix `mx.isinf`
- Bug fix with RoPE cache
- Handle empty Matmul on the CPU
- Negative shape checking for `mx.full`
- Correctly propagate `NaN` in some binary ops: `mx.logaddexp`, `mx.maximum`, `mx.minimum`
- Fix > 4D non-contiguous binary ops
- Fix `mx.log1p` with `inf` input
- Fix SGD to apply weight decay even with 0 momentum