Releases: aai-institute/pyDVL
v0.10.0
v0.10.0 - 💥📚🐞🆕 New valuation interface, improved docs, new methods, breaking changes and tons of improvements
After lots of work, bug-fixing, bug-introducing, fixing again, and a good measure of bike shedding, we bring a major update putting us closer to the final APIs. The main goals of this release were to improve usability, documentation, and extensibility.
- We have added a new module `pydvl.valuation`. The `pydvl.value` module is deprecated and will be removed in the next release. The new interface allows for a more consistent and flexible way to define and use valuation methods. It also simplifies experimentation, manipulation of results and data, as well as parallelization.
- We have many improvements to the `influence` module including several new methods and approximations.
- The whole documentation has been improved and consolidated, with detailed method descriptions and examples. See pydvl.org.
Added
- Simple result serialization to resume computation of values PR #666
- Simple memory monitor / reporting PR #663
- New stopping criterion `MaxSamples` PR #661
- Introduced `UtilityModel` and two implementations `IndicatorUtilityModel` and `DeepSetsUtilityModel` for data utility learning PR #650
- Introduced the concept of `ResultUpdater` in order to allow samplers to declare the proper strategy to use by valuations PR #641
- Added Banzhaf precomputed values to some games. PR #641
- Introduced new `IndexIterations`, for consistent usage across all `PowersetSamplers` PR #641
- Added `run_removal_experiment` for easy removal experiments PR #636
- Refactor Classwise Shapley valuation with the interfaces and sampler architecture PR #616
- Refactor KNN Shapley values with the new interface PR #610 PR #645
- Refactor MSR Banzhaf semivalues with the new sampler architecture. PR #605 PR #641
- Refactor group-testing Shapley values with new sampler architecture PR #602
- Refactor least-core data valuation methods with more supported sampling methods and consistent interface. PR #580
- Refactor Owen-Shapley valuation with new sampler architecture. Enable use of `OwenSamplers` with all semi-values PR #597 PR #641
- New method `InverseHarmonicMeanInfluence`, implementation for the paper "DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models" PR #582
- Add new backend implementations for influence computation to account for block-diagonal approximations PR #582
- Extend `DirectInfluence` with block-diagonal and Gauss-Newton approximation PR #591
- Extend `LissaInfluence` with block-diagonal and Gauss-Newton approximation PR #593
- Extend `NystroemSketchInfluence` with block-diagonal and Gauss-Newton approximation PR #596
- Extend `ArnoldiInfluence` with block-diagonal and Gauss-Newton approximation PR #598
- Extend `CgInfluence` with block-diagonal and Gauss-Newton approximation PR #601
Fixed
- Fixed `show_warnings=False` not being respected in subprocesses. Introduced `suppress_warnings` decorator for more flexibility PR #647 PR #662
- Fixed several bugs in diverse stopping criteria, including: iteration counts, computing completion, resetting, nested composition PR #641 PR #650
- Fixed all weights of all samplers to ensure that mix-and-matching samplers and semi-value methods always works, for all possible combinations PR #641
- Fixed a bug whereby progress bars would not report the last step and remain incomplete PR #641
- Fixed the analysis of the adult dataset in the Data-OOB notebook PR #636
- Replace `np.float_` with `np.float64` and `np.alltrue` with `np.all`, as the old aliases are removed in NumPy 2.0 PR #604
- Fix a bug in `pydvl.utils.numeric.random_subset` where `1 - q` was used instead of `q` as the probability of an element being sampled PR #597
- Fix a bug in the calculation of variance estimates for MSR Banzhaf PR #605
- Fix a bug in KNN Shapley values. See Issue 613 for details.
- Backport the KNN Shapley fix to the `value` module PR #633
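The `random_subset` fix listed above is about which probability governs inclusion: each element should be kept with probability `q`, not `1 - q`. The following is a minimal pure-Python sketch of that intended behaviour. The function name `bernoulli_subset` and its signature are illustrative assumptions and are not pyDVL's API.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def bernoulli_subset(
    items: Sequence[T], q: float, seed: Optional[int] = None
) -> List[T]:
    """Keep each element independently with probability q (hypothetical helper)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so `rng.random() < q` holds with
    # probability q. Comparing against 1 - q instead would invert the
    # sampling probability, which is the kind of bug described above.
    return [x for x in items if rng.random() < q]
```

With `q=0.2` the expected subset size is 20% of the input, whereas the inverted comparison would return roughly 80%.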
Changed
- Slicing, comparing and setting of `ValuationResult` behave in a more natural and consistent way PR #660 PR #666
- Switched all semi-value coefficients and sampler weights to log-space in order to avoid overflows PR #643
- Updated and rewrote some of the MSR Banzhaf notebook PR #641
- Updated Least-Core notebook PR #641
- Updated Shapley Spotify notebook PR #628
- Updated Data Utility notebook PR #650
- Restructured and generalized `StratifiedSampler` to allow using heuristics, thus subsuming Variance-Reduced stratified sampling into a unified framework. Implemented the heuristics proposed in that paper PR #641
- Uniformly distribute test points across processes for KNNShapley. Fail for `GroupedDataset` PR #632
- Introduced the concept of logical vs data indices for `Dataset` and `GroupedDataset`, fixing inconsistencies in how the latter operates on indices. Also, both now return objects of the same type when slicing. PR #631 PR #648
- Use tighter bounds for the calculation of the minimal sample size that guarantees an epsilon-delta approximation in group testing (Jia et al. 2023) PR #602
- Dropped black, isort and pylint from the CI pipeline, in favour of ruff PR #633
- Breaking Changes
  - Changed `DataOOBValuation` to only accept bagged models PR #636
  - Dropped support for Python 3.8 after EOL PR #633
  - Rename parameter `hessian_regularization` of `DirectInfluence` to `regularization` and change the type annotation to allow for block-wise regularization parameters PR #591
  - Rename parameter `hessian_regularization` of `LissaInfluence` to `regularization` and change the type annotation to allow for block-wise regularization parameters PR #593
  - Remove parameter `h0` from init of `LissaInfluence` PR #593
  - Rename parameter `hessian_regularization` of `NystroemSketchInfluence` to `regularization` and change the type annotation to allow for block-wise regularization parameters PR #596
  - Renaming of parameters of `ArnoldiInfluence`, `hessian_regularization` -> `regularization` (modify type annotation), `rank_estimate` -> `rank` PR #598
  - Remove obsolete functions `lanczos_low_rank_hessian_approximation`, `model_hessian_low_rank` from `influence.torch.functional` PR #598
  - Renaming of parameters of `CgInfluence`, `hessian_regularization` -> `regularization` (modify type annotation), `pre_conditioner` -> `preconditioner`, `use_block_cg` -> `solve_simultaneously` PR #601
  - Remove parameter `x0` from `CgInfluence` PR #601
  - Rename module `influence.torch.pre_conditioner` -> `influence.torch.preconditioner` PR #601
  - Refactor preconditioner:
    - renaming `PreConditioner` -> `Preconditioner`
    - fit to `TensorOperator` PR #601
  - Bumped `zarr` dependency to v3 [PR #668](https://github...
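The switch of semi-value coefficients to log-space noted in the v0.10.0 changes can be illustrated with the Shapley coefficient 1 / (n · C(n−1, k)), which underflows in floating point for moderately large n. The sketch below uses only the standard library. The function names are illustrative assumptions and this is not pyDVL's implementation.

```python
import math
from typing import Sequence


def log_binom(n: int, k: int) -> float:
    """log C(n, k), computed through log-gamma to avoid huge intermediates."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_shapley_coefficient(n: int, k: int) -> float:
    """log of 1 / (n * C(n - 1, k)), the Shapley weight for a subset of
    size k drawn from the other n - 1 points."""
    return -math.log(n) - log_binom(n - 1, k)


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


n = 5000
# Each subset size k contributes C(n-1, k) subsets with the same coefficient,
# so the weights sum to 1, i.e. the log of the total is (close to) 0.
log_total = logsumexp(
    [log_binom(n - 1, k) + log_shapley_coefficient(n, k) for k in range(n)]
)
```

Working with the logarithms keeps every intermediate value in a representable range. Products of weights become sums, and sums of weights go through `logsumexp`.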
v0.9.2
0.9.2 - 🏗 Bug fixes, logging improvement
Added
- Add progress bars to the computation of `LazyChunkSequence` and `NestedLazyChunkSequence` PR #567
- Add a device fixture for `pytest`, which depending on the availability and user input (`pytest --with-cuda`) resolves to cuda device PR #574
Fixed
- Fixed logging issue in decorator `log_duration` PR #567
- Fixed missing move of tensors to model device in `EkfacInfluence` implementation PR #570
- Missing move to device of `preconditioner` in `CgInfluence` implementation PR #572
- Raise a more specific error message, when a `RuntimeError` occurs in `torch.linalg.eigh`, so the user can check if it is related to a known issue PR #578
- Fix an edge case (empty train data) in the test `test_classwise_scorer_accuracies_manual_derivation`, which resulted in undefined behavior (`np.nan` to `int` conversion with different results depending on OS) PR #579
Changed
- Changed logging behavior of iterative methods `LissaInfluence` and `CgInfluence` to warn on not achieving desired tolerance within `maxiter`, add parameter `warn_on_max_iteration` to set the level for this information to `logging.DEBUG` PR #567
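The logging change above can be pictured with a generic iterative solver that reports a missed tolerance at `WARNING` level by default and at `DEBUG` level when the flag is switched off. This is a sketch under that reading of the entry. The solver, the logger name and the exact semantics of the flag are assumptions, not the `LissaInfluence` or `CgInfluence` code.

```python
import logging

logger = logging.getLogger("iterative_demo")  # hypothetical logger name


def fixed_point_solve(g, x0, rtol=1e-10, maxiter=50, warn_on_max_iteration=True):
    """Iterate x <- g(x) until the relative change drops below rtol.

    Returns the last iterate and a flag telling whether rtol was reached.
    """
    x = x0
    for _ in range(maxiter):
        x_new = g(x)
        if abs(x_new - x) <= rtol * max(1.0, abs(x_new)):
            return x_new, True
        x = x_new
    # Tolerance not reached within maxiter: choose the log level from the flag.
    level = logging.WARNING if warn_on_max_iteration else logging.DEBUG
    logger.log(level, "Reached maxiter=%d without achieving rtol=%g", maxiter, rtol)
    return x, False
```

Calling `fixed_point_solve(lambda x: 0.5 * x + 1.0, 0.0, maxiter=3, warn_on_max_iteration=False)` stops early and only emits a debug record, which stays silent under the default logging configuration.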
v0.9.1
v0.9.0
🆕 New methods, better docs and bugfixes 📚🐞
Added
- New method MSR Banzhaf with accompanying notebook, and new stopping criterion `RankCorrelation` PR #520
- New method: `NystroemSketchInfluence` PR #504
- New preconditioned block variant of conjugate gradient PR #507
- Improvements to documentation: fixes, links, text, example gallery, LFS and more PR #532, PR #543
- Glossary of data valuation and influence terms in the documentation PR #537
- Documentation about writing notes for new features, changes or deprecations PR #557
Fixed
- Bug in `LissaInfluence`, when not using CPU device PR #495
- Memory issue with `CgInfluence` and `ArnoldiInfluence` PR #498
- Raising specific error message with install instruction when trying to load `pydvl.utils.cache.memcached` without `pymemcache` installed. If `pymemcache` is available, all symbols from `pydvl.utils.cache.memcached` are available through `pydvl.utils.cache` PR #509
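The `pymemcache` entry above follows a common pattern for optional dependencies: attempt the import and, on failure, raise an error that says how to install the missing package. A minimal sketch of that pattern follows. The helper name `require_optional` and the install hint passed to it are hypothetical and not taken from pyDVL.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as e:
        # Re-raise with the install instruction, keeping the original cause.
        raise ModuleNotFoundError(
            f"'{module_name}' is required for this feature but is not "
            f"installed. Install it with: {install_hint}"
        ) from e
```

A package can call such a helper at the top of the optional submodule, and re-export that submodule's symbols from the parent package only when the import succeeds.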
Changed
- Add property `model_dtype` to instances of type `TorchInfluenceFunctionModel`
- Bump versions of CI actions to avoid warnings PR #502
- Add Python Version 3.11 to supported versions PR #510
- Documentation improvements and cleanup PR #521, PR #522
- Simplified parallel backend configuration PR #549
New Contributors
- @jakobkruse1 made their first contribution in #510
Full Changelog: v0.8.1...v0.9.0
v0.8.1
🆕 New method and notebook, Games with exact shapley values, bug fixes and cleanup 🏗
Added
- Implement new method: EkfacInfluence #451
- New notebook to showcase ekfac for LLMs #483
- Implemented exact games in Castro et al. 2009 and 2017 #341
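Games with known exact Shapley values are useful as ground truth for testing approximation methods. As a rough illustration of the idea, the sketch below computes exact Shapley values by brute-force enumeration for a small symmetric majority game, where symmetry and efficiency force every player's value to be 1/n. The game and the function names are illustrative assumptions. They are not claimed to match the games from the cited papers or pyDVL's code.

```python
from itertools import combinations
from math import comb
from typing import Callable, FrozenSet, List


def exact_shapley(n: int, utility: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley values by enumerating all subsets (feasible for small n)."""
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # k = size of the subset not containing i
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {i}) - utility(s))
        values.append(total)
    return values


def majority_game(s: FrozenSet[int]) -> float:
    """Toy 5-player game: a coalition wins (utility 1) with at least 3 members."""
    return 1.0 if len(s) >= 3 else 0.0


shapley_values = exact_shapley(5, majority_game)  # every entry is close to 0.2
```

The enumeration costs on the order of n · 2^(n−1) utility evaluations, so it only serves as a reference for small games.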
Fixed
- Bug in using DaskInfluenceCalculator with TorchnumpyConverter for single dimensional arrays #485
- Fix implementations of the `to` methods of TorchInfluenceFunctionModel implementations #487
- Fixed bug with checking for converged values in semivalues #341
Docs
- Add applications of data valuation section, display examples more prominently, make all sections visible in table of contents, use mkdocs material cards in the home page #492
New Contributors
- @opcode81 made their first contribution in #481
- @dependabot made their first contribution in #455
Full Changelog: v0.8.0...v0.8.1
v0.8.0
🆕 New interfaces, scaling computation, bug fixes and improvements 🎁
Added
- New cache backends: InMemoryCacheBackend and DiskCacheBackend PR #458
- New influence function interface `InfluenceFunctionModel`
- Data parallel computation with `DaskInfluenceCalculator` PR #26
- Sequential batch-wise computation and write to disk with `SequentialInfluenceCalculator` PR #377
- Adapt notebooks to new influence abstractions PR #430
Changed
- Refactor and simplify caching implementation PR #458
- Simplify display of computation progress PR #466
- Improve readme and explain better the examples PR #465
- Simplify and improve tests, add CodeCov code coverage PR #429
- Breaking Changes
  - Removed `compute_influences` and all related code. Replaced by new `InfluenceFunctionModel` interface. Removed modules:
    - influence.general
    - influence.inversion
    - influence.twice_differentiable
    - influence.torch.torch_differentiable
Fixed
- Import bug in README PR #457
Full Changelog: v0.7.1...v0.8.0
v0.7.1
🆕 New methods, bug fixes and improvements for local tests 🐞🧪
Added
- New method: Class-wise Shapley values PR #338
- New method: Data-OOB by @BastienZim PR #426, PR #431
- Added `AntitheticPermutationSampler` PR #439
- Faster semi-value computation with per-index check of stopping criteria (optional) PR #437
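Antithetic permutation sampling is commonly described as emitting each random permutation together with its reverse, so that marginal contributions measured early in one ordering are paired with those measured late in the other. The generator below is a sketch under that assumption. It is not the `AntitheticPermutationSampler` implementation, and its name and signature are illustrative.

```python
import random
from typing import Iterator, Optional, Sequence, Tuple


def antithetic_permutations(
    indices: Sequence[int], n_pairs: int, seed: Optional[int] = None
) -> Iterator[Tuple[int, ...]]:
    """Yield n_pairs random permutations, each followed by its reverse."""
    rng = random.Random(seed)
    for _ in range(n_pairs):
        perm = list(indices)
        rng.shuffle(perm)  # in-place uniform shuffle
        yield tuple(perm)
        yield tuple(reversed(perm))
```

For example, `list(antithetic_permutations(range(5), 3, seed=0))` yields six permutations, where each odd-positioned one is the reverse of its predecessor.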
Changed
- No longer using docker within tests to start a memcached server PR #444
- Using pytest-xdist for faster local tests PR #440
- Improvements and fixes to notebooks PR #436
- Refactoring of parallel module. Old imports will stop working in v0.9.0 PR #421
Fixed
- Fix initialization of `data_names` in `ValuationResult.zeros()` PR #443
v0.7.0
📚🆕 Documentation and IF overhaul, new methods and bug fixes 💥🐞
This is our first β release! We have worked hard to deliver improvements across the board, with a focus on documentation and usability. We have also reworked the internals of the `influence` module, improved parallelism and handling of randomness.
Added
- Implemented solving the Hessian equation via spectral low-rank approximation PR #365
- Enabled parallel computation for Leave-One-Out values PR #406
- Added more abbreviations to documentation PR #415
- Added seed to functions from `pydvl.utils.numeric`, `pydvl.value.shapley` and `pydvl.value.semivalues`. Introduced new type `Seed` and conversion function `ensure_seed_sequence`. PR #396
Changed
- Replaced sphinx with mkdocs for documentation. Major overhaul of documentation PR #352
- Made ray an optional dependency, relying on joblib as default parallel backend PR #408
- Decoupled `ray.init` from `ParallelConfig` PR #373
- Breaking Changes
  - Signature change: return information about Hessian inversion from `compute_influence_factors` PR #375
  - Major changes to IF interface and functionality. Foundation for a framework abstraction for IF computation. PR #278, PR #394
  - Renamed `semivalues` to `compute_generic_semivalues` PR #413
  - New `joblib` backend as default instead of ray. Simplify MapReduceJob. PR #355
  - Bump torch dependency for influence package to 2.0. PR #365
Fixed
- Fixes to parallel computation of generic semi-values: properly handle all samplers and stopping criteria, irrespective of parallel backend. PR #372
- Optimize memory usage in IF calculation PR #375
- Fix adding valuation results with overlapping indices and different lengths PR #370
- Fixed bugs in conjugate gradient and `linear_solve` PR #358
- Fix installation of dev requirements for Python 3.10 PR #382
- Improvements to IF documentation PR #371
New Contributors
Full Changelog: v0.6.1...v0.7.0
v0.6.1
🏗 Bug fixes and minor improvements
- Fix parsing keyword arguments of `compute_semivalues` dispatch function by @kosmitive in #333
- Create new `RayExecutor` class based on the concurrent.futures API, use the new class to fix an issue with Truncated Monte Carlo Shapley (TMCS) starting too many processes and dying, plus other small changes by @AnesBenmerzoug in #329
- Fix creation of GroupedDataset objects using the `from_arrays` and `from_sklearn` class methods by @AnesBenmerzoug in #334
- Fix release job not triggering on CI when a new tag is pushed by @AnesBenmerzoug in #331
- Added alias `ApproShapley` from Castro et al. 2009 for permutation Shapley by @mdbenito in #332
Full Changelog: v0.6.0...v0.6.1
v0.6.0
🆕 New algorithms, cleanup and bug fixes 🏗
- Fix/stopping checks by @mdbenito in #283
- Fix Monte Carlo Least Core error when n_iterations < len(dataset) by @AnesBenmerzoug in #281
- Hide parallel backend in tmcs main function by @mdbenito in #293
- Cosmetic changes to `Dataset` by @mdbenito in #290
- Refactor/nicer imports by @mdbenito in #284
- Fix StandardError stopping criterion by @mdbenito in #300
- Remove unpackable decorator, use asdict() by @mdbenito in #233
- Add burn-in param to AbsoluteStandardError by @mdbenito in #305
- Remove default non-negativity constraint on least core subsidy by @AnesBenmerzoug in #304
- Close #280: Add py.typed by @mdbenito in #307
- Minor docstring and cosmetic changes by @mdbenito in #317
- Allow passing additional kwargs to Dataset class' classmethods by @AnesBenmerzoug in #316
- Semi-values and samplers by @mdbenito in #319
- Remove bogus iter method. by @kosmitive in #326
- Improvements to ValuationResult by @mdbenito in #327
Full Changelog: v0.5.0...v0.6.0