Skip to content

Latest commit

 

History

History
142 lines (102 loc) · 5.24 KB

README.md

File metadata and controls

142 lines (102 loc) · 5.24 KB

dist-truncate

PyPI - Version PyPI - License Coverage Status

A small python package for truncating (continuous) scipy distributions.

Installation

From pypi

To install this package from pypi simply run

pip install dist-truncate

From source

To install from source clone this repository and install via pip:

git clone https://github.com/schoennenbeck/dist-truncate.git
cd dist-truncate
pip install .

Usage

Introduction

After installation simply import the package

import dist_truncate

This adds a property truncated to all distributions of type scipy.stats.rv_continuous. The property is itself of type scipy.stats.rv_continuous and works the same way as the distribution that was started with but with two additional shape-arguments trunc_min and trunc_max that can be used to truncated the support of the distribution to the interval trunc_min <= x <= trunc_max.

Note that some distributions (e.g. the normal distribution) already have an explicit truncated version implemented. In this case the explicit version should be used since it is most likely numerically more stable than this generic implementation.

A note on monkey patching

If you prefer not to use the monkey-patched version of scipy.stats.rv_continuous you can also use the truncation function directly:

from scipy import stats
from dist_truncate.truncate import truncated

dist = truncated(stats.loguniform)

Example

Let us truncate the standard normal distribution to the interval -0.5 <= x <= 2.0. As noted above the truncated normal is already implemented so we can compare the results and make sure that this generic version actually computes the right outputs.

import dist_truncate
from scipy import stats
import numpy as np

Comparing the cummulative density functions at multiple points:

stats.truncnorm.cdf(np.arange(-3, 3, 0.5), -0.5, 2.0)
# Output: array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.28631514, 0.57263027, 0.79676594, 0.93411656,
       1.        , 1.        ])
stats.norm.truncated.cdf(np.arange(-3, 3, 0.5), -0.5, 2.0)
# Output: array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.28631514, 0.57263027, 0.79676594, 0.93411656,
       1.        , 1.        ])

Comparing the inverse of the cdf:

stats.truncnorm.ppf(np.arange(0, 1, 0.2), -0.5, 2.0)
# Output: array([-0.5       , -0.14519108,  0.19172827,  0.55269814,  1.00897798])
stats.norm.truncated.ppf(np.arange(0, 1, 0.2), -0.5, 2.0)
# Output: array([-0.5       , -0.14519108,  0.19172827,  0.55269814,  1.00897798])

The minimum and maximum for the truncation can also be called as keyword-arguments:

stats.loguniform.truncated.support(1, 100, trunc_max=80, trunc_min=20)
# Output: (20, 80)

Broadcasting of shapes works the same as for all other arguments:

stats.loguniform.truncated.support(
    [1, 50],
    [50, 100],
    trunc_min=20,
    trunc_max=[[30, 40], [60, 100]]
)
# Output: (array([[20, 50], [20, 50]]), array([[ 30,  40], [ 50, 100]]))

A note on truncated distributions

This section contains a brief description of how truncated distributions in general and this implementation in particular work. If you just want to use the package you can safely skip this.

Let $X$ be a continuously distributed real-valued random variable with density function $f_X$ and cummulative density function $F_X$. Also let $a&gt;b \in \mathbb{R}$.

From a sampling perspective we get the truncated distribution $X_{|[a,b]}$ by sampling from $X$ until the sample lies in $[a,b]$. Equivalently, we can define $X_{|[a,b]}$ as having the same density function as $X$ but restricted to the interval $[a,b]$ and then rescaled such that all of $\mathbb{R}$ has probability $1$ again, i.e.

$ f_{X_{|[a,b]}}(x) = f_X(x) \cdot \mathbb{1}_{[a,b]}(x) \cdot \frac{1}{(F_X(b)-F_X(a))} $

where

$ \mathbb{1}_{[a,b]}(x) = \begin{cases} 1, & x \in [a,b] \ 0, & \text{else} \end{cases} $

Note that all of this only works if $F_X(b)&gt;F_X(a)$ so that the interval $[a,b]$ has positive probability (this is checked in the _argcheck method in the implementation).

Implementing this modified density function as the _pdf-method of the new distribution would already be enough to make this work in scipy (see documentation). However, working out the corresponding cummulative density function, its inverse, the survival function and so on and implementing them as _cdf, _ppf, _sf etc for the new distribution makes the implementation much faster and more numerically stable. For instance the cummulative density function is

$ F_{X_{|[a,b]}}(x) = \begin{cases} 0, & x< a \ \frac{F_X(x)-F_X(a)}{F_X(b)-F_X(a)}, & x \in [a,b] \ 1, & x>b\end{cases} $