From f58e3305ea38c9c964a61b307af23c53f7fb568c Mon Sep 17 00:00:00 2001 From: Matt Bartos Date: Tue, 26 Mar 2019 01:40:58 -0400 Subject: [PATCH] Updates to documentation, unit tests and version (#52) * Update dev from master (#47) * Working version of insert_point * Fix bug with query and insertpoint * Cleanup disp and codisp functions * Reorder methods * Insertpoint operational, add docstrings * Small fix * Remove old files * Allow empty tree; handle duplicates in insert_point and forget_point * Account for duplicates in tree construction * Add ability to print tree * Update docstrings and minor fixes * Docstring fix * Minor fixes * Bugfix for forget_point * Add image * Update readme * Update README.md * Type check for point * Update README.md * Update README.md * Store bounding boxes * Minor changes * Fix bbox bug, add unit tests * Add support for CI * Remove Python 3.7 * Update README.md * Fix duplicate precision bug * Fix duplicates issue? * Use new indexing strategy with forget_point * Update n bug * Return 0 for codisp if leaf is root * Add efficient shingle * Add sine wave image * Update README.md * Minor cleanup * sklearn test * Add classification notebook * Updated gitignore * Edit classification notebook * taxi data test * removed swamp * taxi data 200 tree run * IF test * Add OC-SVM example with sine wave * Add OC-SVM example with taxi data * Add IF example with sine wave * Minor changes * Minor updates * Delete old sine_ocsvm_test notebook * Delete old taxi_ocsvm_test notebook * Delete old sine_if notebook * Add IF example with sine wave * Add OC-SVM example with sine wave * Add OC-SVM example with taxi data * Delete old taxi_ocsvm notebook * Add OC-SVM example with taxi data * sine wave comaprasion * table1 notebook * rrcf notebook * renamed rrcf * taxi data if * Fix shingle bug; clean up classification example * Set theme jekyll-theme-minimal * Add index labels * Update batch image * Update README.md * Update README.md * Update README.md * Create 
_config.yml * Create default.html * Create index.md * Update index.md * Create nav.html * Create nav.yml * Create tree-construction.html * Rename tree-construction.html to tree-construction.md * Create insert-and-delete.md * Create anomaly-scoring.md * Create batch.md * Create streaming.md * Update tree-construction.md * Update tree-construction.md * Update tree-construction.md * Update tree-construction.md * Update tree-construction.md * Update tree-construction.md * Update insert-and-delete.md * Update insert-and-delete.md * Update tree-construction.md * Update insert-and-delete.md * Update insert-and-delete.md * Update anomaly-scoring.md * Create related-work.md * Update nav.yml * Update related-work.md * Create random-cut-tree.md * Update nav.yml * Create modifying-rctree.md * Update nav.yml * Update random-cut-tree.md * Update random-cut-tree.md * Update random-cut-tree.md * Update random-cut-tree.md * Update modifying-rctree.md * Update random-cut-tree.md * Create scoring-rctree.md * Update nav.yml * Update scoring-rctree.md * Update README.md * Update README.md * Update index.md * Update index.md * Update index.md * Create paper.md * Create paper.bib * Add files via upload * Update paper.md * Update README.md * Delete figure_1.png * Add files via upload * Update paper.md * Add files via upload * Create taxi.md * Update nav.yml * Update batch.md * Update streaming.md * Update streaming.md * Update streaming.md * Update streaming.md * Update taxi.md * Update related-work.md * Update related-work.md * Update related-work.md * Update tree-construction.md * Update insert-and-delete.md * Update anomaly-scoring.md * Update related-work.md * Update tree-construction.md * Update insert-and-delete.md * Update taxi.md * Update README.md * Update batch.md * Update streaming.md * Updates to authors * Update paper * updated abhi orcid * Create rctree-api.md * Update rctree-api.md * Update rctree-api.md * Update rctree-api.md * Update rctree-api.md * Update rctree-api.md * 
Update rctree-api.md * Update rctree-api.md * Update nav.yml * Update rctree-api.md * Update rctree-api.md * Update docstrings * Update rctree-api.md * Update rctree-api.md * Update anomaly-scoring.md * Update random-cut-tree.md * Update rctree-api.md * Update related-work.md * Update rctree-api.md * Update setup.py * Update __init__.py * Update paper.md * Update paper.md * JOSS review updates (#48) * Working version of insert_point * Fix bug with query and insertpoint * Cleanup disp and codisp functions * Reorder methods * Insertpoint operational, add docstrings * Small fix * Remove old files * Allow empty tree; handle duplicates in insert_point and forget_point * Account for duplicates in tree construction * Add ability to print tree * Update docstrings and minor fixes * Docstring fix * Minor fixes * Bugfix for forget_point * Add image * Update readme * Update README.md * Type check for point * Update README.md * Update README.md * Store bounding boxes * Minor changes * Fix bbox bug, add unit tests * Add support for CI * Remove Python 3.7 * Update README.md * Fix duplicate precision bug * Fix duplicates issue? 
* Use new indexing strategy with forget_point * Update n bug * Return 0 for codisp if leaf is root * Add efficient shingle * Add sine wave image * Update README.md * Minor cleanup * sklearn test * Add classification notebook * Updated gitignore * Edit classification notebook * taxi data test * removed swamp * taxi data 200 tree run * IF test * Add OC-SVM example with sine wave * Add OC-SVM example with taxi data * Add IF example with sine wave * Minor changes * Minor updates * Delete old sine_ocsvm_test notebook * Delete old taxi_ocsvm_test notebook * Delete old sine_if notebook * Add IF example with sine wave * Add OC-SVM example with sine wave * Add OC-SVM example with taxi data * Delete old taxi_ocsvm notebook * Add OC-SVM example with taxi data * sine wave comaprasion * table1 notebook * rrcf notebook * renamed rrcf * taxi data if * Fix shingle bug; clean up classification example * Set theme jekyll-theme-minimal * Add index labels * Update batch image * Update README.md * Update README.md * Update README.md * Create _config.yml * Create default.html * Create index.md * Update index.md * Create nav.html * Create nav.yml * Create tree-construction.html * Rename tree-construction.html to tree-construction.md * Create insert-and-delete.md * Create anomaly-scoring.md * Create batch.md * Create streaming.md * Update tree-construction.md * Update tree-construction.md * Update tree-construction.md * Update tree-construction.md * Update tree-construction.md * Update tree-construction.md * Update insert-and-delete.md * Update insert-and-delete.md * Update tree-construction.md * Update insert-and-delete.md * Update insert-and-delete.md * Update anomaly-scoring.md * Create related-work.md * Update nav.yml * Update related-work.md * Create random-cut-tree.md * Update nav.yml * Create modifying-rctree.md * Update nav.yml * Update random-cut-tree.md * Update random-cut-tree.md * Update random-cut-tree.md * Update random-cut-tree.md * Update modifying-rctree.md * Update 
random-cut-tree.md * Create scoring-rctree.md * Update nav.yml * Update scoring-rctree.md * Update README.md * Update README.md * Update index.md * Update index.md * Update index.md * Create paper.md * Create paper.bib * Add files via upload * Update paper.md * Update README.md * Delete figure_1.png * Add files via upload * Update paper.md * Add files via upload * Create taxi.md * Update nav.yml * Update batch.md * Update streaming.md * Update streaming.md * Update streaming.md * Update streaming.md * Update taxi.md * Update related-work.md * Update related-work.md * Update related-work.md * Update tree-construction.md * Update insert-and-delete.md * Update anomaly-scoring.md * Update related-work.md * Update tree-construction.md * Update insert-and-delete.md * Update taxi.md * Update README.md * Update batch.md * Update streaming.md * Updates to authors * Update paper * updated abhi orcid * Create rctree-api.md * Update rctree-api.md * Update rctree-api.md * Update rctree-api.md * Update rctree-api.md * Update rctree-api.md * Update rctree-api.md * Update rctree-api.md * Update nav.yml * Update rctree-api.md * Update rctree-api.md * Update docstrings * Update rctree-api.md * Update rctree-api.md * Update anomaly-scoring.md * Update random-cut-tree.md * Update rctree-api.md * Update related-work.md * Update rctree-api.md * Update setup.py * Update __init__.py * Update paper.md * Update paper.md * Update README.md * Update paper.bib * Update paper.md * Update paper.md * JOSS review suggested changes * Add license badge * Move installation instructions; list dependencies * Add coveralls support * Add classification and comparison to docs * Move notebook material into documentation * Add example data * Update data locations in docs * Fix error with coveralls build * Add coveralls badge * Add init for pytest * Increase test coverage * Add version numbers to dependencies * Update README.md * Update index.md * Add caveats documentation * Minor edit to caveats * Fix 
spacing it comparisons documentation * Spacing updates to docs * Update version post-JOSS review --- README.md | 46 ++++++++++++++++++++++++---- docs/_data/nav.yml | 2 ++ docs/caveats.md | 27 +++++++++++++++++ docs/classification.md | 69 +++++++++++++++++++++++++++--------------- docs/comparisons.md | 68 ++++++++++++++++++++++++++--------------- docs/index.md | 67 ++++++++++++++++++++++++++++++++++++---- rrcf/__init__.py | 2 +- setup.py | 2 +- 8 files changed, 219 insertions(+), 64 deletions(-) create mode 100644 docs/caveats.md diff --git a/README.md b/README.md index 03d4291..0cbb9ab 100644 --- a/README.md +++ b/README.md @@ -39,14 +39,16 @@ Currently, only Python 3 is supported. The following dependencies are *required* to install and use `rrcf`: -- [numpy](http://www.numpy.org/) +- [numpy](http://www.numpy.org/) (>= 1.15) The following *optional* dependencies are required to run the examples shown in the documentation: -- [pandas](https://pandas.pydata.org/) -- [scipy](https://www.scipy.org/) -- [scikit-learn](https://scikit-learn.org/stable/) -- [matplotlib](https://matplotlib.org/) +- [pandas](https://pandas.pydata.org/) (>= 0.23) +- [scipy](https://www.scipy.org/) (>= 1.2) +- [scikit-learn](https://scikit-learn.org/stable/) (>= 0.20) +- [matplotlib](https://matplotlib.org/) (>= 3.0) + +Listed version numbers have been tested and are known to work (this does not necessarily preclude older versions). ## Robust random cut trees @@ -234,4 +236,36 @@ for index, point in enumerate(points): ## Contributing -To contribute, submit a pull request to the `dev` branch. +We welcome contributions to the `rrcf` repo. To contribute, submit a [pull request](https://help.github.com/en/articles/about-pull-requests) to the `dev` branch. 
+ +#### Types of contributions + +Some suggested types of contributions include: + +- Bug fixes +- Documentation improvements +- Performance enhancements +- Extensions to the algorithm + +Check the issue tracker for any specific issues that need help. If you encounter a problem using `rrcf`, or have an idea for an extension, feel free to raise an issue. + +#### Guidelines for contributors + +Please consider the following guidelines when contributing to the codebase: + +- Ensure that any new methods, functions or classes include docstrings. Docstrings should include a description of the code, as well as descriptions of the inputs (arguments) and outputs (returns). Providing an example use case is recommended (see existing methods for examples). +- Write unit tests for any new code and ensure that all tests are passing with no warnings. Please ensure that overall code coverage does not drop below 80%. + +#### Running unit tests + +To run unit tests, first ensure that `pytest` and `pytest-cov` are installed: + +``` +$ pip install pytest pytest-cov +``` + +To run the tests, navigate to the root directory of the repo and run: + +``` +$ pytest --cov=rrcf/ +``` diff --git a/docs/_data/nav.yml b/docs/_data/nav.yml index 2e49888..17011fe 100644 --- a/docs/_data/nav.yml +++ b/docs/_data/nav.yml @@ -19,6 +19,8 @@ toc: url: /rrcf/scoring-rctree.html - page: API documentation url: /rrcf/rctree-api.html + - page: Caveats and gotchas + url: /rrcf/caveats.html - title: Examples subfolderitems: - page: Batch detection diff --git a/docs/caveats.md b/docs/caveats.md new file mode 100644 index 0000000..a245e48 --- /dev/null +++ b/docs/caveats.md @@ -0,0 +1,27 @@ +# Caveats and gotchas + +## Scaling of dimensions + +The RRCF algorithm considers the relative scale of each dimension when constructing robust random cut trees. This means that dimensions with less variability (on an absolute scale) will affect the outlier score of a point less than dimensions with higher variability. 
+
+This consideration is important to remember if each dimension represents a different categorical property or is measured with a different set of units. Consider, for example the following dataset.
+
+| Person | Height (in) | Weight (lb) | Age (yr) |
+| ------------| ------------ | ------------- | ----------- |
+| Alice | 61 | 105 | 34 |
+| Bob | 70 | 300 | 50 |
+| Timmy | 48 | 70 | 10 |
+| Nosferatu | 75 | 180 | 170 |
+
+In this case, `Weight` will influence the outlier score most, because the range between the maximum and minimum values is largest (300 - 70 = 230). However, looking at the table, age seems like the most intuitive category for determining the outlier (in this case, Nosferatu is more than three times as old as the second-oldest person).
+
+In cases where each column is measured in different units, or measures a different type of quantity, it may be necessary to scale each column before constructing the random cut tree. For example, min-max scaling each column between zero and one yields:
+
+| Person | Height (-) | Weight (-) | Age (-) |
+| ------------| ------------ | ------------- | ----------- |
+| Alice | 0.48 | 0.15 | 0.15 |
+| Bob | 0.81 | 1.0 | 0.25 |
+| Timmy | 0.0 | 0.0 | 0.0 |
+| Nosferatu | 1.0 | 0.48 | 1.0 |
+
+Other scaling methods may suit other datasets better (for instance, scaling each dimension to a mean of zero and a standard deviation of one). The user should experiment with different scalings to determine the method that works best for the task at hand.
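Reviewer note on `docs/caveats.md`: the min-max scaling described in the new caveats page can be reproduced with a few lines of NumPy. The following is a minimal sketch, not part of the patch. The helper names `min_max_scale` and `standardize` are hypothetical (they are not `rrcf` functions), and the handling of constant columns is an illustrative assumption.

```python
import numpy as np

# Values copied from the example table in docs/caveats.md.
# Columns: height (in), weight (lb), age (yr).
people = ["Alice", "Bob", "Timmy", "Nosferatu"]
X = np.array([
    [61.0, 105.0, 34.0],
    [70.0, 300.0, 50.0],
    [48.0, 70.0, 10.0],
    [75.0, 180.0, 170.0],
])


def min_max_scale(X):
    """Hypothetical helper: rescale each column of X linearly onto [0, 1]."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Assumption: leave constant columns at zero rather than dividing by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / col_range


def standardize(X):
    """Hypothetical helper: scale each column to zero mean and unit std."""
    col_std = X.std(axis=0)
    col_std = np.where(col_std == 0, 1.0, col_std)
    return (X - X.mean(axis=0)) / col_std


# Raw per-column ranges: weight spans the widest interval.
raw_ranges = X.max(axis=0) - X.min(axis=0)

X_scaled = min_max_scale(X)
X_standard = standardize(X)

for name, row in zip(people, np.round(X_scaled, 2)):
    print(name, row)
```

Assuming the tree is then built from the scaled array (for example by passing `X_scaled` instead of `X` to `rrcf.RCTree`), every column spans the same interval, so no single dimension dominates purely because of its units.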
diff --git a/docs/classification.md b/docs/classification.md index 1fd1d2a..4befbfe 100644 --- a/docs/classification.md +++ b/docs/classification.md @@ -70,7 +70,8 @@ avg_codisp /= num_trees ```python predictions = np.argmin(avg_codisp, axis=1) -test_error = 1 - ((predictions == labels).sum()/num_points) +test_error = 1 - ((predictions == labels).sum() + /num_points) print("Test error: {:.1f}%".format(100*test_error)) ``` @@ -83,13 +84,19 @@ Test error: 0.0% ```python fig = plt.figure(figsize=(8,6)) ax = fig.add_subplot(111, projection='3d') -ax.scatter(X_0[:,0], X_0[:,1], X_0[:,2], c='0.5', alpha=0.3, +ax.scatter(X_0[:,0], X_0[:,1], X_0[:,2], + c='0.5', alpha=0.3, label='Training data') -ax.scatter(X_1[:,0], X_1[:,1], X_1[:,2], c='0.5', alpha=0.3) -ax.scatter(x[predictions == 0][:,0], x[predictions == 0][:,1], - x[predictions == 0][:,2], c='b', label='Class 0') -ax.scatter(x[predictions == 1][:,0], x[predictions == 1][:,1], - x[predictions == 1][:,2], c='r', label='Class 1') +ax.scatter(X_1[:,0], X_1[:,1], X_1[:,2], + c='0.5', alpha=0.3) +ax.scatter(x[predictions == 0][:,0], + x[predictions == 0][:,1], + x[predictions == 0][:,2], + c='b', label='Class 0') +ax.scatter(x[predictions == 1][:,0], + x[predictions == 1][:,1], + x[predictions == 1][:,2], + c='r', label='Class 1') plt.title('Classification results', size=14) plt.legend(frameon=True) plt.tight_layout() @@ -110,10 +117,10 @@ x = nuc['x'].astype(float).T y = nuc['y'].astype(float).T y = pd.Series({-1:0, 1:1})[y.ravel()].values -plt.scatter(x[y == 0][:,0], x[y == 0][:,1], c='b', alpha=0.3, - label='Class 0') -plt.scatter(x[y == 1][:,0], x[y == 1][:,1], c='r', alpha=0.3, - label='Class 1') +plt.scatter(x[y == 0][:,0], x[y == 0][:,1], + c='b', alpha=0.3, label='Class 0') +plt.scatter(x[y == 1][:,0], x[y == 1][:,1], + c='r', alpha=0.3, label='Class 1') plt.title('Original labeled data', size=14) plt.xlabel('Total energy') plt.ylabel('Tail energy') @@ -134,8 +141,10 @@ d = 2 num_trees = 60 # Take random sample 
-X_0 = x[np.random.choice(np.flatnonzero(y.ravel() == 0), size=n)] -X_1 = x[np.random.choice(np.flatnonzero(y.ravel() == 1), size=n)] +X_0 = x[np.random.choice(np.flatnonzero(y.ravel() == 0), + size=n)] +X_1 = x[np.random.choice(np.flatnonzero(y.ravel() == 1), + size=n)] # Create random cut forests forest_0 = [] @@ -177,10 +186,13 @@ Test error: 9.0% ```python plt.scatter(X_0[:,0], X_0[:,1], c='0.5', alpha=0.1) -plt.scatter(X_1[:,0], X_1[:,1], c='0.5', alpha=0.1, label='Training data') -plt.scatter(x[ix][predictions == 0][:,0], x[ix][predictions == 0][:,1], +plt.scatter(X_1[:,0], X_1[:,1], c='0.5', alpha=0.1, + label='Training data') +plt.scatter(x[ix][predictions == 0][:,0], + x[ix][predictions == 0][:,1], c='b', alpha=0.4, label='Class 0') -plt.scatter(x[ix][predictions == 1][:,0], x[ix][predictions == 1][:,1], +plt.scatter(x[ix][predictions == 1][:,0], + x[ix][predictions == 1][:,1], c='r', alpha=0.4, label='Class 1') plt.title('Classified points', size=14) plt.xlabel('Total energy') @@ -207,14 +219,18 @@ for _ in range(num_trees): forest_0.append(tree_0) forest_1.append(tree_1) -points = np.vstack(np.dstack(np.meshgrid(np.linspace(0, 8, 100), - np.linspace(0, 1.4, 100)))) + points = np.vstack(np.dstack(np.meshgrid( + np.linspace(0, 8, 100), + np.linspace(0, 1.4, 100)))) + avg_codisp = np.zeros((nn, d)) for index in range(nn): for tree_0, tree_1 in zip(forest_0, forest_1): - tree_0.insert_point(points[index], index=n + index) - tree_1.insert_point(points[index], index=n + index) + tree_0.insert_point(points[index], + index=n + index) + tree_1.insert_point(points[index], + index=n + index) avg_codisp[index,0] += tree_0.codisp(n + index) avg_codisp[index,1] += tree_1.codisp(n + index) tree_0.forget_point(n + index) @@ -227,9 +243,10 @@ avg_codisp /= num_trees ```python fig, ax = plt.subplots(figsize=(10,6)) -plt.imshow(-np.log(avg_codisp[:,1] / avg_codisp[:,0]).reshape(100, 100), - cmap='seismic', extent=(0, 8, 0, 1.4), origin='lower', - aspect='auto') 
+plt.imshow(-np.log(avg_codisp[:,1] / + avg_codisp[:,0]).reshape(100, 100), + cmap='seismic', extent=(0, 8, 0, 1.4), + origin='lower', aspect='auto') plt.colorbar(label='Log ratio of Class 1 Codisp to Class 0 Codisp') plt.grid('off') plt.title('Decision regions', size=16) @@ -245,12 +262,14 @@ plt.tight_layout() ```python fig, ax = plt.subplots(figsize=(10,6)) -plt.imshow(np.log(np.min(avg_codisp, axis=1)).reshape(100, 100), +plt.imshow(np.log(np.min(avg_codisp, + axis=1)).reshape(100, 100), extent=(0, 8, 0, 1.4), origin='lower', aspect='auto', cmap='cubehelix_r') plt.colorbar(label='$\log(\min(CoDisp(x^{(0)}), CoDisp(x^{(1)})))$') plt.grid('off') -plt.title('Likelihood of belonging to neither class', size=14) +plt.title('Likelihood of belonging to neither class', + size=14) plt.xlabel('Total energy') plt.ylabel('Tail energy') plt.tight_layout() diff --git a/docs/comparisons.md b/docs/comparisons.md index 77b1fba..d03049d 100644 --- a/docs/comparisons.md +++ b/docs/comparisons.md @@ -34,30 +34,42 @@ n_inliers = n_samples - n_outliers # Outlier detectors from sklean plot anomaly_algorithms = [ - ("Robust covariance", EllipticEnvelope(contamination=outliers_fraction)), - ("One-Class SVM", svm.OneClassSVM(nu=outliers_fraction, - kernel="rbf", - gamma=0.1)), - ("Isolation Forest", IsolationForest(contamination=outliers_fraction, - behaviour='new')), - ("Local Outlier Factor", LocalOutlierFactor(n_neighbors=35, - contamination=outliers_fraction))] + ("Robust covariance", + EllipticEnvelope(contamination=outliers_fraction)), + ("One-Class SVM", + svm.OneClassSVM(nu=outliers_fraction, + kernel="rbf", + gamma=0.1)), + ("Isolation Forest", + IsolationForest(contamination=outliers_fraction, + behaviour='new')), + ("Local Outlier Factor", + LocalOutlierFactor(n_neighbors=35, + contamination=outliers_fraction))] # Define datasets -blobs_params = dict(random_state=0, n_samples=n_inliers, n_features=2) +blobs_params = dict(random_state=0, + n_samples=n_inliers, + n_features=2) 
datasets = [ - make_blobs(centers=[[0, 0], [0, 0]], cluster_std=0.5,**blobs_params)[0], - make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[0.5, 0.5],**blobs_params)[0], - make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[1.5, .3],**blobs_params)[0], - 4. * (make_moons(n_samples=n_samples, noise=.05, random_state=0)[0] + make_blobs(centers=[[0, 0], [0, 0]], + cluster_std=0.5,**blobs_params)[0], + make_blobs(centers=[[2, 2], [-2, -2]], + cluster_std=[0.5, 0.5],**blobs_params)[0], + make_blobs(centers=[[2, 2], [-2, -2]], + cluster_std=[1.5, .3],**blobs_params)[0], + 4. * (make_moons(n_samples=n_samples, + noise=.05, random_state=0)[0] - np.array([0.5, 0.25])), - 14. * (np.random.RandomState(42).rand(n_samples, 2) - 0.5)] + 14. * (np.random.RandomState(42).rand(n_samples, 2) + - 0.5)] # Add outliers to the data sets outliers = [] # record keeping data = [] for i in datasets: - out = rng.uniform(low=-6, high=6, size=(n_outliers, 2)) + out = rng.uniform(low=-6, high=6, + size=(n_outliers, 2)) outliers.append(out) data.append(np.concatenate([i, out], axis=0)) @@ -75,10 +87,12 @@ for d in range(len(data)): tr1 = time.time() while len(forest) < num_trees: # Select random subsets of points uniformly from point set - ixs = np.random.choice(n, size=(n // tree_size, tree_size), + ixs = np.random.choice(n, + size=(n // tree_size, tree_size), replace=False) # Add sampled trees to forest - trees = [rrcf.RCTree(data[d][ix], index_labels=ix) for ix in ixs] + trees = [rrcf.RCTree(data[d][ix], + index_labels=ix) for ix in ixs] forest.extend(trees) # Compute average CoDisp @@ -99,7 +113,8 @@ for d in range(len(data)): t0 = time.time() algorithm.fit(data[d]) t1 = time.time() - plt.subplot(5, len(anomaly_algorithms) + 1, plot_num) + plt.subplot(5, len(anomaly_algorithms) + 1, + plot_num) if d == 0: plt.title(name, size=16) @@ -107,12 +122,14 @@ for d in range(len(data)): if name == "Local Outlier Factor": y_pred = algorithm.fit_predict(data[d]) else: - y_pred = 
algorithm.fit(data[d]).predict(data[d]) + y_pred = (algorithm.fit(data[d]) + .predict(data[d])) colors = np.array(['#377eb8', '#ff7f00']) plt.scatter(data[d][:, 0], data[d][:, 1], s=10, color=colors[(y_pred + 1) // 2]) - plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'), + plt.text(.99, .01, + ('%.2fs' % (t1 - t0)).lstrip('0'), transform=plt.gca().transAxes, size=15, horizontalalignment='right') plot_num += 1 @@ -122,16 +139,17 @@ for d in range(len(data)): mask = np.percentile(avg_cod, 85) avg_cod[avg_cod < mask] = 1 avg_cod[avg_cod > mask] = 0 - c = ['#377eb8' if i == 0 else '#ff7f00' for i in avg_cod] + c = ['#377eb8' if i == 0 else '#ff7f00' + for i in avg_cod] plt.scatter(data[d][:,0], data[d][:,1], s=10, c=c) if d == 0: plt.title("RRCF", size=16) - plt.text(.99, .01, ('%.2fs' % (tr2 - tr1)).lstrip('0'), - transform=plt.gca().transAxes, size=15, - horizontalalignment='right') + plt.text(.99, .01, + ('%.2fs' % (tr2 - tr1)).lstrip('0'), + transform=plt.gca().transAxes, size=15, + horizontalalignment='right') plot_num += 1 -plt.savefig('method_comparison.png', bbox_inches='tight') ``` ![Comparison](https://s3.us-east-2.amazonaws.com/mdbartos-img/rrcf/method_comparison.png) diff --git a/docs/index.md b/docs/index.md index f39bd15..80bcc42 100644 --- a/docs/index.md +++ b/docs/index.md @@ -3,7 +3,7 @@ layout: default --- # rrcf 🌲🌲🌲 -[![Build Status](https://travis-ci.org/kLabUM/rrcf.svg?branch=master)](https://travis-ci.org/kLabUM/rrcf) [![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/) +[![Build Status](https://travis-ci.org/kLabUM/rrcf.svg?branch=master)](https://travis-ci.org/kLabUM/rrcf) [![Coverage Status](https://coveralls.io/repos/github/kLabUM/rrcf/badge.svg?branch=master)](https://coveralls.io/github/kLabUM/rrcf?branch=master) [![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/) 
![GitHub](https://img.shields.io/github/license/kLabUM/rrcf.svg) Implementation of the *Robust Random Cut Forest Algorithm* for anomaly detection by [Guha et al. (2016)](http://proceedings.mlr.press/v48/guha16.pdf). @@ -26,6 +26,35 @@ The *Robust Random Cut Forest* (RRCF) algorithm is an ensemble method for detect This repository provides an open-source implementation of the RRCF algorithm and its core data structures for the purposes of facilitating experimentation and enabling future extensions of the RRCF algorithm. +## Documentation + +Read the docs [here 📖](https://klabum.github.io/rrcf/). + +## Installation + +Use `pip` to install `rrcf` via pypi: + +```shell +$ pip install rrcf +``` + +Currently, only Python 3 is supported. + +### Dependencies + +The following dependencies are *required* to install and use `rrcf`: + +- [numpy](http://www.numpy.org/) (>= 1.15) + +The following *optional* dependencies are required to run the examples shown in the documentation: + +- [pandas](https://pandas.pydata.org/) (>= 0.23) +- [scipy](https://www.scipy.org/) (>= 1.2) +- [scikit-learn](https://scikit-learn.org/stable/) (>= 0.20) +- [matplotlib](https://matplotlib.org/) (>= 3.0) + +Listed version numbers have been tested and are known to work (this does not necessarily preclude older versions). + ## Robust random cut trees A robust random cut tree is a binary search tree that can be used to detect outliers in a point set. Points located nearer to the root of the tree are more likely to be outliers. @@ -218,12 +247,38 @@ for index, point in enumerate(points): ![Image](https://raw.githubusercontent.com/kLabUM/rrcf/master/resources/sine.png) -## Installation +## Contributing -To install: +We welcome contributions to the `rrcf` repo. To contribute, submit a [pull request](https://help.github.com/en/articles/about-pull-requests) to the `dev` branch. 
-```shell -$ pip install rrcf +#### Types of contributions + +Some suggested types of contributions include: + +- Bug fixes +- Documentation improvements +- Performance enhancements +- Extensions to the algorithm + +Check the issue tracker for any specific issues that need help. If you encounter a problem using `rrcf`, or have an idea for an extension, feel free to raise an issue. + +#### Guidelines for contributors + +Please consider the following guidelines when contributing to the codebase: + +- Ensure that any new methods, functions or classes include docstrings. Docstrings should include a description of the code, as well as descriptions of the inputs (arguments) and outputs (returns). Providing an example use case is recommended (see existing methods for examples). +- Write unit tests for any new code and ensure that all tests are passing with no warnings. Please ensure that overall code coverage does not drop below 80%. + +#### Running unit tests + +To run unit tests, first ensure that `pytest` and `pytest-cov` are installed: + +``` +$ pip install pytest pytest-cov ``` -Currently, only Python 3 is supported. +To run the tests, navigate to the root directory of the repo and run: + +``` +$ pytest --cov=rrcf/ +``` diff --git a/rrcf/__init__.py b/rrcf/__init__.py index acdf22c..76d8e1e 100644 --- a/rrcf/__init__.py +++ b/rrcf/__init__.py @@ -1,3 +1,3 @@ from rrcf.rrcf import * from rrcf.shingle import shingle -__version__ = "0.2" +__version__ = "0.3" diff --git a/setup.py b/setup.py index c8a338d..8a9cd36 100644 --- a/setup.py +++ b/setup.py @@ -3,7 +3,7 @@ from setuptools import setup setup(name='rrcf', - version='0.2', + version='0.3', description='Robust random cut forest for anomaly detection', author='Matt Bartos, Abhiram Mullapudi, Sara Troutman', author_email='mdbartos@umich.edu, abhiramm@umich.edu, stroutm@umich.edu',