Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replacing pandas-profiling (deprecated) with ydata-profiling #183

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions docs_sources/index.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ These datasets cover a broad range of applications including binary/multi-class
In the interactive [plotly](https://plotly.com/) chart below, each dot represents a dataset colored based on its associated task (classification vs. regression).
In log scale, the *x* and *y* axis shows the number of observations and features respectively.
Please click on the legend to hide/show the groups of datasets.
Click on each dot to access the dataset's [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) report.
Click on each dot to access the dataset's [ydata-profiling](https://docs.profiling.ydata.ai/latest/) report.

*Note*: If a dataset has more than 20 features, we randomly chose 20 to be displayed in its profiling report. Therefore, please disregard the `Number of variables` in the corresponding report and, instead, use the correct `n_features` in the chart and table below.

Expand Down Expand Up @@ -84,7 +84,7 @@ ply

Browse, sort, filter and search the complete table of summary statistics below.

* Click on the dataset's name to access its [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) report.
* Click on the dataset's name to access its [ydata-profiling](https://docs.profiling.ydata.ai/latest/) report.

* Click on the GitHub Octocat <i class="fab fa-github"></i> to access its metadata.

Expand Down
6 changes: 3 additions & 3 deletions paper/paper.md
Original file line number Diff line number Diff line change
Expand Up @@ -122,10 +122,10 @@ API reference guides that detail all user-facing functions and variables in PMLB

## Pandas profiling reports

For each dataset, we use [`pandas-profiling`](https://pandas-profiling.github.io/pandas-profiling/) to generate summary statistic reports.
In addition to the descriptive statistics provided by the commonly-used `pandas.describe` (Python) [@McKinney2010] or `skimr::skim` (R) functions, `pandas-profiling` gives a more extensive exploration of the dataset, including correlation structure within the dataset and flagging of duplicate samples.
For each dataset, we use [`ydata-profiling`](https://docs.profiling.ydata.ai/latest/) to generate summary statistic reports.
In addition to the descriptive statistics provided by the commonly-used `pandas.describe` (Python) [@McKinney2010] or `skimr::skim` (R) functions, `ydata-profiling` gives a more extensive exploration of the dataset, including correlation structure within the dataset and flagging of duplicate samples.
Browsing a report allows users and contributors to easily assess dataset quality and make any necessary changes.
For example, if a feature is flagged by `pandas-profiling` as having a single value replicated in all samples, it is likely that this feature is uninformative for ML analysis and should be removed from the dataset.
For example, if a feature is flagged by `ydata-profiling` as having a single value replicated in all samples, it is likely that this feature is uninformative for ML analysis and should be removed from the dataset.

The profiling reports can be accessed by clicking on the dataset name in the interactive data table or the data point in the interactive chart on the PMLB website.
Alternatively, all reports can be viewed on the repository's [gh-pages](https://github.com/EpistasisLab/pmlb/tree/gh-pages/profile) branch, or generated manually by users on their local computing resources.
Expand Down
2 changes: 1 addition & 1 deletion pmlb/profiling.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
import subprocess

import pandas as pd
from pandas_profiling import ProfileReport
from ydata_profiling import ProfileReport

from .pmlb import (
fetch_data, get_updated_datasets, last_commit_message
Expand Down
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ def calculate_version():
],
extras_require={
'dev': ['nose', 'numpy', 'scipy', 'tabulate', 'parameterized',
'matplotlib', 'seaborn', 'pandas-profiling'],
'matplotlib', 'seaborn', 'ydata-profiling'],
},
classifiers=[
'Intended Audience :: Developers',
Expand Down
Loading