Move information on numerical/non_numerical/encoded_non_numerical from .uns to .var #630

Merged
merged 29 commits into from
Dec 19, 2023

Conversation

eroell
Collaborator

@eroell eroell commented Dec 18, 2023

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes
Resolves #620. Uses .var['ehrapy_column_type'] instead of .uns to identify each variable's type as numeric, non_numeric, or non_numeric_encoded.

Technical details
Instead of the three lists .uns['numerical_columns'], .uns['non_numerical_columns'], and .uns['encoded_non_numerical_columns'], a single column in .var is used. For each variable it holds one of the values numeric, non_numeric, or non_numeric_encoded.

Reading in or transferring CSV files is not affected, and neither are most ways in which users interact with the AnnData object and ehrapy. However, this update does not strictly maintain backwards compatibility. For example, custom code that modifies .uns['numerical_columns'] etc. will break.
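
For custom code that still carries the old lists around, a migration could look roughly like the sketch below. It is only an illustration: `column_types_from_uns` is a hypothetical helper and not part of ehrapy, a plain dict stands in for `.uns`, and the key names are taken from the old example in this description.

```python
import pandas as pd

# Hypothetical helper (not part of ehrapy): turn the three legacy .uns lists
# into one per-variable column of type tags.
def column_types_from_uns(var_names, uns):
    legacy_to_tag = {
        "numerical_columns": "numeric",
        "non_numerical_columns": "non_numeric",
        "encoded_non_numerical_columns": "non_numeric_encoded",
    }
    # Map each variable name to the tag of the legacy list it appears in.
    tag_by_name = {
        name: tag
        for key, tag in legacy_to_tag.items()
        for name in uns.get(key, [])
    }
    # Variables that appear in none of the lists get None.
    return pd.Series(
        [tag_by_name.get(name) for name in var_names],
        index=list(var_names),
        name="ehrapy_column_type",
    )

# A plain dict stands in for the old adata.uns content.
legacy_uns = {
    "numerical_columns": ["glucose", "weight"],
    "non_numerical_columns": [],
    "encoded_non_numerical_columns": [],
}
column_types = column_types_from_uns(["glucose", "weight"], legacy_uns)
```

The resulting Series is indexed by variable name, so it could be assigned as a new column of a .var-like DataFrame that shares the same index.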

Additional context
Moving this information to the variable level in .var allows, for example, slicing, and removes the overhead of keeping .uns in sync with the variables when selecting or moving variables.
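
The difference can be sketched with plain pandas objects. This is an illustration under assumptions, not ehrapy code: a DataFrame stands in for `adata.var`, a dict stands in for `adata.uns`, and the variable names are made up.

```python
import pandas as pd

# Stand-in for adata.var: one row per variable, type stored on the variable itself.
var = pd.DataFrame(
    {"ehrapy_column_type": ["numeric", "numeric", "non_numeric"]},
    index=["glucose", "weight", "station"],
)

# Stand-in for the old bookkeeping in adata.uns: separate lists of names.
uns = {
    "numerical_columns": ["glucose", "weight"],
    "non_numerical_columns": ["station"],
}

# Select a subset of variables, as a slice along the variable axis would.
var_subset = var.loc[["glucose", "station"]]

# New approach: the type column is sliced together with the variables.
is_numeric = var_subset["ehrapy_column_type"] == "numeric"
numeric_after_slice = var_subset.index[is_numeric].tolist()

# Old approach: the lists still name a variable that is no longer present.
stale_entries = [
    name for name in uns["numerical_columns"] if name not in var_subset.index
]
```

Here `numeric_after_slice` only contains variables that survived the selection, while the `.uns`-style list would need a manual update to drop the entries collected in `stale_entries`. On an actual AnnData object, the same boolean mask over `.var` can be used to index the variable axis.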

Old example of creating a dummy dataset:

import ehrapy as ep
import numpy as np
import pandas as pd
import scanpy as sc

def create_dummy_dataset_numerical_in_obs():
    """
    Create a dummy dataset with numerical and non-numerical variables in obs.
    Also, has numerical variables in .X.
    """
    dummy_obs = {"disease":['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                "station": ['ICU', 'ICU', 'MICU', 'MICU', 'ICU', 'ICU', 'MICU', 'MICU'],
                "syst_bp_entry": [138, 139, 140, 141, 148, 149, 150, 151],
                "diast_bp_entry": [78, 79, 80, 81, 77, 78, 79, 80]}

    dummy_var = pd.DataFrame({"Unit": ["mg/dl", "kg"],
                              "ehrapy_column_type": ["numerical", "numerical"]})
    dummy_var.index = ['glucose', "weight"]
    dummy_X = np.array([[80, 90, 120, 130, 80, 130, 120, 90],
                        [77, 76, 60, 90, 110, 78, 56, 76]]).T

    adata_dummy = sc.AnnData(X=dummy_X, obs=dummy_obs, var=dummy_var)

    # Old approach: column types are tracked as lists of variable names in .uns
    adata_dummy.uns['numerical_columns'] = ['glucose', 'weight']
    adata_dummy.uns['non_numerical_columns'] = []
    adata_dummy.uns['encoded_non_numerical_columns'] = []

    return adata_dummy

adata_dummy = create_dummy_dataset_numerical_in_obs()

New example of creating a dummy dataset:

import ehrapy as ep
import numpy as np
import pandas as pd
import scanpy as sc

def create_dummy_dataset_numerical_in_obs():
    """
    Create a dummy dataset with numerical and non-numerical variables in obs.
    Also, has numerical variables in .X.
    """
    dummy_obs = {"disease":['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                "station": ['ICU', 'ICU', 'MICU', 'MICU', 'ICU', 'ICU', 'MICU', 'MICU'],
                "syst_bp_entry": [138, 139, 140, 141, 148, 149, 150, 151],
                "diast_bp_entry": [78, 79, 80, 81, 77, 78, 79, 80]}

    dummy_var = pd.DataFrame({"Unit": ["mg/dl", "kg"]})
    dummy_var.index = ['glucose', "weight"]
    dummy_X = np.array([[80, 90, 120, 130, 80, 130, 120, 90],
                        [77, 76, 60, 90, 110, 78, 56, 76]]).T

    adata_dummy = sc.AnnData(X=dummy_X, obs=dummy_obs, var=dummy_var)
    
    return adata_dummy

adata_dummy = create_dummy_dataset_numerical_in_obs()

Member

@Zethson Zethson left a comment

This is great!

We're not using the autogenerated (ChatGPT/VS Code/something) types in the docstrings because we only want to use type hints in the function header. Otherwise, it's 2x the maintenance!

eroell and others added 5 commits December 18, 2023 22:59
Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net>
No type annotation in docstrings

Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net>
@eroell eroell marked this pull request as ready for review December 19, 2023 08:36
@Zethson Zethson merged commit 3cbeafd into theislab:main Dec 19, 2023
11 of 14 checks passed
Development

Successfully merging this pull request may close these issues.

information on numerical vs non-numerical features to adata.var
2 participants