Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On disk layers in H5AD file #63

Closed
GabrielHoffman opened this issue May 13, 2022 · 7 comments
Closed

On disk layers in H5AD file #63

GabrielHoffman opened this issue May 13, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@GabrielHoffman
Copy link
Contributor

GabrielHoffman commented May 13, 2022

I have an H5AD file that stores both normalized data and raw counts produced by pegasus. I can use zellkonverter to read the default normalized counts as a DelayedMatrix, but the raw counts are imported as a dgCMatrix. How can I use a DelayedMatrix instead?

This follows up on our conversion in #57, but applied to the new H5AD format.

As for the details, I have an H5AD file with the structure:

> h5ls example.h5ad
X                        Group
layers                   Group
obs                      Group
obsm                     Group
obsp                     Group
raw                      Group
uns                      Group
var                      Group
varm                     Group
varp                     Group

where X stores normalized data and layers/raw_new stores the raw counts.

I read the data in using:

sce = readH5AD(file, use_hdf5=TRUE, verbose=TRUE, layers=TRUE) 

class(assay(sce, "X"))
# [1] "DelayedMatrix"
# attr(,"package")
# [1] "DelayedArray"

class(assay(sce, "raw_new"))
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"

# object.size(assay(sce, "X"))
# 113882552 bytes

object.size(assay(sce, "raw_new"))
# 12781499800 bytes

The raw_new field is a 12Gb dgCMatrix.

I have zellkonverter v1.7.0, Using anndata version 0.8.0

Cheers,
Gabriel

@lazappi lazappi added the bug Something isn't working label May 16, 2022
@lazappi
Copy link
Member

lazappi commented May 16, 2022

Hi @GabrielHoffman

I think I might know what's going on here but I need to look into it more. Do you have an example file you would be happy to share? A small subset of this data would be perfect.

@GabrielHoffman
Copy link
Contributor Author

After a lot of digging and working with my colleague, I figured out it's an issue with the way the H5AD was generated with pegasus. Instead of using the standard raw field, using a custom field name causes the raw counts to be saved to layers/customField. The custom field was name raw_new so I didn't realize that it was non-standard.

Here is some code to reproduce the H5AD file with this issue.

import pegasus as pg
import pandas as pd

# wget https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip
data = pg.read_input('MantonBM_nonmix_subset.zarr.zip')
pg.identify_robust_genes(data)

# Transform counts, but retain original in backup_matrix
# The default is raw.X which saves to h5ad_file/raw like we expect
# When backup_matrix is set to a custom value, the result is saved in layers/customField
# This is what causes the problem, since h5ad_file/raw is no longer written
# https://pegasus.readthedocs.io/en/stable/api/pegasus.log_norm.html#pegasus.log_norm
pg.log_norm(data, backup_matrix = 'raw_new')

pg.write_output( data, "out.h5ad")

Based on this, it's not a zellkonverter issue. It can be resolved by using standard field names, since only standard fields are read as a DelayedMatrix.

Agree?

Best,
Gabriel

@lazappi
Copy link
Member

lazappi commented May 17, 2022

Thanks! I still need to test things but I think that makes sense. We should actually be able to support DelayedMatrix in any layers item but there was an upstream issue which meant we turned it off. That should be fixed now and #50 is about turning it back on but we didn't get around to it before the last release. Now that it looks like at least some people want to use it it should be more of a priority though.

@lcolladotor
Copy link

Hi!

This is just a longer 👀 message ;)

@Nick-Eagles @abspangler13 and I are going to be using some of the same files Gabriel is using, and so, we will have the same issues Gabriel described. Thank you Gabriel et al for spearheading this and thanks Luke for your support! From our side, Nick is the one who has R and Python experience, and thus has been our in house zellkonverter expert =) Please let us know if we can help in any way.

Best,
Leo

@lazappi
Copy link
Member

lazappi commented May 24, 2022

The main thing that would be helpful would be an example .h5ad file where this is an issue (I don't use pegasus so can't run the code above). Apart from that I need to look into turning the DelayedArray support for layers other than X back on.

@GabrielHoffman
Copy link
Contributor Author

GabrielHoffman commented May 26, 2022

We have worked out the issue on our end, by writing to raw, so readH5AD() works aspected.

Thanks!

@lcolladotor
Copy link

Thanks Gabriel, Prashant, @Nick-Eagles et al for figuring this one out!

Thanks again Luke for the support ^^.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants