Adding DESC to data integration benchmarking #28

Open · LuckyMD opened this issue Jul 24, 2020 · 21 comments


LuckyMD commented Jul 24, 2020

Hi @eleozzr,

We were thinking about adding DESC to our benchmark of data integration tools (https://github.com/theislab/scib). We would be running our own pre-processing for the input to DESC, which relies on Scanpy version 1.4.5+. Do you think it would be possible to use just the desc.train() function if we remove the Scanpy requirement and install via GitHub? Would it also be okay to use Keras 2.2.4?

Also, to compare the methods properly we would not be able to use the clustering output you provide, but instead we would use the embedding at a default clustering resolution (resolution=0.8 as in your tutorial). Would this be a suitable way of evaluating DESC?

Kind regards,


eleozzr commented Jul 24, 2020

Because of DESC's dependencies on other packages, such as TensorFlow, Scanpy, and Keras, we are updating and testing our desc algorithm to be compatible with TensorFlow 2.0 and Scanpy 1.4.5+. Hopefully, the latest version can be uploaded to GitHub and PyPI tomorrow.

A single resolution (0.8 or 1.0) is fine for DESC.

Thanks.


LuckyMD commented Jul 24, 2020

Hi @eleozzr,

That is great news! If you have it online by tomorrow, we will be able to add it to the benchmark on time :). I am looking forward to a setup.py with keras>2.1 and scanpy>1.3.6 as dependencies :). I hope it will still be possible to use tensorflow 1.x as well, though, and that it's not exclusive to tensorflow 2.


eleozzr commented Jul 25, 2020

Hi LuckyMD,

I have already updated our desc algorithm.

  1. For TensorFlow 1.x, we released desc 2.0.3. Please see our Jupyter notebook example desc_2.0.3_paul.ipynb.
  2. For TensorFlow 2.x, we released desc 2.1.1. Please see our Jupyter notebook example desc_2.1.1_paul.ipynb.

Hope this helps. Thanks.


LuckyMD commented Jul 27, 2020

hi @eleozzr,

Thanks for the updates. However, I see that the scanpy version is still capped at <1.4.4 in version 2.0.3. That cap seems to be removed in version 2.1.1, but interestingly the tensorflow version is capped at 2.0 there. Could that be a typo? Would it be possible to remove the scanpy version cap for 2.0.3?


LuckyMD commented Jul 27, 2020

Also, is there anything limiting compatibility with Python 3.7?


eleozzr commented Jul 28, 2020

Also, is there anything limiting compatibility with Python 3.7?

All the scripts were tested with Python 3.5 and 3.6, but I think they should also work with Python 3.7.


LuckyMD commented Jul 28, 2020

Are you still planning to fix the install requirements in:

desc/setup.py

Lines 21 to 23 in 9acb047

'tensorflow>=1.7,<2.0',
'keras==2.1',
'scanpy>=1.3.6,<1.4.4',

Then I could just install directly from pip rather than forking the repo and installing from the fork.
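
For reference, the relaxed pins I'm hoping for would look roughly like this (just a sketch, not the released setup.py):

# Hypothetical relaxed pins for the tensorflow 1.x release (a sketch, not the released setup.py)
install_requires=[
    'tensorflow>=1.7,<2.0',   # tf 1.x release keeps the <2.0 cap
    'keras>=2.1',             # relaxed from keras==2.1
    'scanpy>=1.3.6',          # upper cap <1.4.4 removed so scanpy 1.4.5+ works
],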


eleozzr commented Jul 28, 2020

Are you still planning to fix the install requirements in:

desc/setup.py

Lines 21 to 23 in 9acb047

'tensorflow>=1.7,<2.0',
'keras==2.1',
'scanpy>=1.3.6,<1.4.4',

Then I could just install directly from pip rather than forking the repo and installing from the fork.

Could you try installing with pip install desc==2.1.1 or pip install desc==2.0.3?
I haven't figured out why the code on GitHub is not the latest.
If you need version 2.0.3, you can also download it directly from https://drive.google.com/file/d/106xrwqnskG-Eu--Bv_hvc0CtKSG64Xh0/view


LuckyMD commented Jul 28, 2020

I hadn't tried to install via pip yet, as the setup.py with the 2.0.3 tag still showed these dependencies. I will give it a go and report back.


LuckyMD commented Jul 28, 2020

Install for 2.0.3 worked, thanks :). Is there a reason you are limited to keras 2.1? Just a question... I can work with that as well ;).


LuckyMD commented Jul 28, 2020

Hi @eleozzr,

I have another question about using DESC. I can't see where you pass batch information to the algorithm so that it can perform data integration. Do you explicitly integrate data across batches or just produce a low-dimensional embedding that is less affected by batch?

If DESC doesn't explicitly do data integration, but only produces a low-dimensional embedding which is less affected by batch effects than the high-dimensional data, maybe we shouldn't be comparing it to other data integration tools? I guess that comparison might not be fair to DESC.

What do you think?


eleozzr commented Jul 28, 2020


Install for 2.0.3 worked, thanks :). Is there a reason you are limited to keras 2.1? Just a question... I can work with that as well ;).

Sometimes the TensorFlow and Keras versions need to match, so I pinned the Keras version directly to avoid unnecessary issues caused by a mismatch between TensorFlow and Keras.


eleozzr commented Jul 28, 2020

Hi @eleozzr,

I have another question about using DESC. I can't see where you pass batch information to the algorithm so that it can perform data integration. Do you explicitly integrate data across batches or just produce a low-dimensional embedding that is less affected by batch?

If DESC doesn't explicitly do data integration, but only produces a low-dimensional embedding which is less affected by batch effects than the high-dimensional data, maybe we shouldn't be comparing it to other data integration tools? I guess that comparison might not be fair to DESC.

What do you think?

The only batch information desc uses is the batch id. Generally speaking, you should scale the data within each batch instead of scaling across all cells. Here is a simple example:

import scanpy as sc
import desc

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=10)
adata.raw = adata.copy()
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
sc.pp.highly_variable_genes(adata, n_top_genes=1000, subset=True, inplace=True)
sc.pp.log1p(adata)
sc.pp.scale(adata, zero_center=True, max_value=6)
# When your datasets have a batch effect, scale the data within each batch instead
# (here `Group` is the column holding the batch id):
# adata = desc.scale_bygroup(adata, groupby="Group", max_value=6)
# Then you can feed adata into desc:
adata = desc.train(adata,
                   dims=[adata.shape[1], 128, 32],  # or 256 for the first hidden layer
                   tol=0.001,  # 0.005 is suggested when the dataset has fewer than 5000 cells
                   n_neighbors=10,
                   batch_size=256,
                   louvain_resolution=[0.8, 1.0],  # a single value also works
                   save_dir="your_result_output_dir",
                   do_tsne=True,
                   use_GPU=False,
                   num_Cores=1,
                   save_encoder_weights=True,
                   save_encoder_step=2,
                   use_ae_weights=False,
                   do_umap=False,
                   num_Cores_tsne=4,
                   learning_rate=500)

Hope this helps.
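
After training, the results for each resolution are stored back on the AnnData object, roughly like this (key names follow the DESC tutorials and may differ slightly between versions):

# Reading back the results (key names as in the DESC tutorials; they may differ by version)
clusters = adata.obs["desc_0.8"]            # cluster assignments at resolution 0.8
embedding = adata.obsm["X_Embeded_z0.8"]    # low-dimensional embedding at resolution 0.8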


LuckyMD commented Jul 28, 2020

Thanks for the example! The parametrization is slightly different to your example notebook. I will use something closer to this if you think this is a better default parametrization for datasets of 10k+ cells?

I will then add batch-specific scaling and then make a PR for the Benchmarking data integration repo here. Would it be okay if I tagged you in that PR so you could check that DESC is used as you think is correct?


eleozzr commented Jul 28, 2020

Thanks for the example! The parametrization is slightly different to your example notebook. I will use something closer to this if you think this is a better default parametrization for datasets of 10k+ cells?

I will then add batch-specific scaling and then make a PR for the Benchmarking data integration repo here. Would it be okay if I tagged you in that PR so you could check that DESC is used as you think is correct?

If you only need the embedding of desc, you can set do_tsne=False and do_umap=False.

And yes, it's fine to tag me in the PR.
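
A minimal embedding-only call would then look roughly like this (same parameters as in the example above, with the t-SNE and UMAP steps disabled):

# Embedding-only run: skip t-SNE and UMAP, keep the rest as in the example above
adata = desc.train(adata,
                   dims=[adata.shape[1], 128, 32],
                   louvain_resolution=[0.8],
                   save_dir="your_result_output_dir",
                   do_tsne=False,
                   do_umap=False,
                   use_GPU=False)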


LuckyMD commented Jul 28, 2020

If you only need the embedding of desc, you can set do_tsne=False and do_umap=False.

Yes, I've already changed this :). I will test the code, make a PR and then you can tell me if I'm doing something stupid ;).


LuckyMD commented Jul 29, 2020

Things seem to be running for me now, thanks! I just quickly wanted to highlight two things:

  1. Installing DESC also requires pydot and GraphViz (not the python package, but the C binaries), which are not automatically installed with the keras dependency (this is a keras issue, but maybe worth highlighting somewhere).
  2. Installing GraphViz via conda installed keras version 2.3.1 for me, but the code still ran through (I am still using tensorflow version 1.14). I assume that you don't really need the pinned version you have in setup.py.


eleozzr commented Jul 30, 2020

Regarding point 2 (the version pins in setup.py):

Thanks.


LuckyMD commented Jul 30, 2020

It would be great if you could look over our PR here: theislab/scib#131

Thanks!


LuckyMD commented Aug 5, 2020

Hi @eleozzr,
We are still having an issue with saving the network weights. I have turned off saving the weights via save_encoder_weights=False, and I set save_dir=tmp_dir as suggested by the defaults. However, we don't have permission to write to a local directory on our server. Is there a way to turn off saving any files at all?
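
For reference, the relevant part of our call currently looks roughly like this (a sketch; the directory name is a placeholder):

# Sketch of our current setup: weight saving is off, but desc still needs a writable save_dir
tmp_dir = "/path/to/tmp_dir"  # placeholder for the directory we pass in
adata = desc.train(adata,
                   save_encoder_weights=False,  # weight saving turned off
                   save_dir=tmp_dir,            # other files are still written here, which fails without write permission
                   do_tsne=False,
                   do_umap=False)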


LuckyMD commented Sep 11, 2020

It would be great to get some input on the above question of how to turn off saving the weights and other output files.
