Error in harmonizationApply due to site differences between training and testing data #50

Open
Yuvi-416 opened this issue Jul 1, 2024 · 5 comments
Labels
question Further information is requested

@Yuvi-416

Yuvi-416 commented Jul 1, 2024

Hey there,

Thank you for your exciting work.

I recently used this package in my work, and I am unsure whether the process I followed is correct. I am doing a regression analysis with Age as the prediction target. I collected healthy data from nine different sites, and we have patient data from two different sites.

  • The first question I want to ask is, do I need to apply harmonization to the testing data as well? I have already applied harmonization to the training data using the function below:

combat_model, features_train_combat = harmonizationLearn(features_train, covariates_train) # smooth_terms=['Age']
When I pass the combat_model returned by the function above to harmonizationApply together with the testing data, I get an error:
features_test_combat = harmonizationApply(features_test, covariates_test, combat_model)

ERROR:
IndexError: index 4 is out of bounds for axis 1 with size 4
I think I am getting this error because of the site difference between the training and testing data: the training sample has nine sites and the testing data has two. So my next question is: do I need to apply harmonization separately to the training and testing datasets?

I could not find any information about how to proceed or whether we should apply harmonization to all (training and testing) at once or apply it separately.

Can you suggest what I should do?

Thank you!

@rpomponio
Owner

Thanks for submitting this issue. It's crucial to know whether your two testing sites are included within the nine training sites, or if the sites are entirely disjoint between training & testing.

@Yuvi-416
Author

Yuvi-416 commented Jul 1, 2024

Hi there,
Thanks for the reply.
No, the testing sites are not included in the training sample.
I first used harmonizationLearn to harmonize the training data, whose batch labels are 1-9 because the data come from 9 databases, one number per database. I then passed the learned combat_model to harmonizationApply with the testing data, whose batch labels are 1-3.

After that, I get the error mentioned above.

I found the line below in the documentation, which pretty much explains why I am getting the error.
“Next, prepare the holdout data on which you will apply the model. This data must look exactly like the training data for harmonizationLearn, including the same number and order of covariates. If the holdout data contains a different number of sites, an error will be thrown.”
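The requirement quoted above can be checked before calling harmonizationApply. The following is a minimal sketch, assuming the site labels of both files are available as plain lists; the helper name `unseen_sites` and the example labels are hypothetical and not part of the package.

```python
def unseen_sites(train_sites, holdout_sites):
    """Return holdout site labels that never appear in the training labels."""
    known = set(train_sites)
    # Sort the difference so the report is stable and easy to read.
    return sorted(set(holdout_sites) - known)


# Hypothetical labels: training covers sites 1-9, the holdout set uses two new sites.
train_sites = [1, 2, 3, 4, 5, 6, 7, 8, 9]
holdout_sites = [10, 10, 11]

missing = unseen_sites(train_sites, holdout_sites)
if missing:
    print(f"Sites absent from training: {missing}")
```

This compares labels only. If the holdout file reuses the numbers 1-3 for sites that are physically different from training sites 1-3, the check passes even though the sites are disjoint, so labels should be made unique across files first.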

However, I would like to know what a feasible way to harmonise the data would be in my case.
I have two CSV files, one for training and one for testing. Each file has a SITE column with numeric labels (1-9 for training, 1-3 for testing) and an Age column, which I use as a covariate.

I want to know whether I need to concatenate both files and harmonise them together, or harmonise them separately.

@rpomponio
Owner

This method is not designed to harmonize sites that are not part of the training data.

That said, you should be able to fix the error by including all sites in the training and testing sets. Typically, users will designate a subset of their data (i.e., healthy controls) to train the harmonization model. Then, they will apply the model to all data (i.e., patients and healthy controls).

If some of your sites only contain patients, then unfortunately this method will not be suitable. This is a known limitation of statistical harmonization methods and I am not aware of a method that appropriately addresses this situation.
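The workflow described above (train on healthy controls, apply to everyone) depends on every site contributing at least one control. As a rough sketch, the split and the check for patient-only sites could look like this; the record layout and the keys `site` and `is_control` are assumptions made for illustration, and the harmonization calls themselves are left out.

```python
from collections import defaultdict


def split_controls(records):
    """Return (training_subset, sites_without_controls).

    Each record is a dict with the hypothetical keys "site" and "is_control".
    """
    controls_per_site = defaultdict(int)
    for rec in records:
        # Touch every site so patient-only sites appear with a count of 0.
        controls_per_site[rec["site"]] += 1 if rec["is_control"] else 0
    training = [rec for rec in records if rec["is_control"]]
    patient_only = sorted(s for s, n in controls_per_site.items() if n == 0)
    return training, patient_only


records = [
    {"site": "A", "is_control": True},
    {"site": "A", "is_control": False},
    {"site": "B", "is_control": True},
    {"site": "C", "is_control": False},  # site C has patients only
]

training, patient_only = split_controls(records)
print(len(training), patient_only)
```

Any site reported in `patient_only` falls under the limitation described above: the model trained on controls has no estimate for it.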

I am going to close this issue, but please feel free to re-open it if you have further questions.

@RituBC

RituBC commented Jan 7, 2025

Hey, thanks for the awesome work.
I recently used this package on longitudinal data sets (pre and post), and the script works fine.
However, I am now applying it to training and testing sets by holding out a small number of samples. Currently there are 197 samples in training and 29 in testing. When I use harmonizationLearn and harmonizationApply, I get the following error: "ValueError: operands could not be broadcast together with shapes (153,29) (153,197)". Does that mean the training and testing sets must contain the same number of samples? When I tried equal-sized sets (197 and 197) by inserting dummy data into the test set, it worked fine. Please advise.

@rpomponio rpomponio reopened this Jan 16, 2025
@rpomponio rpomponio added the question Further information is requested label Jan 16, 2025
@rpomponio
Owner

Please supply more details. What are the dimensions of the training data and testing data? Also, what are the dimensions of the training and testing covariates?

Including your code may help diagnose the issue.
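The dimensions requested above can be collected with a few lines. This is only a sketch for reporting shapes, not a diagnosis of the error; it assumes subjects are stored in rows and features in columns, and that the 153 in the error message is the feature count. The helper names are hypothetical.

```python
def describe_inputs(name, data, covars):
    """Summarize the dimensions of a data matrix and its covariate rows.

    `data` is a list of rows (one per subject); `covars` is a list of
    covariate rows. Both layouts are assumptions made for this sketch.
    """
    n_rows = len(data)
    n_cols = len(data[0]) if data else 0
    return {
        "name": name,
        "data_shape": (n_rows, n_cols),
        "covars_rows": len(covars),
        "rows_match": n_rows == len(covars),
    }


def same_feature_count(train_summary, test_summary):
    """True when both matrices have the same number of columns."""
    return train_summary["data_shape"][1] == test_summary["data_shape"][1]


# Toy stand-ins for the 197-subject and 29-subject sets, 153 features each.
train_info = describe_inputs(
    "train", [[0.0] * 153 for _ in range(197)], [{"SITE": 1}] * 197
)
test_info = describe_inputs(
    "test", [[0.0] * 153 for _ in range(29)], [{"SITE": 1}] * 29
)
print(train_info["data_shape"], test_info["data_shape"],
      same_feature_count(train_info, test_info))
```

Posting these summaries for the real arrays shows at a glance whether each data matrix has as many rows as its covariate table and whether both sets share the same number of features.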
