Error in harmonizationApply due to site differences between training and testing data #50

Yuvi-416 · 2024-07-01T13:15:21Z

Hey there,

Thank you for your exciting work.

I recently used this package for my work, and I am a little confused about how to use it and whether the process I followed is correct. I am doing a regression analysis, and the prediction target is Age. I collected healthy data from nine different sites, and we have patients' data from two different sites.

The first question I want to ask is, do I need to apply harmonization to the testing data as well? I have already applied harmonization to the training data using the function below:

combat_model, features_train_combat = harmonizationLearn(features_train, covariates_train) # smooth_terms=['Age']
When I pass the combat_model, which I get from the above function to the testing data, I get an error:
features_test_combat = harmonizationApply(features_test, covariates_test, combat_model)

ERROR:
IndexError: index 4 is out of bounds for axis 1 with size 4
I think I am getting this error because of the site difference between the training and testing data: the training sample has nine sites, and the testing data has two. So my next question is, do I need to apply harmonization separately to the training and testing datasets?

I could not find any information about how to proceed or whether we should apply harmonization to all (training and testing) at once or apply it separately.

Can you suggest what I should do?

Thank you!


rpomponio · 2024-07-01T17:41:35Z

Thanks for submitting this issue. It's crucial to know whether your two testing sites are included within the nine training sites, or if the sites are entirely disjoint between training & testing.

Yuvi-416 · 2024-07-01T21:53:25Z

Hi there,
Thanks for the reply.
No, the testing sites are not included in the training sample.
I first used harmonizationLearn to harmonize the training data, whose batch labels are 1-9 because the data come from nine databases (one number per database). I then passed the combat_model learned from the training data to harmonizationApply on the testing data, whose batch labels are 1-3.

After that, I get the error mentioned above.

I found the passage below in the documentation, which pretty much explains why I am getting the error:
"Next, prepare the holdout data on which you will apply the model. This data must look exactly like the training data for harmonizationLearn, including the same number and order of covariates. If the holdout data contains a different number of sites, an error will be thrown."
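The quoted passage implies that a holdout site never seen by harmonizationLearn cannot be handled by harmonizationApply. A minimal pre-flight sketch for catching this before the call follows; the helper name `unseen_sites` and the site labels are hypothetical and not part of the package.

```python
def unseen_sites(train_sites, test_sites):
    """Return the sorted site labels present in test_sites but absent from train_sites."""
    return sorted(set(test_sites) - set(train_sites))


# Hypothetical labels: nine training sites and three holdout sites that come
# from different databases, so each physical site gets its own distinct name.
train_sites = [f"train_db_{i}" for i in range(1, 10)]
test_sites = ["patient_db_1", "patient_db_2", "patient_db_3"]

missing = unseen_sites(train_sites, test_sites)
if missing:
    # These sites have no learned site effect, so applying the model is expected to fail.
    print("Sites not seen during harmonizationLearn:", missing)
```

Note that numeric labels reused across files (1-9 for training, 1-3 for testing) would pass this check even when they denote different sites, so the sketch assumes one distinct label per physical site.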

However, I would like to know what a feasible way to perform data harmonization would be in my case.
I have two CSV files, one for training and one for testing. Each file has a SITE column with numeric labels (1-9 for training, 1-3 for testing) and an Age column, which I use as a covariate.

Should I concatenate both files and harmonize them together, or harmonize them separately?

rpomponio · 2024-07-02T12:41:30Z

This method is not designed to harmonize sites that are not part of the training data.

That said, you should be able to fix the error by including all sites in both the training and testing sets. Typically, users will designate a subset of their data (e.g., healthy controls) to train the harmonization model. Then, they will apply the model to all data (e.g., patients and healthy controls).

If some of your sites only contain patients, then unfortunately this method will not be suitable. This is a known limitation of statistical harmonization methods and I am not aware of a method that appropriately addresses this situation.
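The workflow described above (train on healthy controls from every site, then apply to everyone) only works if each site contributes at least one control. A minimal sketch of that check, assuming each subject is a `(site, group)` pair and that controls are labeled `"HC"`; both the data layout and the label are hypothetical:

```python
from collections import defaultdict


def sites_without_controls(rows, control_label="HC"):
    """Return the sorted sites that have no subject with the control label.

    rows: iterable of (site, group) pairs, one per subject.
    """
    groups_by_site = defaultdict(set)
    for site, group in rows:
        groups_by_site[site].add(group)
    return sorted(site for site, groups in groups_by_site.items()
                  if control_label not in groups)


# Hypothetical subjects: site C contains patients only.
rows = [("A", "HC"), ("A", "patient"), ("B", "HC"), ("C", "patient")]

# Boolean mask selecting the subjects that would train the harmonization model.
train_mask = [group == "HC" for _, group in rows]

print(sites_without_controls(rows))  # ['C']
```

Any site returned by this check has no healthy controls to learn a site effect from, which is the situation the comment above describes as unsuitable for this method.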

I am going to close this issue, but please feel free to re-open it if you have further questions.

RituBC · 2025-01-07T07:23:37Z

Hey, thanks for the awesome work.
I recently used this package on longitudinal data sets (pre and post), and the script works fine.
However, I am now applying it to training and testing sets created by holding out a small number of samples. Currently there are 197 samples in training and 29 in testing. When I use harmonizationLearn and harmonizationApply, I get the following error: "ValueError: operands could not be broadcast together with shapes (153,29) (153,197)". Does that mean the training and testing sets must have the same number of samples? When I padded the test set with dummy data so that both sets had 197 samples, it worked fine. Please advise.

rpomponio · 2025-01-16T22:09:29Z

Please supply more details. What are the dimensions of the training data and testing data? Also, what are the dimensions of the training and testing covariates?

Including your code may help diagnose the issue.
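One way to gather the requested dimensions is a small consistency check run before the harmonization calls. This is only a sketch: the helper `shape_report` is hypothetical, it assumes features are laid out as (n_samples, n_features), and the 153 features are inferred from the error message rather than confirmed in the thread.

```python
def shape_report(train_shape, n_train_cov_rows, test_shape, n_test_cov_rows):
    """Return a list of shape problems; an empty list means the shapes are consistent.

    train_shape / test_shape: (n_samples, n_features) of the feature arrays.
    n_*_cov_rows: number of rows in the matching covariate tables.
    """
    problems = []
    if train_shape[0] != n_train_cov_rows:
        problems.append(f"training: {train_shape[0]} feature rows "
                        f"vs {n_train_cov_rows} covariate rows")
    if test_shape[0] != n_test_cov_rows:
        problems.append(f"testing: {test_shape[0]} feature rows "
                        f"vs {n_test_cov_rows} covariate rows")
    if train_shape[1] != test_shape[1]:
        problems.append(f"feature count differs: {train_shape[1]} (training) "
                        f"vs {test_shape[1]} (testing)")
    return problems


# Sizes from the report above: 197 training and 29 testing samples, 153 features (assumed).
print(shape_report((197, 153), 197, (29, 153), 29))   # []

# Hypothetical mix-up: a 197-row covariate table paired with the 29 test samples.
print(shape_report((197, 153), 197, (29, 153), 197))
```

With NumPy arrays and pandas tables, the arguments would come from `features.shape` and `len(covariates)`. The second call illustrates one kind of mismatch the check can surface; it is not a confirmed diagnosis of the error reported here.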

rpomponio closed this as completed Jul 2, 2024

rpomponio reopened this Jan 16, 2025

rpomponio added the question Further information is requested label Jan 16, 2025
