
Sparse table support - new feature #71

Closed
SoccerGeekPhD opened this issue Jan 23, 2024 · 2 comments

Comments

@SoccerGeekPhD

The wide data frame returned by comorbidity() can consume a lot of memory, and it does not match the input format.

Would changing the output to a sparse, two-column {ID, Name} format, just like the function's input, help with memory and performance?

For example, my data science team in a large insurance company processes claims with up to 9 ICD-10 codes each, tied to a specific service date. Each claim is pivoted to {patientID, ICD10, date, ...} and then filtered to unique {patientID, ICD10} pairs over a time frame to create the input to comorbidity(). So can you keep this narrow format for the output?
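To make the workflow above concrete, here is a minimal sketch of that pivot-and-deduplicate step in Python/pandas (the package itself is R; the column names, codes, and dates below are hypothetical):

```python
import pandas as pd

# Hypothetical claim lines: one row per (patient, code, service date),
# as produced by pivoting claims that carry up to 9 ICD-10 codes each.
claims = pd.DataFrame({
    "patientID": [1, 1, 1, 2, 2],
    "ICD10":     ["E11.9", "I10", "E11.9", "I10", "I10"],
    "date":      pd.to_datetime(
        ["2024-01-05", "2024-01-05", "2024-03-02", "2024-02-10", "2024-04-01"]
    ),
})

# Restrict to a time frame, then keep unique {patientID, ICD10} pairs --
# the narrow two-column format used as input to comorbidity().
window = claims[claims["date"] < "2024-04-01"]
narrow = window[["patientID", "ICD10"]].drop_duplicates().reset_index(drop=True)
print(narrow)
```

The duplicate {1, E11.9} pair collapses to one row, and the claim outside the time window is dropped, leaving three unique pairs.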

@ellessenne
Owner

Hi, thanks for reporting this. I wouldn't change the default behavior, but I am keen on adding an option to return a long format dataset instead. Let me have a look into it!

@ellessenne ellessenne added this to the 1.1.0 milestone May 17, 2024
@ellessenne ellessenne removed this from the 1.1.0 milestone May 25, 2024
@ellessenne
Owner

Hi again @SoccerGeekPhD, I experimented a little bit on this and can share the following.

  1. First, for background, all internal calculations are already done in long format, which is then "reshaped wide" for the output;

  2. I tried omitting the final step of the algorithm (the "reshape wide" part), and that reduces computation time by <10%. We are talking about a reduction of <1 second on a test dataset with 10,000,000 codes from 100,000 subjects.

  3. Moreover, the resulting dataset in long format uses about the same amount of memory as the wide dataset - actually a little more: 7.2 MB vs 7.38 MB for the wide and long output formats, respectively. The main difference is that the long dataset needs to store, for every subject, the name of each comorbidity condition detected, which takes more memory than storing ones and zeros in the wide output format.

  4. Updating the internal algorithm to use sparse matrices (e.g., relying on Matrix::sparseMatrix()) would be a huge undertaking as it requires rebuilding and refactoring the entire package from the ground up, which I definitely do not have the time to do at the moment.
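The memory point in (3) can be illustrated with an analogous wide-to-long reshape in Python/pandas (again, the package is R; the condition names below are made up for illustration):

```python
import pandas as pd

# Toy wide output: one row per subject, one 0/1 flag per condition
# (condition names are hypothetical).
wide = pd.DataFrame({
    "id":           [1, 2, 3],
    "diabetes":     [1, 0, 1],
    "hypertension": [0, 1, 1],
})

# Equivalent long output: one row per (subject, detected condition),
# storing the condition *name* as a string for every hit.
long_out = wide.melt(id_vars="id", var_name="condition", value_name="flag")
long_out = long_out[long_out["flag"] == 1].drop(columns="flag").reset_index(drop=True)
print(long_out)
```

Because the long form repeats condition names as strings, it can easily use as much memory as, or more than, the compact 0/1 flags of the wide form, matching the measurement reported above.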

I was wondering, do you have any specific benchmark on the performance of the package that you can share here? I will close this issue for now, but feel free to reopen it if you have more details to share.

Alternatively, I would recommend checking out the {icd} package (currently only available on GitHub), which uses sparse matrix multiplication and compiled code under the hood, see e.g. this draft paper. Note that the {comorbidity} package is much faster now, so the benchmarks in the article are not accurate anymore.

I hope this helps,

Alessandro
