Small change to derive_predictors_and_scores for speed + normalization #119

Open
wants to merge 145 commits into
base: main

PalkaPuri
Contributor

@PalkaPuri PalkaPuri commented Sep 11, 2024

I noticed that generating a combined DataFrame of predictors and scores took very long for large datasets (my kernel kept dying). The operation was computationally expensive because we were using df.merge, which searches through column values to find matching rows. We can get away with something simpler like pd.concat, since all predictor/score DataFrames inherit the grid from local_windows and hence their rows match by design. This should help clear the speed bottleneck.
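
A minimal sketch of the idea, assuming two predictor/score DataFrames that share the same grid rows and the same index. The column names (`x`, `y`, `time`, `forager`, `proximity`, `food`) and the toy values are illustrative assumptions, not the actual output of derive_predictors_and_scores:

```python
import pandas as pd

# Hypothetical stand-ins for two predictor/score DataFrames built on the same grid.
grid = pd.DataFrame(
    {
        "x": [0, 0, 1, 1],
        "y": [0, 1, 0, 1],
        "time": [0, 0, 0, 0],
        "forager": [1, 1, 1, 1],
    }
)
proximity = grid.assign(proximity=[0.1, 0.4, 0.7, 0.9])
food = grid.assign(food=[0.0, 0.5, 0.5, 1.0])

keys = ["x", "y", "time", "forager"]

# Key-based join: pandas matches rows by comparing the key column values.
merged = proximity.merge(food, on=keys, how="inner")

# Index-based join: rows are paired by index label, with no key lookup.
# This is only valid when both frames carry identical indexes and key columns.
concatenated = pd.concat([proximity, food.drop(columns=keys)], axis=1)

# On a shared grid the two approaches give the same table.
pd.testing.assert_frame_equal(merged, concatenated)
```

The equivalence holds here only because both frames were derived from the same `grid` without reordering or dropping rows.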

I also updated the function to optionally return eCDF-normalized values.

@PalkaPuri
Contributor Author

PalkaPuri commented Sep 25, 2024

Added an option to scale values based on the empirical CDF.
(Rescaling using min and max, which we were doing before, does nothing, as the predictors are already scaled to be between 0 and 1.)
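
To illustrate the difference, here is a small sketch comparing the two scalings on a toy column that already spans 0 to 1. The helper names `min_max_scale` and `ecdf_scale` are hypothetical and are not the functions used in this PR; the eCDF is computed here as the fraction of observations less than or equal to each value:

```python
import pandas as pd


def min_max_scale(values: pd.Series) -> pd.Series:
    """Linearly rescale values so that the minimum maps to 0 and the maximum to 1."""
    low, high = values.min(), values.max()
    return (values - low) / (high - low)


def ecdf_scale(values: pd.Series) -> pd.Series:
    """Map each value to the fraction of observations less than or equal to it."""
    return values.rank(method="max") / values.count()


# A skewed toy score column whose minimum is 0 and maximum is 1.
scores = pd.Series([0.0, 0.01, 0.02, 0.03, 1.0])

unchanged = min_max_scale(scores)  # identical to scores: 0.0, 0.01, 0.02, 0.03, 1.0
spread = ecdf_scale(scores)        # 0.2, 0.4, 0.6, 0.8, 1.0
```

Min/max scaling is the identity whenever a column already attains 0 and 1, whereas the eCDF spreads the values evenly by rank.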

dimkab and others added 3 commits October 2, 2024 16:01
…uff (collab2 ). (#126)

* Some work on the random_foragers notebook and fixing stuff.

* Linting + completing the random_foragers notebook.

* Finished random_foragers

* Interactive plots now should be displayed in HTML

* make format

* small fixes to random foragers

* Some more tweaks + zero-index fixes.

* Hungry birds simulation updated.

* Minor.

* Completed the follower NB.

* Saves the samples from each one of the R,H,F to disk, for later plotting in a single figure.

* Comparative fig.

* Minor.

* Make lint and format

* Typos

* Improved explanations of the predictors and the scores.

* Updated the model description in the random notebook.

* Minor

* Added option for initial positions. Updated RHF.

* reviewed random

* added toc to followers

* fixed followers

* Small formulas + model updates

* small modification

* small fixes, re-ran

* fixing save and display in follower

* format lint, dilling in hungry

---------

Co-authored-by: rfl-urbaniak <rfl.urbaniak@gmail.com>
Collaborator

@rfl-urbaniak rfl-urbaniak left a comment

Please pull the current version of staging from origin, resolve all conflicts, and make sure all the tests pass.

@rfl-urbaniak rfl-urbaniak added status:awaiting response Awaiting response from creator and removed status:WIP Work-in-progress not yet ready for review labels Oct 3, 2024
@PalkaPuri PalkaPuri changed the title Small change to derive_predictors_and_scores for speed Small change to derive_predictors_and_scores for speed + normalization Oct 6, 2024
@PalkaPuri PalkaPuri added status:awaiting review Awaiting response from reviewer and removed status:awaiting response Awaiting response from creator labels Oct 7, 2024
@PalkaPuri PalkaPuri changed the base branch from staging-collab-2 to main December 12, 2024 02:21
@PalkaPuri
Contributor Author

@rfl-urbaniak @dimkab Just finished going over this branch. The following changes were implemented:

  1. The final derivedDF is now generated by concatenating each predictor/score DF on the index instead of using df.merge, which was more computationally intensive.
  2. Added time logging for the generation of derivedDF and local_windows.
  3. Updated the UserWarning issued in case of NaNs: previously we said that X frames were dropped from derivedDF, but that is incorrect, as X is the number of rows of the DataFrame (which may come from any number of frames of the data).
  4. Updated the add_scaled_scores option to scale values according to the empirical CDF. Previously, scaling using min/max did nothing, as all predictors/scores are already scaled to be between 0 and 1.
  5. The output of the corresponding test notebook changed due to these updates.

Note: After merging these changes we would need to run all the doc notebooks again, as changes 2, 3, and 4 can potentially change the outputs of the cells. I did not do this just yet so as not to overwhelm you both with 100 file changes in one PR.
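
For reference, the distinction behind items 2 and 3 can be sketched as follows. The function `drop_nan_rows`, the `time` column name, and the warning wording are assumptions made for illustration and do not reproduce the code in this branch:

```python
import time
import warnings

import numpy as np
import pandas as pd


def drop_nan_rows(derived: pd.DataFrame, time_col: str = "time") -> pd.DataFrame:
    """Drop rows containing NaNs and warn with the row count, not a frame count."""
    nan_rows = derived.isna().any(axis=1)
    n_rows = int(nan_rows.sum())
    if n_rows:
        # Several dropped rows can belong to the same frame, so count frames separately.
        n_frames = derived.loc[nan_rows, time_col].nunique()
        warnings.warn(
            f"Dropped {n_rows} rows with NaN values, spread over {n_frames} frames.",
            UserWarning,
        )
    return derived.loc[~nan_rows]


derived = pd.DataFrame(
    {
        "time": [0, 0, 0, 1, 1, 2],
        "proximity": [0.1, np.nan, np.nan, 0.3, np.nan, 0.9],
    }
)

# Simple wall-clock timing around the step, in the spirit of the added time logging.
start = time.perf_counter()
cleaned = drop_nan_rows(derived)
elapsed = time.perf_counter() - start
```

In this toy table three rows are dropped but only two frames are affected, which is why reporting the number as "frames" was misleading.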

Collaborator

@rfl-urbaniak rfl-urbaniak left a comment

I am not convinced that replacing merge with concat has no strange consequences, at least for some of the predictors. As a sanity check, I re-ran the communicators inference with your proposed modification, and the posterior marginals are significantly different from the ones we currently have. I think this modification, if indeed correct, needs additional explicit tests that ensure proper functioning, involving predictors other than velocity too.
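
One possible shape for such a test is sketched below. It checks that the two strategies agree on aligned inputs and shows one hypothetical way they could diverge: a predictor frame whose rows were reordered and re-indexed. The helper names, key columns, and toy predictors (`velocity`, `access`) are assumptions; this is not a diagnosis of the discrepancy observed in the communicators inference:

```python
import pandas as pd

KEYS = ["x", "y", "time", "forager"]  # assumed grid key columns


def combine_by_merge(frames):
    """Join frames on the key columns, matching rows by key value."""
    out = frames[0]
    for frame in frames[1:]:
        out = out.merge(frame, on=KEYS, how="inner")
    return out.sort_values(KEYS).reset_index(drop=True)


def combine_by_concat(frames):
    """Join frames side by side, matching rows by index label only."""
    first, rest = frames[0], frames[1:]
    out = pd.concat([first] + [f.drop(columns=KEYS) for f in rest], axis=1)
    return out.sort_values(KEYS).reset_index(drop=True)


grid = pd.DataFrame(
    {"x": [0, 0, 1, 1], "y": [0, 1, 0, 1], "time": [0] * 4, "forager": [1] * 4}
)
velocity = grid.assign(velocity=[0.1, 0.2, 0.3, 0.4])
access = grid.assign(access=[0.9, 0.8, 0.7, 0.6])

# Aligned inputs: both strategies must produce the same table.
pd.testing.assert_frame_equal(
    combine_by_merge([velocity, access]), combine_by_concat([velocity, access])
)

# Same rows, but reordered and re-indexed: concat silently pairs the wrong rows.
shuffled = access.iloc[::-1].reset_index(drop=True)
merged = combine_by_merge([velocity, shuffled])
concatenated = combine_by_concat([velocity, shuffled])
assert not merged.equals(concatenated)
```

A test along these lines, run over every predictor and score rather than velocity alone, would show whether the "rows match by design" assumption actually holds for each of them.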

@rfl-urbaniak rfl-urbaniak added blocked do not merge until further discussion and removed status:awaiting review Awaiting response from reviewer labels Dec 31, 2024
Labels
blocked do not merge until further discussion enhancement New feature or request
3 participants