R2 values of -inf when no neutralization is observed (i.e. a flat line at 1) #55

Closed
anloes opened this issue Mar 22, 2024 · 1 comment · Fixed by #57

@anloes

anloes commented Mar 22, 2024

When a virus is not neutralized, the resulting fraction infectivity values can be fit by a flat line at 1.0. In the current version of neutcurve, this type of data can produce a fit with an r2 value of negative infinity. Filtering on a minimum r2 value would therefore remove these curves from the analysis. This is not ideal, as non-neutralization of a given strain is a reasonable and expected result in many cases.

@jbloom
Member

jbloom commented Mar 24, 2024

This really reflects a conceptual as much as technical issue.

The r2 value is the coefficient of determination, which reflects how much of the total variation in the data (the total sum of squares of the data, $\sum_i \left(y_i - \langle y \rangle\right)^2$) is explained by the fitted sigmoid.

The total variation in the data is just the summed squared difference of all of the data points (fraction infectivity at each concentration) from a straight horizontal line drawn at the mean fraction infectivity.

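The variation not explained by the fit is the sum of squared residuals (the squared differences between each data point and the fitted curve), and r2 is one minus the ratio of this to the total variation.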

So when all the data points are on a perfectly straight line, both the variation in the data and the residuals are zero. So arguably, yes, this should give a value of 1 for the r2 rather than negative infinity, and I will fix this.
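As a rough illustration of the degenerate case described above, here is a minimal sketch of an r2 calculation that checks for zero total variation before dividing. The function name `coefficient_of_determination` is hypothetical and this is not neutcurve's actual code; in particular, returning negative infinity when the data are flat but the fit is not is an assumption made for the sketch.

```python
def coefficient_of_determination(observed, fitted):
    """Return r2 for observed values and the fitted curve's values.

    Hypothetical helper for illustration, not part of neutcurve.
    """
    mean_obs = sum(observed) / len(observed)
    # Total variation: squared deviations from a horizontal line at the mean.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    # Unexplained variation: squared residuals from the fitted curve.
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    if ss_tot == 0:
        # Flat data: treat a fit with zero residuals as perfect.
        # Assumption: any nonzero residual on flat data gives -inf.
        return 1.0 if ss_res == 0 else float("-inf")
    return 1.0 - ss_res / ss_tot


# All points exactly on a flat line at 1.0, fit by that same line.
print(coefficient_of_determination([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))  # 1.0
```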

But more generally, when the data fall along a straight line with just a tiny bit of noise (jitter), then the r2 will be zero (or close to zero) since a sigmoid cannot fit the data better than a straight line.

But really, if we are looking at data with no neutralization, we don't want to classify this as a bad fit.

So I think really we need QC to include two quantities:

  • the coefficient of determination (r2)
  • the root mean squared deviation (rmsd), which quantifies the absolute amount that the points vary from the fitted curve

For non-neutralized data, we could have a good fit with a very poor r2, but then the rmsd will be very small. So I will add computation of that from the curve fits, and then curves can be QC-ed by looking at both the r2 and rmsd.
A fit should be considered good if either r2 is close to one or rmsd is close to zero.
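The either/or rule above could be sketched roughly as follows. The helper names `rmsd` and `is_good_fit`, the thresholds, and the inclusive comparisons are all assumptions chosen for illustration; they are not taken from neutcurve.

```python
import math


def rmsd(observed, fitted):
    """Root mean squared deviation of the points from the fitted curve."""
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return math.sqrt(ss_res / len(observed))


def is_good_fit(r2, rmsd_value, min_r2, max_rmsd):
    """Pass if EITHER criterion is met, so a curve must fail both to be dropped."""
    return r2 >= min_r2 or rmsd_value <= max_rmsd


# Non-neutralized data: a flat line at 1.0 with a little jitter.
observed = [1.02, 0.98, 1.01, 0.99]
fitted = [1.0, 1.0, 1.0, 1.0]
jitter_rmsd = rmsd(observed, fitted)  # small, roughly 0.016

# With r2 near zero (as expected for jittered flat data), the small rmsd
# still lets the curve pass. Thresholds here are hypothetical.
print(is_good_fit(r2=0.0, rmsd_value=jitter_rmsd, min_r2=0.8, max_rmsd=0.05))
```

A curve with both a low r2 and a large rmsd (say r2 of 0.3 and rmsd of 0.2 under the same hypothetical thresholds) would fail both tests and be flagged.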

@jbloom jbloom linked a pull request Mar 25, 2024 that will close this issue
jbloom added a commit that referenced this issue Mar 25, 2024
Improvements to metrics for assessing curve fit (see [here](#55 (comment))):
  - The coefficient of determination (``r2``) now is one if all points are fit by a straight line, rather than negative infinity.
  - A root-mean-square deviation (square root of the mean squared residual) is now calculated as the ``rmsd`` attribute of ``HillCurve`` objects and reported in fit parameter summaries from ``CurveFits``.
jbloom added a commit to jbloomlab/seqneut-pipeline that referenced this issue Mar 26, 2024
- In `process_plate_curvefit_qc` in the YAML configuration, there is a new key called `goodness_of_fit`, and both `min_R2` (the minimum coefficient of determination) and `max_RMSD` (the maximum root mean square deviation) for each curve fit are now specified as keys under it. The curves are then filtered to retain only those that meet *either* of these criteria (so a curve must fail both to be dropped). Addresses [this issue](#33) and [this issue](jbloomlab/neutcurve#55 (comment)). Alongside this change, the `rmsd` is now reported in key output files. Also, in the tabulation of failures, `fails_min_R2` now becomes `fails_goodness_of_fit`.
  - This is a **backward-incompatible change** in the configuration YAML. Previously `min_R2` was a standalone key under `process_plate_curvefit_qc`; now `goodness_of_fit` is the required key and `min_R2` and `max_RMSD` are required keys under it.

- Added another plate (of H3N2 rather than H1N1) to the `test_example` to test some of the changes introduced in this version.
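For the `goodness_of_fit` configuration change described in the seqneut-pipeline commit above, the new nesting might look roughly like the fragment below. The threshold values are hypothetical placeholders, any other keys under `process_plate_curvefit_qc` are omitted, and where this block sits within the full configuration file is not shown here.

```yaml
# Old layout (no longer accepted after this change):
# process_plate_curvefit_qc:
#   min_R2: 0.8

# New layout: both thresholds are required keys under goodness_of_fit.
process_plate_curvefit_qc:
  goodness_of_fit:
    min_R2: 0.8    # hypothetical value; minimum coefficient of determination
    max_RMSD: 0.1  # hypothetical value; maximum root mean square deviation
```

A curve is retained if it meets either threshold, so with this layout only curves that fall below `min_R2` and also exceed `max_RMSD` would be counted under `fails_goodness_of_fit`.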