Add total.label argument for groupingsets, cube, rollup #5973

markseeto · 2024-03-03T19:53:12Z

Closes #5351

Added a total.label argument to the rollup.data.table, cube.data.table, and groupingsets.data.table functions.

A couple of comments:

It was sometimes not clear what should be an error, what should be a warning, and what should be neither. I made choices that seem reasonable to me, but there would be alternatives that are also reasonable.
I wondered whether all.label might be a better name than total.label, since the thing being calculated won't always be a total. But I left it as total.label because the documentation uses the terminology "(sub-)totals", and the word "total" is used in other software, so "total" seems to be widely accepted.

codecov · 2024-03-03T20:02:50Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.52%. Comparing base (b6d6100) to head (b6d1cf9).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5973      +/-   ##
==========================================
+ Coverage   97.51%   97.52%   +0.01%     
==========================================
  Files          80       80              
  Lines       14915    14987      +72     
==========================================
+ Hits        14544    14616      +72     
  Misses        371      371

jangorecki · 2024-03-04T14:02:13Z

As for the arg name I would keep it as "label"

jangorecki · 2024-03-04T14:05:20Z

R/groupingsets.R

@@ -57,6 +57,16 @@ groupingsets.data.table = function(x, j, by, sets, .SDcols, id = FALSE, jj, ...)
    stopf("Argument 'sets' must be a list of character vectors.")
  if (!is.logical(id))
    stopf("Argument 'id' must be a logical scalar.")
+  if (!(is.null(total.label) ||
+        (is.character(total.label) && length(total.label) == 1L) ||
+        (is.list(total.label) && all(vapply_1b(total.label, is.character)) &&


We can rule out list because AFAIU char vector will be sufficient. Please correct me if I am wrong.

@jangorecki Thanks for reviewing. To make sure I understand you correctly, do you mean that total.label = list(Region = "National", Product = "Total") should not be allowed, and it should instead be total.label = c(Region = "National", Product = "Total")? I would have to try it to be sure, but I think that would be OK.

Yes, as long as each element in the list is scalar then list is not necessary and character vector is sufficient.

@jangorecki Thanks, I'll try that.

I think I was wrong and labels need not necessarily be of character type, is that correct? Then we need a list type so mixed types can be provided.

It would be good to use a slightly more complex example like that, so this comes up straight away.
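To illustrate why a list is needed once labels have mixed types, here is a minimal base R sketch. The label values are illustrative only. Combining a character and an integer label with c() coerces both to character, whereas list() keeps the type of each element.

```r
# Illustrative label values only
as_vector <- c(A = "LabelA", B = 999L)
as_list   <- list(A = "LabelA", B = 999L)

# c() coerces every element to a common type, here character
class(as_vector)            # "character"
as_vector[["B"]]            # "999"

# list() preserves the type of each element
vapply(as_list, class, "")  # A is "character", B is "integer"
```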

Thanks @jangorecki. Currently it only supports character and factor grouping variables. For other types, it's left as NA.

In #5351 you wrote

Then we of course want total.label="Total" to be recycled to list(State = "Total", Group = "Total"). Of course only if all columns are character.

I interpreted this as meaning the label argument should only apply to variables of type character, but now I realise that maybe I interpreted incorrectly and you might have meant that total.label = "Total" should only be recycled to character variables, but non-character labels for non-character variables are still allowed in a list.

A couple of options (and there might be others):

1. Only apply the label argument to grouping variables of type character or factor, and require label = c(Region = "National", Product = "Total") instead of label = list(Region = "National", Product = "Total").

2. Allow the label argument to apply to grouping variables that aren't character or factor, and allow total.label to be a list so that different types can be specified for different variables.

Which do you prefer? As I mentioned above, if I had a date or integer grouping variable, I would probably still want a character total label (e.g. "Total") rather than a date or integer total label. To do this, currently the user would have to convert the variable to character or factor before using groupingsets/cube/rollup.

Maybe for non-character grouping variables, we could allow a choice of specifying a label of the same type as the variable, or a character label, and if a character label is provided then the variable will be converted to character in the output. For example, if A is character and B is integer, then label = list(A = "LabelA", B = 999L) and label = list(A = "LabelA", B = "LabelB") would both be allowed, and the second would result in B being character in the output. If we do that, then I think I would favour having label = "Label" apply to all variables, not just character variables.

Changing the column type is not an option: column types must stay the same as the original columns used in 'by' and must not depend on the labels (so no to option B). Option 2. We cannot apply a label to all variables because we must match column types; we don't just print results but return a data.table, so saying it is only for output won't be useful. Our output is a data.table and its column types cannot depend on the label argument. It is the label argument that has to match the column types of the 'by' columns in the output.

By "output", I meant the data.table that is returned. Sorry for the confusion.

Thanks for clarifying that the column types have to stay the same. That makes it simpler.

Do you think the shorter form label = "Total" should be allowed, or should label always be a named list? If the shorter form is allowed, should it always have to be character, or would something like label = 999L be allowed and apply to all the integer grouping variables?

It should be allowed, but for non-character columns it will result in NA, which is likely to be expected anyway. Similarly, 999L can also be allowed, but would result in NA for non-integer columns.
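The rule described in this exchange can be sketched in base R as follows. This is only an illustration of the idea, not the package's implementation: total_value is a hypothetical helper, and comparing classes with identical() is an assumption about how "matching type" might be decided.

```r
# Hypothetical helper, not part of data.table: choose the value to place in a
# "total" row of one grouping column, given a scalar label.
total_value <- function(column, label) {
  if (identical(class(column), class(label))) {
    label   # types match: use the label
  } else {
    NA      # types differ: fall back to NA so the column type is unchanged
  }
}

total_value(c("North", "South"), "Total")  # character column, character label: label is used
total_value(c(1L, 2L), "Total")            # integer column, character label: NA
total_value(c(1L, 2L), 999L)               # integer column, integer label: label is used
total_value(c("North", "South"), 999L)     # character column, integer label: NA
```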

Thanks, I'll give that a try.

markseeto · 2024-03-04T16:38:18Z

> As for the arg name I would keep it as "label"

@jangorecki Thanks for your comment. Do you mean change it from "total.label" to "label", or keep it as "total.label"?

jangorecki · 2024-03-04T17:01:21Z

Yes, change to the shorter "label". Grouping sets are all about (sub-)totals, so we don't need to name the argument so explicitly and verbosely. Possibly even better than "label" would be "labels", so it is clear that multiple values are supported.

jangorecki · 2024-03-04T17:03:29Z

Very good PR. What would be even more useful is a minimal usage example in the NEWS entry, including different data types for labels.

markseeto · 2024-03-04T17:06:18Z

> Yes, change to the shorter "label". Grouping sets are all about (sub-)totals, so we don't need to name the argument so explicitly and verbosely. Possibly even better than "label" would be "labels", so it is clear that multiple values are supported.

@jangorecki I prefer "total.label" or "total.labels" because they make it easier to guess where the label is going to be used, but I don't feel strongly about this, so I'm happy to change it to "labels".

markseeto · 2024-03-04T17:07:39Z

> Very good PR. What would be even more useful is a minimal usage example in the NEWS entry, including different data types for labels.

@jangorecki Thanks for the feedback. I can add an example to the NEWS entry.

jangorecki · 2024-03-04T18:20:31Z

> @jangorecki I prefer "total.label" or "total.labels" because they make it easier to guess where the label is going to be used, but I don't feel strongly about this, so I'm happy to change it to "labels".

Another reason why I prefer a single-word argument name is that it does not enforce any naming style, as total.label, totalLabel, or total_label would.

markseeto · 2024-03-04T19:05:27Z

> Another reason why I prefer a single-word argument name is that it does not enforce any naming style, as total.label, totalLabel, or total_label would.

OK, I can change it. Maybe "label" is better than "labels", since "labels" is a base function.

markseeto · 2024-04-04T19:04:54Z

@jangorecki I've updated the functions, tests, documentation, and news, based on our discussions. When you have time, please let me know what you think.

I wasn't sure about the correct way to update a PR. Apologies if I've done it incorrectly.

I don't understand the reason for the red cross next to the News commit. When I click on the cross, it says "R-CMD-check / windows-latest (devel) (pull_request)" was cancelled, but in the list of checks, it says that check was successful.

Thanks.

MichaelChirico · 2024-09-06T05:08:24Z

@jangorecki any final review here? I can help merge to master & other tidying.

jangorecki

I haven't looked closely at the logic and test coverage, but I found a few places that don't look perfect.

NEWS.md

R/groupingsets.R

man/groupingsets.Rd

Add 'label' argument to the groupingsets.data.table(), cube.data.table(), and rollup.data.table() functions.

Add to the Usage, Arguments, Details, and Examples sections.

markseeto · 2024-09-21T07:56:22Z

@jangorecki Thanks for reviewing. I've made the changes you asked for.

Regarding the code-quality / lint-r check, I fixed some of them but I hope it will be OK to leave long variable names because they're more descriptive that way.

jangorecki · 2024-09-22T07:55:34Z

Please avoid force-pushing when a PR is already under review.

markseeto · 2024-09-22T09:01:15Z

> Please avoid force-pushing when a PR is already under review.

Thanks for approving and thanks for this feedback. I don't know what "force push" means but will try to find out so I can avoid doing it.

MichaelChirico · 2024-09-23T00:18:17Z

> I don't know what "force push" means

Here's what we're seeing:

Most of the time when this happens, you get an error from git push about a mismatch between remote and local, and I believe the error tells you to retry with --force. If you saw something about "incompatible histories", I've also been running into the same of late, not sure why. Just try to be cognizant that --force should be avoided when possible, since it makes the history harder to track.

MichaelChirico · 2024-09-23T00:24:31Z

One other note: I see this is your markseeto:master branch in your fork. In my experience it's smart to make fork edits in a dedicated branch; there are some subtle headaches in dealing with PRs from fork branches named master.

~~Anyway, you'll get an invite to be a data.table project member, so you'll be able to create branches directly on this repo going forward, so~~ You're already a member: https://github.com/orgs/Rdatatable/people/markseeto, I guess this PR was created before the invite. The point is kind of moot for data.table specifically, but it's good advice for PRs to other packages.

NEWS.md

MichaelChirico · 2024-09-23T00:34:10Z

R/groupingsets.R

+        (is.atomic(label) && length(label) == 1L) ||
+        (is.list(label) && all(vapply_1b(label, is.atomic)) && all(lengths(label) == 1L) && !is.null(names(label)))))
+    stopf("Argument 'label', if not NULL, must be a scalar or a named list of scalars.")
+  if (is.list(label) && !is.null(names(label)) && ("" %chin% names(label) || anyNA(names(label))))


is this !is.null(names(label)) check redundant? Since we have is.list(label) && ... !is.null(names(label)) in the above requirement?

I do find it a bit surprising that the above check requires "a named list of scalars" but we have a separate test for "all list elements must be named", maybe best to add in the check for ""/NA names to the above condition?

We can address this as a small follow-up PR if you agree, don't want to hold the PR back further.

> is this !is.null(names(label)) check redundant?

Yes, I think you're right.

> I do find it a bit surprising that the above check requires "a named list of scalars" but we have a separate test for "all list elements must be named", maybe best to add in the check for ""/NA names to the above condition?

Maybe, but with separate checks the error messages can be more specific, the second one being for the situation where label is a named list but not all elements have a name. If we combine the error messages into one, it would be something like "Argument 'label', if not NULL, must be (1) a scalar, or (2) a named list with each element being named and each element being a scalar." Or "Argument 'label', if not NULL, must be (1) a scalar, or (2) a named list with no names being "" or NA and each element being a scalar." I think these are less clear and less helpful than having separate error messages depending on the situation. If you disagree, please let me know. It's not something I feel strongly about.
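The two-step validation discussed here can be sketched in base R as below. This is a stand-in rather than the package code: check_label is a hypothetical name, stop() replaces stopf(), vapply() replaces the internal vapply_1b(), %in% replaces %chin%, and the wording of the second error message is an assumption based on the phrase quoted above.

```r
# Hypothetical stand-in for the 'label' validation, using base R only
check_label <- function(label) {
  if (is.null(label)) return(invisible(TRUE))
  is_scalar <- is.atomic(label) && length(label) == 1L
  is_named_scalar_list <- is.list(label) &&
    all(vapply(label, is.atomic, logical(1L))) &&
    all(lengths(label) == 1L) &&
    !is.null(names(label))
  # First check: overall shape of the argument
  if (!(is_scalar || is_named_scalar_list))
    stop("Argument 'label', if not NULL, must be a scalar or a named list of scalars.")
  # Second, more specific check: a list that has names, but not for every element.
  # No extra !is.null(names(label)) is needed, since the first check guarantees it.
  if (is.list(label) && ("" %in% names(label) || anyNA(names(label))))
    stop("When argument 'label' is a list, all of the list elements must be named.")
  invisible(TRUE)
}

check_label("Total")                                     # passes
check_label(list(Region = "National", Product = 999L))   # passes
```

Keeping the checks separate means a partially named list such as list("National", Product = "Total") reaches the second, more targeted message instead of the general one.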

R/groupingsets.R

MichaelChirico · 2024-09-23T01:34:30Z

> I fixed some of them but I hope it will be OK to leave long variable names because they're more descriptive that way.

Honestly I think it's overdoing it here. Those variables have very limited scope for an error branch. I just used simple names idx/info here, I don't think the code is noticeably less readable.

MichaelChirico

Great, thanks!

markseeto · 2024-09-23T10:52:34Z

@MichaelChirico Thanks for checking and improving the code. I'll study your revised code and learn from it.

> Here's what we're seeing:
>
> Most of the time when this happens, you get an error from git push about a mismatch between remote and local, and I believe the error tells you to retry with --force. If you saw something about "incompatible histories", I've also been running into the same of late, not sure why. Just try to be cognizant that --force should be avoided when possible, since it makes the history harder to track.

Thanks for the explanation. I see that too. I used the GitHub GUI in the browser rather than command-line git (I haven't learned command-line git yet). I didn't explicitly select "force push", but just synced my fork and then made the changes, and GitHub automatically did the force push. Maybe I should have just changed the files without syncing the fork first.

> One other note: I see this is your markseeto:master branch in your fork. In my experience it's smart to make fork edits in a dedicated branch; there are some subtle headaches in dealing with PRs from fork branches named master.

Thanks, I'll keep that in mind.

> I guess this PR was created before the invite.

Yes, this PR was started before I was a member.

markseeto requested review from jangorecki and MichaelChirico as code owners March 3, 2024 19:53

MichaelChirico removed their request for review March 3, 2024 19:55

jangorecki reviewed Mar 4, 2024

markseeto closed this Apr 4, 2024

markseeto force-pushed the master branch from 3c6487b to 502c59e Compare April 4, 2024 17:46

markseeto reopened this Apr 4, 2024

MichaelChirico added this to the 1.17.0 milestone Sep 6, 2024

jangorecki requested changes Sep 6, 2024

markseeto closed this Sep 21, 2024

markseeto force-pushed the master branch from b6d1cf9 to 0e257a8 Compare September 21, 2024 04:04

markseeto added 5 commits September 21, 2024 14:16

Add 'label' argument

7f51d08

Add 'label' argument to the groupingsets.data.table(), cube.data.table(), and rollup.data.table() functions.

Add tests for groupingsets/cube/rollup 'label' argument

96281c0

Add information for 'label' argument

78894ed

Add to the Usage, Arguments, Details, and Examples sections.

Add item for groupingsets/cube/rollup 'label' argument

8af6585

Make changes following linter warnings

32e6a64

markseeto reopened this Sep 21, 2024

markseeto requested a review from jangorecki September 21, 2024 19:47

jangorecki approved these changes Sep 22, 2024

avoid stop(paste0), use brackify()

9e8d3d4

update test for brackify()

a11b4fa

MichaelChirico reviewed Sep 23, 2024

MichaelChirico added 2 commits September 22, 2024 17:29

lowercase 'r' in code gate hint

badb85d

style on long if condition

967f959

MichaelChirico reviewed Sep 23, 2024

MichaelChirico added 2 commits September 22, 2024 17:37

Use .shallow() over a full copy

0f676e9

save names(label) for reuse; more .shallow() usage

3ddc20b

MichaelChirico reviewed Sep 23, 2024

MichaelChirico mentioned this pull request Sep 23, 2024

New internal class1 helper #6522

Closed

MichaelChirico added 3 commits September 22, 2024 18:20

simplify with mapply

20a9eb0

Build info with gettextf() for i18n

be03776

short names

32c982b

MichaelChirico added 2 commits September 22, 2024 18:41

More restricted scoping, building message with gettextf

5aa23b9

consistency: name 'info'

4583d0d

MichaelChirico approved these changes Sep 23, 2024

Merge branch 'master' into master

e582b5a

MichaelChirico merged commit 1494900 into Rdatatable:master Sep 23, 2024
7 of 8 checks passed

MichaelChirico mentioned this pull request Sep 23, 2024

Add Mark to DESCRIPTION #6523

Merged
