Commit

names(.SD) should work (#4163)
* Update data.table.R

* Update tests.Rraw

* Update data.table.R

* Update tests.Rraw

* Update datatable-reference-semantics.Rmd

* Update assign.Rd

* Update NEWS.md

* Update NEWS.md

* Update data.table.R

* Update tests.Rraw

* Update tests.Rraw

* Update data.table.R

* Update tests.Rraw

* replace iris with raw dataset

* Update tests.Rraw

* update replace_names_sd and made .SD := not work

* change .SD to names(.SD)

* update typo; change .SD to names(.SD)

* update to names(.SD)

* include names(.SD) and fx to .SD usage

I may have gone too far. There's no use of `(cols) := ...` now, but there is at least a reference to the other vignette.

* Updates news to names(.SD)

* Update typo.

* tweak NEWS

* minor grammar

* jans comment

* jan's comment (ii)

* added "footnote"

* Add is.name(e[[2L]])

* Put tests above Add new tests here

* added test to test names(.SD(2))

* include .SDcols in example for assign

* included .SDcols = function example

* test 2138 is greater than 2137

* bad merge

* Make updates per Michael's comments.

* Added test where .SD is used as well as some columns not in .SD.

* Mention count of reactions in issue

* small copy-edit

* more specific

* specify LHS/RHS

* Simplify implementation to probe for names(.SD) and new test

* fine-tune comment

---------

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
ColeMiller1 and MichaelChirico authored Mar 20, 2024
1 parent 958e3dd commit 3eefbca
Showing 6 changed files with 78 additions and 30 deletions.
2 changes: 2 additions & 0 deletions NEWS.md
Expand Up @@ -22,6 +22,8 @@

5. `transpose` gains `list.cols=` argument, [#5639](https://github.com/Rdatatable/data.table/issues/5639). Use this to return output with list columns and avoid type promotion (an exception is `factor` columns which are promoted to `character` for consistency between `list.cols=TRUE` and `list.cols=FALSE`). This is convenient for creating a row-major representation of a table. Thanks to @MLopez-Ibanez for the request, and Benjamin Schwendinger for the PR.

4. Using `dt[, names(.SD) := lapply(.SD, fx)]` now works, [#795](https://github.com/Rdatatable/data.table/issues/795) -- one of our [most-requested issues (see #3189)](https://github.com/Rdatatable/data.table/issues/3189). Thanks to @brodieG for the report, 20 or so others for chiming in, and @ColeMiller1 for the PR.
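
A minimal sketch of the behaviour this entry describes, assuming a data.table build that includes this change. The table `dt` and the function `fx` below are made up for illustration and do not come from the commit.

```r
library(data.table)

# Hypothetical toy table and function, used only for illustration
dt = data.table(id = c("x", "y", "z"), a = 1:3, b = 4:6)
fx = function(col) col * 10L

# names(.SD) on the LHS resolves to the columns picked by .SDcols,
# here c("a", "b"), so `id` is left untouched
dt[, names(.SD) := lapply(.SD, fx), .SDcols = c("a", "b")]
```

The update happens in place, so no reassignment of `dt` is needed.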

## BUG FIXES

1. `unique()` returns a copy in the case when `nrow(x) <= 1` instead of a mutable alias, [#5932](https://github.com/Rdatatable/data.table/pull/5932). This is consistent with existing `unique()` behavior when the input has no duplicates but more than one row. Thanks to @brookslogan for the report and @dshemetov for the fix.
4 changes: 2 additions & 2 deletions R/data.table.R
Expand Up @@ -1122,8 +1122,8 @@ replace_dot_alias = function(e) {
if (is.name(lhs)) {
lhs = as.character(lhs)
} else {
# e.g. (MyVar):= or get("MyVar"):=
lhs = eval(lhs, parent.frame(), parent.frame())
# lhs is e.g. (MyVar) or get("MyVar") or names(.SD) || setdiff(names(.SD), cols)
lhs = eval(lhs, list(.SD = setNames(logical(length(sdvars)), sdvars)), parent.frame())
}
} else {
# `:=`(c2=1L,c3=2L,...)
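
The changed `eval()` call can be pictured outside of data.table with base R alone. The sketch below is an illustration of the idea, not the package's internal code path: `sdvars` is a hypothetical stand-in for the columns chosen by `.SDcols`, and the three `lhs_*` expressions are example left-hand sides.

```r
# Stand-in for the columns that .SDcols selected (hypothetical names)
sdvars = c("a", "b", "c")

# Dummy .SD: only its names matter, so the values are cheap placeholders
dummy_sd = setNames(logical(length(sdvars)), sdvars)

# Unevaluated LHS expressions, as they might appear before `:=`
lhs_all    = quote(names(.SD))
lhs_subset = quote(setdiff(names(.SD), "a"))
lhs_suffix = quote(paste(names(.SD), "max", sep = "_"))

# Evaluate each with .SD bound to the dummy; any other symbol
# is looked up in the enclosing environment
res_all    = eval(lhs_all,    list(.SD = dummy_sd), parent.frame())
res_subset = eval(lhs_subset, list(.SD = dummy_sd), parent.frame())
res_suffix = eval(lhs_suffix, list(.SD = dummy_sd), parent.frame())
```

Because the placeholder carries only names, the probe never needs the real column data to work out which columns the LHS refers to.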
30 changes: 30 additions & 0 deletions inst/tests/tests.Rraw
Expand Up @@ -18359,3 +18359,33 @@ test(2249.2, indices(DT[, .SD]), 'x')
setindex(DT, y)
test(2249.3, indices(DT), c('x', 'y'))
test(2249.4, indices(DT[, .SD]), c('x', 'y'))

# make names(.SD) work - issue #795
dt = data.table(a = 1:4, b = 5:8)
test(2250.01, dt[, names(.SD) := lapply(.SD, '*', 2), .SDcols = 1L], data.table(a = 1:4 * 2, b = 5:8))
test(2250.02, dt[, names(.SD) := lapply(.SD, '*', 2), .SDcols = 2L], data.table(a = 1:4 * 2, b = 5:8 * 2))
test(2250.03, dt[, names(.SD) := lapply(.SD, as.integer)], data.table(a = as.integer(1:4 * 2), b = as.integer(5:8 * 2)))
test(2250.04, dt[1L, names(.SD) := lapply(.SD, '+', 2L)], data.table(a = as.integer(c(4, 2:4 * 2)), b = as.integer(c(12, 6:8 * 2))))
test(2250.05, dt[, setdiff(names(.SD), 'a') := NULL], data.table(a = as.integer(c(4, 2:4 * 2))))
test(2250.06, dt[, c(names(.SD)) := NULL], null.data.table())

dt = data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c'))
test(2250.07, dt[, names(.SD) := lapply(.SD, max), by = grp], data.table(a = c(2L, 2L, 3L, 4L), b = c(6L, 6L, 7L, 8L), grp = c('a', 'a', 'b', 'c')))

dt = data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c'))
keep = c('a', 'b')
test(2250.08, dt[, names(.SD) := NULL, .SDcols = !keep], data.table(a = 1:4, b = 5:8))

dt = data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c'))
test(2250.09, dt[, paste(names(.SD), 'max', sep = '_') := lapply(.SD, max), by = grp] , data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c'), a_max = c(2L, 2L, 3L, 4L), b_max = c(6L, 6L, 7L, 8L)))

dt = data.table(a = 1:3, b = 5:7, grp = c('a', 'a', 'b'))
test(2250.10, dt[1:2, paste(names(.SD), 'max', sep = '_') := lapply(.SD, max), by = grp], data.table(a = 1:3, b = 5:7, grp = c('a', 'a', 'b'), a_max = c(2L, 2L, NA_integer_), b_max = c(6L, 6L, NA_integer_)))
test(2250.11, dt[, names(.SD(2)) := lapply(.SD, .I)], error = 'could not find function ".SD"')

dt = data.table(a = 1:3, b = 5:7, grp = c('a', 'a', 'b'))
test(2250.12, dt[, names(.SD) := lapply(.SD, \(x) x + b), .SDcols = "a"], data.table(a = 1:3 + 5:7, b = 5:7, grp = c('a', 'a', 'b')))


dt = data.table(a = 1L, b = 2L, c = 3L, d = 4L, e = 5L, f = 6L)
test(2250.13, dt[, names(.SD)[1:5] := sum(.SD)], data.table(a = 21L, b = 21L, c = 21L, d = 21L, e = 21L, f = 6L))
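
For readers skimming the tests, the pattern exercised by test 2250.09 (deriving new column names from `names(.SD)`) can be run on its own. This sketch reuses the data from that test; the table name `scores` is an arbitrary choice, and a data.table build containing this commit is assumed.

```r
library(data.table)

# Same data as test 2250.09, under an arbitrary table name
scores = data.table(a = 1:4, b = 5:8, grp = c("a", "a", "b", "c"))

# LHS builds "a_max" and "b_max" from the .SD column names;
# RHS computes the per-group maximum of each .SD column
scores[, paste(names(.SD), "max", sep = "_") := lapply(.SD, max), by = grp]
```

The original `a` and `b` columns are kept, and the grouping column `grp` is not part of `.SD`, so no `grp_max` column is created.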
3 changes: 3 additions & 0 deletions man/assign.Rd
Expand Up @@ -26,6 +26,9 @@
# LHS2 = RHS2,
# ...), by = ...]

# 3. Multiple columns in place
# DT[i, names(.SD) := lapply(.SD, fx), by = ..., .SDcols = ...]

set(x, i = NULL, j, value)
}
\arguments{
17 changes: 17 additions & 0 deletions vignettes/datatable-reference-semantics.Rmd
Expand Up @@ -258,6 +258,23 @@ flights[, c("speed", "max_speed", "max_dep_delay", "max_arr_delay") := NULL]
head(flights)
```

#### -- How can we update multiple existing columns in place using `.SD`?

```{r}
flights[, names(.SD) := lapply(.SD, as.factor), .SDcols = is.character]
```
Let's clean up again and convert our newly-made factor columns back into character columns. This time we will make use of `.SDcols` accepting a function to decide which columns to include. In this case, `is.factor()` will return the columns which are factors. For more on the **S**ubset of the **D**ata, there is also an [SD Usage vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-sd-usage.html).

Sometimes, it is also nice to keep track of columns that we transform. That way, even after we convert our columns we would be able to call the specific columns we were updating.
```{r}
factor_cols <- sapply(flights, is.factor)
flights[, names(.SD) := lapply(.SD, as.character), .SDcols = factor_cols]
str(flights[, ..factor_cols])
```
#### {.bs-callout .bs-callout-info}

* We also could have used `(factor_cols)` on the `LHS` instead of `names(.SD)`.
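
As a hedged illustration of the `(cols)` alternative, the sketch below uses a small made-up table rather than `flights`. It stores the column names as a character vector, which is an assumption of this example: a comment removed from the SD-usage vignette in this same commit notes that `:=` does not accept a logical vector on the LHS.

```r
library(data.table)

# Hypothetical toy table standing in for `flights`
dt = data.table(
  carrier = factor(c("AA", "UA")),
  origin  = factor(c("JFK", "LGA")),
  delay   = c(5L, -2L)
)

# Record the factor columns by name before converting them
factor_cols = names(dt)[sapply(dt, is.factor)]

# Parentheses make data.table treat factor_cols as a vector of
# column names rather than as a new column called "factor_cols"
dt[, (factor_cols) := lapply(.SD, as.character), .SDcols = factor_cols]
```

Keeping `factor_cols` around means the converted columns can still be selected afterwards, for example with `dt[, ..factor_cols]`.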

## 3. `:=` and `copy()`

`:=` modifies the input object by reference. Apart from the features we have discussed already, sometimes we might want to use the update by reference feature for its side effect. And at other times it may not be desirable to modify the original object, in which case we can use `copy()` function, as we will see in a moment.
52 changes: 24 additions & 28 deletions vignettes/datatable-sd-usage.Rmd
Expand Up @@ -77,7 +77,15 @@ The first way to impact what `.SD` is is to limit the _columns_ contained in `.S
Pitching[ , .SD, .SDcols = c('W', 'L', 'G')]
```

This is just for illustration and was pretty boring. But even this simple usage lends itself to a wide variety of highly beneficial / ubiquitous data manipulation operations:
This is just for illustration and was pretty boring. In addition to accepting a character vector, `.SDcols` also accepts:

1. any function such as `is.character` to filter _columns_
2. the function^{*} `patterns()` to filter _column names_ by regular expression
3. integer and logical vectors

*see `?patterns` for more details
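
The accepted forms can be compared side by side on a tiny table. This is a sketch with made-up data (the column names merely echo the `Teams` example), and returning `names(.SD)` in `j` is just a convenient way to show which columns each form selects.

```r
library(data.table)

# Hypothetical miniature of the Teams table
dt = data.table(
  teamID      = c("BOS", "NYA"),
  teamIDretro = c("BOS", "NYA"),
  W           = c(95L, 103L),
  L           = c(67L, 59L)
)

by_name     = dt[, names(.SD), .SDcols = c("W", "L")]                # character vector
by_function = dt[, names(.SD), .SDcols = is.character]               # function applied to each column
by_pattern  = dt[, names(.SD), .SDcols = patterns("^teamID")]        # regex on column names
by_position = dt[, names(.SD), .SDcols = 3:4]                        # integer positions
by_logical  = dt[, names(.SD), .SDcols = c(FALSE, FALSE, TRUE, TRUE)] # include/exclude flags
```

Here `by_function` and `by_pattern` select the two `teamID` columns, while the other three forms select `W` and `L`.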

This simple usage lends itself to a wide variety of highly beneficial / ubiquitous data manipulation operations:

## Column Type Conversion

Expand All @@ -91,52 +99,40 @@ We notice that the following columns are stored as `character` in the `Teams` da
# teamIDretro: Team ID used by Retrosheet
fkt = c('teamIDBR', 'teamIDlahman45', 'teamIDretro')
# confirm that they're stored as `character`
Teams[ , sapply(.SD, is.character), .SDcols = fkt]
str(Teams[ , ..fkt])
```

If you're confused by the use of `sapply` here, note that it's quite similar for base R `data.frames`:

```{r identify_factors_as_df}
setDF(Teams) # convert to data.frame for illustration
sapply(Teams[ , fkt], is.character)
setDT(Teams) # convert back to data.table
```

The key to understanding this syntax is to recall that a `data.table` (as well as a `data.frame`) can be considered as a `list` where each element is a column -- thus, `sapply`/`lapply` applies the `FUN` argument (in this case, `is.character`) to each _column_ and returns the result as `sapply`/`lapply` usually would.

The syntax to now convert these columns to `factor` is very similar -- simply add the `:=` assignment operator:
The syntax to now convert these columns to `factor` is simple:

```{r assign_factors}
Teams[ , (fkt) := lapply(.SD, factor), .SDcols = fkt]
Teams[ , names(.SD) := lapply(.SD, factor), .SDcols = patterns('teamID')]
# print out the first column to demonstrate success
head(unique(Teams[[fkt[1L]]]))
```

Note that we must wrap `fkt` in parentheses `()` to force `data.table` to interpret this as column names, instead of trying to assign a column named `'fkt'`.
Note:

Actually, the `.SDcols` argument is quite flexible; above, we supplied a `character` vector of column names. In other situations, it is more convenient to supply an `integer` vector of column _positions_ or a `logical` vector dictating include/exclude for each column. `.SDcols` even accepts regular expression-based pattern matching.
1. The `:=` is an assignment operator to update the `data.table` in place without making a copy. See [reference semantics](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reference-semantics.html) for more.
2. The LHS, `names(.SD)`, indicates which columns we are updating - in this case we update the entire `.SD`.
3. The RHS, `lapply()`, loops through each column of the `.SD` and converts the column to a factor.
4. We use `.SDcols` to select only the columns whose names match the pattern `teamID`.

Again, the `.SDcols` argument is quite flexible; above, we supplied `patterns` but we could have also supplied `fkt` or any `character` vector of column names. In other situations, it is more convenient to supply an `integer` vector of column _positions_ or a `logical` vector dictating include/exclude for each column. Finally, the use of a function to filter columns is very helpful.

For example, we could do the following to convert all `factor` columns to `character`:

```{r sd_as_logical}
# while .SDcols accepts a logical vector,
# := does not, so we need to convert to column
# positions with which()
fkt_idx = which(sapply(Teams, is.factor))
Teams[ , (fkt_idx) := lapply(.SD, as.character), .SDcols = fkt_idx]
head(unique(Teams[[fkt_idx[1L]]]))
fct_idx = Teams[, which(sapply(.SD, is.factor))] # column numbers to show the class changing
str(Teams[[fct_idx[1L]]])
Teams[ , names(.SD) := lapply(.SD, as.character), .SDcols = is.factor]
str(Teams[[fct_idx[1L]]])
```

Lastly, we can use pattern-based matching in `.SDcols` to convert all columns whose names contain `team` back to `factor`:

```{r sd_patterns}
Teams[ , .SD, .SDcols = patterns('team')]
# now convert these columns to factor;
# value = TRUE in grep() is for the LHS of := to
# get column names instead of positions
team_idx = grep('team', names(Teams), value = TRUE)
Teams[ , (team_idx) := lapply(.SD, factor), .SDcols = team_idx]
Teams[ , names(.SD) := lapply(.SD, factor), .SDcols = patterns('team')]
```

** A proviso to the above: _explicitly_ using column numbers (like `DT[ , (1) := rnorm(.N)]`) is bad practice and can lead to silently corrupted code over time if column positions change. Even implicitly using numbers can be dangerous if we don't keep smart/strict control over the ordering of when we create the numbered index and when we use it.
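
To make the risk concrete, here is a small sketch with a made-up table showing how a stored column position can silently point at a different column once the column order changes. The names `dt`, `idx`, and `stale` are illustrative only.

```r
library(data.table)

# Hypothetical table; `idx` records where column "c" currently sits
dt = data.table(a = 1:3, b = c("x", "y", "z"), c = c(0.5, 1.5, 2.5))
idx = which(names(dt) == "c")

# Some later step reorders the columns by reference
setcolorder(dt, c("c", "a", "b"))

# The saved position is now stale: it refers to "b", not "c"
stale = names(dt)[idx]
```

Referring to columns by name, or through `names(.SD)` together with `.SDcols`, avoids this class of mistake because the selection is resolved at the moment of use.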
