[misc] cleanups and avoid some unneeded Rcpp::String
#1224
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
TL;DR: This PR lifts the burden of calling
Rcpp::String()
for thecc
columnsr
,row_r
andc_r
, which are filled for every cell of data frame (A1
,1
A
). This should bring some speedups.Back in the day we were troubled by non unicode systems which were unable to convert unicode characters into their locale. The solution is to pass strings through
Rcpp::String()
which makes sure that the correct locale is picked up. Unfortunately this is quite slow in comparison tostd::string()
(IIRC every string has to be created in R, contrary tostd::string()
which does not have to be passed to R). UsingRcpp::String()
is not only slow, it also worsens with more data. Ultimately, ifRcpp::String()
is called a lot, it will slow things down.Sadly we will have to remain using
Rcpp::String()
, for the foreseeable future, because non unicode locales wont go away anytime soon. Maybe another solution is to useenc2utf8()
andenc2native()
, similarly to how IIRCcpp11
does it, but this would require other code changes and I'm not entirely sure that we will see additional gains.Still, we do not have to pass every cell of every
cc
column throughRcpp::String()
. There are AFAIK only a few that can actually contain unicode strings and only the ones containing unicode characters, actually benefit fromRcpp::String()
:is
and maybeph
and in some rarer casesv
(only in combination withstr
, otherwise it should always return integers or numerics) andf
(formula using UTF8-strings). Hence, we should treat only these.We have many more
Rcpp::String()
calls in the code, but (hopefully) these are not called hundreds of times (unless when parsing the shared string table) and would require further testing in non UTF-8 operating system and R. And I don't have any non UTF-8 system and I don't want to further research this :)