Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[misc] cleanups and avoid some unneeded Rcpp::String #1224

Merged
merged 10 commits into from
Dec 28, 2024
Merged

Conversation

JanMarvin
Copy link
Owner

TL;DR: This PR lifts the burden of calling Rcpp::String() for the cc columns r, row_r and c_r, which are filled for every cell of data frame (A1, 1 A). This should bring some speedups.

Back in the day we were troubled by non unicode systems which were unable to convert unicode characters into their locale. The solution is to pass strings through Rcpp::String() which makes sure that the correct locale is picked up. Unfortunately this is quite slow in comparison to std::string() (IIRC every string has to be created in R, contrary to std::string() which does not have to be passed to R). Using Rcpp::String() is not only slow, it also worsens with more data. Ultimately, if Rcpp::String() is called a lot, it will slow things down.

Sadly we will have to remain using Rcpp::String(), for the foreseeable future, because non unicode locales wont go away anytime soon. Maybe another solution is to use enc2utf8() and enc2native(), similarly to how IIRC cpp11 does it, but this would require other code changes and I'm not entirely sure that we will see additional gains.

Still, we do not have to pass every cell of every cc column through Rcpp::String(). There are AFAIK only a few that can actually contain unicode strings and only the ones containing unicode characters, actually benefit from Rcpp::String():
is and maybe ph and in some rarer cases v (only in combination with str, otherwise it should always return integers or numerics) and f (formula using UTF8-strings). Hence, we should treat only these.

We have many more Rcpp::String() calls in the code, but (hopefully) these are not called hundreds of times (unless when parsing the shared string table) and would require further testing in non UTF-8 operating system and R. And I don't have any non UTF-8 system and I don't want to further research this :)

@JanMarvin JanMarvin merged commit a6d173c into main Dec 28, 2024
9 checks passed
@JanMarvin JanMarvin deleted the loadvals_impr branch December 28, 2024 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant