Character sets #9
Comments
Hi @dmi3kno I'm going to try and summarize just to make sure I understand this all. Regarding the challenges:
Regarding the solution:
# old
x <- rx_word_edge() %>%
  rx_alpha() %>%
  rx_one_or_more() %>%
  rx_word_edge()
# new
x <- rx_word_edge() %>%
  rx_alpha(rep = "any") %>%
  rx_word_edge()
With the latest version of library(RVerbalExpressions)
rx_word_char <- function(.data = NULL, value = NULL) {
  if(missing(value))
    return(paste0(.data, "\\w"))
  paste0(.data, sanitize(value))
}
rx_group <- function(.data = NULL, value) {
  paste0(.data, "[", value, "]")
}
rx_any_of <- function(.data = NULL, value, ...) {
  if(missing(...))
    return(paste0(.data, "[", sanitize(value), "]"))
  paste0(.data, "[", value, sanitize(...), "]")
}
rx_literal <- function(.data = NULL, value) {
  paste0(.data, value)
}
x <- rx_word_edge() %>%
  rx_any_of(rx_word_char(), ".%+-") %>%
  rx_one_or_more() %>%
  rx_literal("@") %>%
  rx_any_of(rx_word_char(), ".-") %>%
  rx_one_or_more() %>%
  rx_word_char(".") %>%
  rx_alpha() %>%
  rx_count(n = 2:6) %>%
  rx_word_edge()
txt <- "This text contains email first.last@gmail.com and noname@post.io. The latter is no longer valid."
stringr::str_extract_all(txt, x)[[1]]
#> [1] "first.last@gmail.com" "noname@post.io"
Looking at that long pipe makes the
Sorry for the messy post. I was writing it and contributing new functions at the same time, so it reflects the evolution of my own thinking. I will be more consistent going forward.
# the following is equivalent to `[a-zA-Z]*?`
rx_alpha(rep="any", mode="lazy")
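To make the `rep`/`mode` idea concrete, here is a minimal sketch of how such arguments could be translated into a quantifier. The helper `rx_quantifier()` is hypothetical, the `"greedy"` default for `mode` is an assumption, and the `rx_alpha()` below is a simplified stand-in rather than the package's implementation.

```r
# Hypothetical helper (not part of the package): map `rep` and `mode`
# to a regex quantifier suffix.
rx_quantifier <- function(rep = c("one", "some", "any"),
                          mode = c("greedy", "lazy")) {
  rep <- match.arg(rep)
  mode <- match.arg(mode)
  q <- switch(rep, one = "", some = "+", any = "*")
  # "lazy" only applies when there is a quantifier to modify
  if (mode == "lazy" && nzchar(q)) q <- paste0(q, "?")
  q
}

# Simplified stand-in for rx_alpha(), assuming it emits [a-zA-Z]
rx_alpha <- function(.data = NULL, rep = "one", mode = "greedy") {
  paste0(.data, "[a-zA-Z]", rx_quantifier(rep, mode))
}

rx_alpha(rep = "any", mode = "lazy")
#> [1] "[a-zA-Z]*?"
```

With this mapping, `rep = "some"` would correspond to what `rx_one_or_more()` does today, and the default `rep = "one"` leaves the expression unquantified.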
rx_one_of <- function(.data = NULL, ... ) {
  args <- sapply(list(...), function(x) if(inherits(x, "rx_string")) x else sanitize(x))
  args_str <- Reduce(paste0, args)
  paste0(.data, "[", args_str, "]")
}
This would require a custom class to be output by every one of our functions:
rx_word_char <- function(.data = NULL) {
  res <- paste0(.data, "\\w")
  class(res) <- unique(c("rx_string", class(res))) # to avoid accidental double "classing"
  res
}
rx_literal <- function(.data=NULL, value) {
  res <- paste0(.data, sanitize(value))
  class(res) <- unique(c("rx_string", class(res))) # to avoid accidental double "classing"
  res
}
But then you can do things like:
rx() %>%
  rx_one_of(rx_word_char(), rx_literal(value="?"), "abc")
#> [1] "[\\w\\?abc]"
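For anyone who wants to try this without the package installed, the sketch below reproduces the double-escaping problem and the class-based fix in plain R. The `sanitize()` here is a hypothetical stand-in that escapes common regex metacharacters; the package's own `sanitize()` may behave differently. `rx_one_of()` is also slightly simplified (it uses `vapply()` and `unclass()`).

```r
# Hypothetical stand-in for the package's sanitize(): backslash-escape
# regex metacharacters (including the backslash itself).
sanitize <- function(x) {
  gsub("([\\\\.|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Output is tagged so downstream functions know it is already a regex.
rx_word_char <- function(.data = NULL) {
  res <- paste0(.data, "\\w")
  class(res) <- unique(c("rx_string", class(res)))
  res
}

# Only plain character input is sanitized; rx_string input is passed through.
rx_one_of <- function(.data = NULL, ...) {
  args <- vapply(list(...), function(x) {
    if (inherits(x, "rx_string")) unclass(x) else sanitize(x)
  }, character(1))
  paste0(.data, "[", paste0(args, collapse = ""), "]")
}

# Naive approach: everything is sanitized, so "\\w" gets escaped again.
naive <- paste0("[", sanitize(rx_word_char()), sanitize(".-"), "]")
naive
#> [1] "[\\\\w\\.-]"

# Class-aware approach: the rx_string is left alone.
aware <- rx_one_of(.data = NULL, rx_word_char(), ".-")
aware
#> [1] "[\\w\\.-]"

grepl(naive, "a", perl = TRUE)  # FALSE: the class is now backslash, "w", ".", "-"
grepl(aware, "a", perl = TRUE)  # TRUE: "\\w" still means "word character"
```

Note that `.data` is passed by name here because there is no pipe supplying it; inside a `%>%` chain the first argument is filled in automatically.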
100% agree, I would rather have an intuitive API that does less rather than a somewhat clunky API that can do a whole lot. Given the number of functions that have been added, I wonder if a vignette covering common regex use cases and which functions to use would be helpful?
rx_one_of <- function(.data = NULL, ... ) {
  args <- sapply(list(...), function(x) if(inherits(x, "rx_string")) x else sanitize(x))
  args_str <- Reduce(paste0, args)
  paste0(.data, "[", args_str, "]")
}
Looks great, I haven't done much or any programming using ellipses but this looks much more elegant! Very excited about this.
To do here:
Problem
I think the package will be incomplete until we find a way to express groups of characters. Here's a challenge to express email pattern matching in `rx`:
Challenges
First of all, I don't know of a way to express a single "word" character (`alnum` + `_`). We used `rx_word` to denote `\\w+`, and perhaps it should have been `rx_word_char() %>% rx_one_or_more()`.
I also extended `rx_count` to handle ranges of input.
We also don't have a way to express word boundaries (`\\b`), and it might be useful to denote them. We shall call this function `rx_word_edge`.
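As a rough illustration of the two helpers mentioned above, here is a minimal sketch of a range-aware `rx_count()` and of `rx_word_edge()`. These are assumptions about how they could be written, not the package's actual code. In particular, turning `n = 2:6` into `{2,6}` via `min()` and `max()` is one possible reading of "ranges of input".

```r
# Sketch only: a word boundary is just "\\b" appended to the expression.
rx_word_edge <- function(.data = NULL) {
  paste0(.data, "\\b")
}

# Sketch only: a single number gives {n}, a range such as 2:6 gives {2,6}.
rx_count <- function(.data = NULL, n = 1) {
  if (length(n) > 1) {
    return(paste0(.data, "{", min(n), ",", max(n), "}"))
  }
  paste0(.data, "{", n, "}")
}

# Build "a whole word of 2 to 6 letters" step by step (no pipe needed).
x <- rx_word_edge()
x <- paste0(x, "[[:alpha:]]")
x <- rx_count(x, n = 2:6)
x <- rx_word_edge(x)
x
#> [1] "\\b[[:alpha:]]{2,6}\\b"

txt <- "a to com museum international"
regmatches(txt, gregexpr(x, txt, perl = TRUE))[[1]]
#> [1] "to"     "com"    "museum"
```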
Finally, our biggest problem is that there's no way to express groups of characters, other than through `rx_any_of()`, but if we pass other `rx` expressions, values will be sanitized twice, meaning that we will get four backslashes before each symbol instead of two.
Solution
Here's what it looks like when we put all pieces together:
The code works but I don't like it.
`rx` look redundant (I believe there's a way to get rid of it entirely using a specialized class, see below).
`rx_one_or_more()` is referring to. I wonder if all functions should have a `rep` argument with default option `one` and options `some`/`any`, in addition to what `rx_count` does today.
Should `rx_char()` without arguments be called `rx_wordchar`?
Should `rx_char()` with arguments be called `rx_literal()` or `rx_plain`?
`rx_group` is an artificial construct, a duplicate of `rx_any_of`, but without sanitization. Here I see a couple of solutions.
a. Allow "nested pipes" (as I have done above). Create an S3 class and this way detect when the type of the `value` argument is not character, but `rx_string`. Input of this class does not need to be sanitized, because it has been sanitized at creation.
b. Do not allow "nested pipes". Instead define `rx_any_of()` to have `...` and allow multiple arguments mixing functions and characters. Then a hypothetical pipe would look like this:
It's a lot to digest, but somehow everything is related to one particular problem. Happy to split the issue once we identify the issues worth tackling.