TL;DR: grepl expects its first argument to be a string (length 1), not a vector. You can solve this with combinations of sapply and lapply (see below), but you are better served by a single regular expression that captures what you want to match in df1.MATCH, without using df2.PATTERN at all. This second option is much faster (if less intelligible) for a large data set. For this type of work, it is worth learning how to use regular expressions to their full potential.
df1 %>% filter(grepl(pattern = "^((ABC)( )*)+$", x = df1.MATCH, ignore.case = TRUE))
Explanation
The documentation for grepl shows the following usage:
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
The pattern argument is first, and this argument should be a string (one element). You are providing df1.MATCH to this argument, which is a vector.
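To see what R does with a vector pattern, here is a minimal sketch. The toy values of df1.MATCH and df2.PATTERN are an assumption, chosen to be consistent with the outputs shown in this answer. The tryCatch wrapper simply captures whatever condition your R version signals, so the exact message text may differ.

```r
# Assumed toy data (not taken from the question itself)
df1.MATCH   <- c("ABC", "abc", "BCD")
df2.PATTERN <- c("ABC", "abc", "ABC abc")

# Pass a length-3 vector as the pattern and capture the condition R raises
msg <- tryCatch(
  grepl(df1.MATCH, df2.PATTERN),
  warning = function(w) conditionMessage(w),
  error   = function(e) conditionMessage(e)
)

# Prints the message explaining that pattern has length > 1
print(msg)
```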
We could use sapply to apply grepl to each element of df1.MATCH.
sapply(df1.MATCH, grepl, x = df2.PATTERN)
       ABC   abc   BCD
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,]  TRUE  TRUE FALSE
However, look at the output! You probably did not want a matrix. What happens when we run your grepl on just the first element of df1.MATCH?
grepl("ABC",df2.PATTERN)
[1] TRUE FALSE TRUE
We get a vector because grepl is checking ABC against each element of df2.PATTERN. To get a useful logical vector for filtering, you need to return a logical vector of the same length as df1.MATCH. I see two ways to do it.
Method 1: Use any
Since you want to know which elements in df1.MATCH match any element in df2.PATTERN, you can use any, which returns TRUE if any element of its arguments is TRUE. The syntax needs a small change to make this work: wrap grepl in lapply to build a list of three logical vectors (one for each element in df1.MATCH), then pass that list to sapply with any. If we call any directly on the sapply result, it returns only one value because its input is a single matrix.
any(grepl("ABC", df2.PATTERN))
[1] TRUE
sapply(
lapply(df1.MATCH, grepl, x = df2.PATTERN),
any)
[1] TRUE TRUE FALSE
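The difference between collapsing the matrix and collapsing per element can be sketched as follows. The toy data is again an assumption, and vapply is swapped in here as a type-checked alternative to the sapply/lapply pair; it is not what the answer above uses.

```r
# Assumed toy data (not taken from the question itself)
df1.MATCH   <- c("ABC", "abc", "BCD")
df2.PATTERN <- c("ABC", "abc", "ABC abc")

# sapply alone simplifies to a 3 x 3 logical matrix ...
m <- sapply(df1.MATCH, grepl, x = df2.PATTERN)

# ... so any() collapses the whole matrix into a single value
collapsed <- any(m)

# One logical per element of df1.MATCH; vapply enforces the return type
per_element <- vapply(
  df1.MATCH,
  function(p) any(grepl(p, df2.PATTERN)),
  logical(1),
  USE.NAMES = FALSE
)

print(collapsed)    # TRUE
print(per_element)  # TRUE TRUE FALSE
```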
Method 2: Write a better regular expression.
You want to match the contents of df1.MATCH against possible values that look like abc, ABC, ABC ABC, or ABC abc, etc. You can encompass all of this in a single regex string. The string you want is
"^((ABC)( )*)+$"
^ # Nothing else before this
(ABC) # Must contain ABC together as a group
( )* # followed by any number of spaces (including 0)
((ABC)( )*)+ # Look for the ABC (space) pattern repeated one or more times
$ # And nothing else after it
Then use grepl with ignore.case = TRUE:
grepl("^((ABC)( )*)+$", df1.MATCH, ignore.case = TRUE)
[1] TRUE TRUE FALSE
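A quick way to check the regex against the cases listed above is to run it on a small test vector. The vector `candidates` is a hypothetical name with made-up values, including a few near misses that the anchors should reject.

```r
# Hypothetical test strings: four that should match, three that should not
candidates <- c("ABC", "abc", "ABC ABC", "ABC abc", "BCD", "ABCD", "xABC")

result <- grepl("^((ABC)( )*)+$", candidates, ignore.case = TRUE)

# Show each string next to its match result
print(setNames(result, candidates))
# "ABCD" and "xABC" fail because ^ and $ forbid extra characters
```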
Benchmarking
In a large dataset, one of these will perform faster. Let's find out. Your benchmark results will vary by your machine's resources.
df1.MATCH <- sample(c("ABC", "abc" ,"BCD"), size = 100000, replace = TRUE)
df1 <- data.frame(df1.MATCH)
df2.PATTERN <- c("ABC", "abc", "ABC abc")
library(dplyr) # provides %>% and filter()
library(rbenchmark)
benchmark("any lapply" = {
df1 %>%
filter(sapply(lapply(df1.MATCH, grepl, x=df2.PATTERN), any) )
},
"better regex" = {
df1 %>%
filter(grepl("^((ABC)( )*)+$", df1.MATCH, ignore.case = TRUE))
}
)
          test replications elapsed relative user.self sys.self user.child sys.child
1   any lapply          100  149.13   70.678    147.67     0.39         NA        NA
2 better regex          100    2.11    1.000      2.10     0.02         NA        NA
It looks like the improved regex method is significantly faster: about 70 times faster in this run (2.11 s versus 149.13 s elapsed over 100 replications). That's because it is performing only one operation per row (grepl) before filtering. The other method is performing four operations per row: lapply is performing grepl three times (one for each element of df2.PATTERN), and sapply then performs any for each list element (each row).
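If you do not have dplyr or rbenchmark installed, a rough comparison can be sketched in base R with system.time. This is an assumption-laden stand-in for the benchmark above: it uses a smaller sample (10,000 rows), a single run per method, and base subsetting instead of filter, so the timings will not match the table.

```r
set.seed(1)  # reproducible sample
df1.MATCH   <- sample(c("ABC", "abc", "BCD"), size = 10000, replace = TRUE)
df1         <- data.frame(df1.MATCH, stringsAsFactors = FALSE)
df2.PATTERN <- c("ABC", "abc", "ABC abc")

# Method 1: one grepl call per row, then any() per list element
time_any <- system.time(
  keep_any <- sapply(lapply(df1$df1.MATCH, grepl, x = df2.PATTERN), any)
)

# Method 2: a single vectorized grepl call over the whole column
time_regex <- system.time(
  keep_regex <- grepl("^((ABC)( )*)+$", df1$df1.MATCH, ignore.case = TRUE)
)

# Both methods select the same rows for this data
print(identical(keep_any, keep_regex))

# Elapsed seconds for each method (values depend on your machine)
print(c(any_lapply = time_any[["elapsed"]], better_regex = time_regex[["elapsed"]]))

# Base R equivalent of the filter step
filtered <- df1[keep_regex, , drop = FALSE]
```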