
R newbie here. I'm doing some text analysis using the quanteda package. Basically, what I'm trying to do is collect all the words that match the regex pattern child|(care), i.e. capture any text which includes either of the words "child" or "care".
To do this, I can create a list and then use the dictionary function:
library(quanteda)
childcare_list <- c("child", "care")
word_dict <- dictionary(list(childcare = childcare_list))
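For reference, here is roughly how I then apply the dictionary (the sample texts below are just made-up illustrations):

# made-up sample texts, just to show how the dictionary gets applied
txt <- c(doc1 = "Child care services have reopened.",
         doc2 = "The care home hired new staff.")
toks <- tokens(txt)

# count dictionary matches per document; the default valuetype is "glob",
# and tokens_lookup() also accepts valuetype = "regex" for pattern values
dfm(tokens_lookup(toks, dictionary = word_dict))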

However, how could I incorporate regex and do this for other patterns which would be tedious to type up manually as in the first line above? For example, I may want to capture something like
\bC\w?V\w?D\-19, which captures possible typos of "COVID-19", e.g. "CiVID-19" or "CpVID-19".
I could of course do covid_list <- c("CiVID-19", "CpVID-19", ...), but that would be too manual, and it also doesn't use the \b anchor.
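For a very restricted slice of that pattern (say, exactly one lowercase letter in each \w? slot), I could generate the variants programmatically, e.g.:

# enumerate only one restricted slice of the pattern: exactly one lowercase
# letter in each wildcard slot (illustrative only; \w? allows many more cases)
slots <- expand.grid(first = letters, second = letters, stringsAsFactors = FALSE)
covid_list <- paste0("C", slots$first, "V", slots$second, "D-19")
head(covid_list)
length(covid_list)  # already 26 * 26 = 676 strings, and still no \b

But that already balloons, still ignores digits, uppercase letters and the underscore, and still doesn't give me the \b behaviour, so I'm hoping there's a cleaner way.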

Basically, I'm asking whether it's possible to make a list that contains all possible combinations matched by a regex.


1 Answer


This doesn't seem like a great task for regex--even your pattern would miss very close typos like COWID-19 or potential OCR mistakes like C0VID-I9. Instead, I'd suggest using the stringdist package to do fuzzy matching, perhaps stringdist::afind to find approximate matches of "COVID-19". You can read a bit about it here.

This will let you select from a variety of string distance algorithms and set a maximum distance. You could then, e.g., correct matches to "COVID-19" and proceed with your analysis.
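For instance, here is a minimal sketch using stringdist::stringdist() rather than afind(), just to show the shape of the idea (the sample tokens, the "osa" method, and the threshold of 2 edits are my own choices):

library(stringdist)

# hypothetical tokens standing in for words from the corpus
toks <- c("COVID-19", "CiVID-19", "C0VID-I9", "childcare", "COWID-19")

# edit distance to the canonical spelling (optimal string alignment)
d <- stringdist(toks, "COVID-19", method = "osa")

# treat anything within 2 edits as a match and correct it
toks[d <= 2] <- "COVID-19"
toks
#> [1] "COVID-19" "COVID-19" "COVID-19" "childcare" "COVID-19"

After that kind of correction, the quanteda dictionary side of things only needs the single literal value "COVID-19".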

  • Thanks! Originally my question was essentially: if I have a list of strings, how can I incorporate regex to extend that list to detect every single possible combination of that regex? (I'm still quite curious whether this is possible.) But I think your suggestion fits my purpose much better. Just so I'm not mistaken: would I apply this fuzzy matching to the original data set (the one I was originally running the regex on), correct it, and then just use covid_list <- c("COVID-19") (so the corrected data set would no longer contain "COWID-19")? – user116883 Apr 30 '21 at 05:52
  • The issue is that "every single possible combination of that regex" isn't generally possible, and even in your simple example the result would be huge. `\bC` matches a C preceded by a non-word character (or the start of the string). If we assume 8-bit extended ASCII, a small character set, there are 256 characters, of which 10 digits + 26 lowercase letters + 26 uppercase letters + the underscore = 63 "word characters". So there are 256 - 63 = 193 possibilities for the character in front of the C. Extend the pattern to `\bC\w` and there are 193 * 63 ~= 12000 strings that would match. Adding another `\w` brings us up to roughly 770,000 possible strings.... – Gregor Thomas Apr 30 '21 at 14:25
  • So even such a simple regex as yours explodes into very large numbers. But many regex patterns are way more complicated, and this is assuming extended ASCII; if you extend to UTF-8 the possible combinations are vastly larger. So I don't think there's a lot of generalizability or utility in such a tool, and I doubt you'll find a production-grade implementation. – Gregor Thomas Apr 30 '21 at 14:28
  • So yeah, use `stringdist` to find fuzzy matches, then replace the matches you find with "COVID-19". – Gregor Thomas Apr 30 '21 at 14:30