I just found the qdapRegex package for R, part of the larger family of qdap packages that Jason Gray and Tyler Rinker have put together to support text munging and processing for discourse analysis and similar work. There’s a lot in there: four packages, namely the regex set (qdapRegex), some tools (qdapTools), dictionaries (qdapDictionaries), and qdap proper for qualitative analysis (pre-)processing.

The regex package alone seems worth pulling down and keeping local. It has some nice convenience functions for removing phone numbers, names, tags, zip codes, etc. from text in R.
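
As a rough sketch of what that looks like, assuming the `rm_zip()` and `rm_phone()` helpers and their `extract` argument behave as described in the package documentation (the sample sentence is made up for illustration):

```r
library(qdapRegex)

# Made-up sample text containing a phone number and a US zip code
txt <- "Call 613-213-4567 or write to Springfield, IL 62704 today."

# Strip the zip code first, then the phone number, from the text
no_zip   <- rm_zip(txt)
no_phone <- rm_phone(no_zip)

# With extract = TRUE the matches are returned instead of removed
zips   <- rm_zip(txt, extract = TRUE)
phones <- rm_phone(txt, extract = TRUE)

print(no_phone)
print(zips)
print(phones)
```

Each helper takes a character vector, so the same calls should work on a whole column of text at once.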

The authors have built in a dependency on knitr for documentation, probably to promote a good publishing solution and to reduce the maintenance overhead. I just pulled down qdap 2.2.2 from CRAN and it compiled from source happily until the final push, which opened a warning on my deck saying that I needed to install Java 1.6, which is already EOL. I haven’t yet found what in the installer is calling for this, but I will try to get back to it. This needs to be fixed.

Besides that, there is much to praise in these packages and I look forward to exploring them further. For those who haven’t pulled them down, I’m posting this because it is just another nice little convenience:

> library(qdapRegex)
> cheat()
1 Lookahead (?=foo) What follows is `foo`
2 Lookbehind (?<=foo) What precedes is `foo`
3 Negative Lookahead (?!foo) What follows is not `foo`
4 Negative Lookbehind (?<!foo) What precedes is not `foo`
...
19 >= 0 (Greedy) x* Match 0 or more times greedy
20 >= 0 (Lazy) x*? Match 0 or more times lazy
21 >= 1 (Greedy) x+ Match 1 or more times greedy
22 >= 1 (Lazy) x+? Match 1 or more times lazy
23 Exactly N x{4} Match N times
24 Min-Max x{4,8} Match min-max times
25 > N x{9,} Match N or more times
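
To see a few of these constructs in action, here is a small sketch in base R only (no qdapRegex needed). The sample strings and variable names are made up for illustration; `perl = TRUE` is what switches on the PCRE engine that supports lookarounds.

```r
# Made-up sample strings for illustration
x <- "price: $42, cost: $7, code 99"
y <- "<b>bold</b> and <i>italic</i>"

# Helper: return every match of a PCRE pattern in a single string
all_matches <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Lookahead: words that are followed by a colon
before_colon <- all_matches("\\w+(?=:)", x)

# Lookbehind: digits that are preceded by a dollar sign
after_dollar <- all_matches("(?<=\\$)\\d+", x)

# Negative lookbehind: digit runs not preceded by `$` or another digit
no_dollar <- all_matches("(?<![$\\d])\\d+", x)

# Greedy vs. lazy: `.+` grabs as much as it can, `.+?` as little as it can
greedy <- all_matches("<.+>", y)
lazy   <- all_matches("<.+?>", y)

# Exactly N: only the four-digit string matches
four_digits <- grepl("^\\d{4}$", c("123", "1234", "12345"))

print(before_colon)  # "price" "cost"
print(after_dollar)  # "42" "7"
print(no_dollar)     # "99"
print(greedy)        # the whole string, from the first `<` to the last `>`
print(lazy)          # "<b>" "</b>" "<i>" "</i>"
print(four_digits)   # FALSE TRUE FALSE
```

The greedy/lazy pair is the one I trip over most often: the only difference is the trailing `?`, but it changes one long match into four short ones.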


