I am trying to clean up some text strings so that I can analyze the script information cleanly. In this data, the information in square brackets represents location or blocking notes for the script.
I would like to remove everything in square brackets: the brackets themselves and all the characters they contain. The wrench in the works is that, since the data was originally entered by hand, not every bracketed note actually has a closing bracket. So I would like to match:
- an opening bracket [
- any character except a closing bracket, 0 or more times
- then either a closing bracket or the newline indicator \n
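For reference, here is a minimal sketch of roughly that rule. It makes two assumptions that are not in my attempts below: it passes perl = TRUE, so that an escaped bracket inside [...] is read the PCRE way, and it stops the note at the first line break instead of consuming the newline. The names x, note_pattern and cleaned are only illustrative.

```r
# Abbreviated sample string from this question.
x <- "[Bridge]\r\r\n\r\r\n SPOCK: Check the circuit. \r\r\n [Pike Quarters \r\r\n BOYCE: Boyce here.\r\r\n"

# An opening bracket, then any run of characters that are not "]", CR or LF,
# then an optional closing bracket (so unclosed notes end at the line break).
# Assumption: perl = TRUE, so \\] inside [...] is treated as an escaped bracket.
note_pattern <- "\\[[^\\]\\r\\n]*\\]?"

# Replace every bracketed note, closed or not, with a single space.
cleaned <- gsub(note_pattern, " ", x, perl = TRUE)
```

Because the line break itself is never consumed, the dialogue that follows an unclosed note is left in place.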
Sample data, one very long string (abbreviated by me). Usually each string will be an entire script episode:
"[Bridge]\r\r\n\r\r\n SPOCK: Check the circuit. \r\r\n [Pike Quarters \r\r\n BOYCE: Boyce here.\r\r\n"
I tried several gsub permutations, mostly along these lines:
df$script <- gsub("\\[[^\\]]*[\\]|\\n]", " ", testdf$script)
Which, I believe, should capture:
\\[ an open bracket
[^\\]]* any character except for a closed bracket, 0 or more times
[\\]|\\n] either a closed bracket, or a new line metachar
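One way to check what a pattern really captures is to list its matches instead of replacing them. The sketch below assumes perl = TRUE (PCRE), which is not what the gsub calls in this question use, and the names x, attempt and hits are only illustrative. Under that assumption the negated class also consumes line breaks, so the unclosed note appears to run on to the last newline in the string.

```r
# Abbreviated sample string from this question.
x <- "[Bridge]\r\r\n\r\r\n SPOCK: Check the circuit. \r\r\n [Pike Quarters \r\r\n BOYCE: Boyce here.\r\r\n"

# The pattern as described above, unchanged.
attempt <- "\\[[^\\]]*[\\]|\\n]"

# gregexpr() finds every match; regmatches() extracts the matched text.
# Assumption: perl = TRUE, which is not used in the gsub() calls in this question.
hits <- regmatches(x, gregexpr(attempt, x, perl = TRUE))[[1]]

# Show exactly what each match swallowed, with escapes visible.
print(hits)
```

Printing the matches makes it easier to tell whether a pattern matches nothing at all or matches more than intended.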
but I come up empty every time. I tried other variations of this gsub line, since my regex-fu is what is holding me back. All of them left my string unchanged:
df$script <- gsub("\\[[^\\]]*[\\]\\n]", " ", testdf$script)
df$script <- gsub("\\[[^\\]]*[\\]|\\n]", " ", testdf$script)
df$script <- gsub("\\[[^\\]]*[\\](\\n)]", " ", testdf$script)
df$script <- gsub("\\[[^\\]]*[\\]|(\\n)]", " ", testdf$script)
I know that regex'ing scraped HTML is likely to earn me a stink face; unfortunately, this is the only tool I have for dealing with these strings. I have had varying degrees of success with these patterns in regex testers for other languages, but something about R's gsub is not on board with how I am trying to handle metacharacters. Any advice would be greatly appreciated.