Separate speaker and dialogue in RStudio

I have documents such as:

President Dr. Norbert Lammert: I declare the session open.

I now give the floor to the Bundesminister Alexander Dobrindt.

(Applause CDU / CSU and SPD Delegates)

Alexander Dobrindt, Minister for Transport and Digital Infrastructure:

Ladies and Gentlemen. Today we will begin the largest investment in infrastructure that has ever existed: more than 270 billion euros, more than 1,000 projects and a clear prospect of financing.

(Volker Kauder [CDU / CSU]: Genau!)

(Applause CDU / CSU and SPD)

When I read these .txt documents, I would like to create a second column containing the name of the speaker.

So I first tried to create a list of all possible names and mark each of them with a leading @.

    library(qdap)

    members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:",
                 "President Dr. Norbert Lammert:")
    members_r <- c("@Alexander Dobrindt, Minister for Transport and Digital Infrastructure:",
                   "@President Dr. Norbert Lammert:")

    prok <- scan(".txt", what = "character", sep = "\n")
    prok <- mgsub(members, members_r, prok)
    prok <- as.data.frame(prok)
    prok$speaker <- grepl("@[^\\@:]*:", prok$prok, ignore.case = TRUE)

My plan was then to extract the name between @ and : via regex wherever speaker is TRUE and fill it down until the next name appears (and, obviously, to remove all the applause and other parenthetical remarks), but this is where I don't know how to proceed.

3 answers

Here is an approach that relies heavily on dplyr.

First, I added a sentence to your sample text to illustrate why we cannot just use a colon to identify speaker names.

    # One string per line of the transcript, joined with newlines
    sampleText <- paste(
      "President Dr. Norbert Lammert: I declare the session open.",
      "I will now give the floor to Bundesminister Alexander Dobrindt.",
      "(Applause of CDU/CSU and delegates of the SPD)",
      "Alexander Dobrindt, Minister for Transport and Digital Infrastructure:",
      "Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.",
      "(Volker Kauder [CDU/CSU]: Genau!)",
      "(Applause of the CDU/CSU and the SPD)",
      "This sentence right here: it is an example of a problem",
      sep = "\n"
    )

Then I split the text on line breaks to mimic the format you seem to be reading it in (this also puts each speech into its own element of a list).

    splitText <- strsplit(sampleText, "\n")

Then I pull out all the potential speakers (everything that precedes a colon) so that we can review them:

    library(dplyr)  # provides %>% and the verbs used further down

    allSpeakers <- lapply(splitText, function(thisText){
      grep(":", thisText, value = TRUE) %>%
        gsub(":.*", "", .) %>%
        gsub("\\(", "", .)
    }) %>%
      unlist() %>%
      unique()

Which gives us:

    [1] "President Dr. Norbert Lammert"
    [2] "Alexander Dobrindt, Minister for Transport and Digital Infrastructure"
    [3] "Volker Kauder [CDU/CSU]"
    [4] "This sentence right here"

Obviously, the last entry is not a legitimate name, so it should be excluded from our list of speakers:

    legitSpeakers <- allSpeakers[-4]
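
Dropping an entry by position is fragile if the input changes. As a minimal sketch of the same exclusion by value, using base R's setdiff(): the vector below simply repeats the output shown above, and notSpeakers is a hypothetical name introduced only for this illustration.

```r
allSpeakers <- c(
  "President Dr. Norbert Lammert",
  "Alexander Dobrindt, Minister for Transport and Digital Infrastructure",
  "Volker Kauder [CDU/CSU]",
  "This sentence right here"
)

# Hypothetical vector: entries reviewed by hand and judged not to be speakers
notSpeakers <- c("This sentence right here")

# setdiff() keeps the order of its first argument and drops the listed values
legitSpeakers <- setdiff(allSpeakers, notSpeakers)
legitSpeakers
```

This way the exclusion list still works if more lines containing colons show up and the positions shift.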

Now we are ready to work through the speech. I have included step-by-step comments in the code below instead of describing each step in the text here.

    speechText <- lapply(splitText, function(thisText){

      # Remove applause and interjections (things in parentheses)
      # along with any blank lines; though you could leave blanks if you want
      cleanText <- grep("(^\\(.*\\)$)|(^$)", thisText,
                        value = TRUE, invert = TRUE)

      # Split each line by a colon
      strsplit(cleanText, ":") %>%
        lapply(function(x){
          # Check if the first element is a legit speaker
          if(x[1] %in% legitSpeakers){
            # If so, set the speaker, and put the statement in a separate portion
            # taking care to re-collapse any breaks caused by additional colons
            out <- data.frame(speaker = x[1],
                              text = paste(x[-1], collapse = ":"),
                              stringsAsFactors = FALSE)
          } else{
            # If not a legit speaker, set speaker to NA and reset text as above
            out <- data.frame(speaker = NA_character_,
                              text = paste(x, collapse = ":"),
                              stringsAsFactors = FALSE)
          }
          # Return whichever version we made above
          return(out)
        }) %>%
        # Bind all of the rows together
        bind_rows %>%
        # Identify clusters of speech that go with a single speaker
        mutate(speakingGroup = cumsum(!is.na(speaker))) %>%
        # Group by those clusters
        group_by(speakingGroup) %>%
        # Collapse that speaking down into a single row
        summarise(speaker = speaker[1],
                  fullText = paste(text, collapse = "\n"))
    })

This gives:

    [[1]]
    speakingGroup  speaker                                                                fullText
    1              President Dr. Norbert Lammert                                          I declare the session open.\nI will now give the floor to Bundesminister Alexander Dobrindt.
    2              Alexander Dobrindt, Minister for Transport and Digital Infrastructure  Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.\nThis sentence right here: it is an example of a problem
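
The grouping step is the core of this answer, so here is a small base-R sketch of the same idea on a toy vector. The data are made up, and tapply() is used here only as a stand-in for the group_by/summarise step, not as the answer's own method.

```r
# Toy speaker column: a name where a new speaker starts, NA on continuation lines
speaker <- c("A", NA, NA, "B", NA)
text    <- c("one", "two", "three", "four", "five")

# !is.na(speaker) is TRUE exactly where a new speaker starts;
# cumsum() turns those starts into a running group id
speakingGroup <- cumsum(!is.na(speaker))

# Base-R stand-in for group_by() + summarise(): one row per group
collapsed <- data.frame(
  speakingGroup = unique(speakingGroup),
  speaker  = as.vector(tapply(speaker, speakingGroup, function(s) s[1])),
  fullText = as.vector(tapply(text, speakingGroup, paste, collapse = "\n")),
  stringsAsFactors = FALSE
)
collapsed
```

Every line belonging to a speaker gets the same group id, which is what lets the text be collapsed per speaker.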

If you prefer to have each line of text separately, replace the summarise at the end with mutate(speaker = speaker[1]) and you will get one line for each line of speech, for example:

    | speaker | text | speakingGroup |
    | President Dr. Norbert Lammert | I declare the session open. | 1 |
    | President Dr. Norbert Lammert | I will now give the floor to Bundesminister Alexander Dobrindt. | 1 |
    | Alexander Dobrindt, Minister for Transport and Digital Infrastructure | | 2 |
    | Alexander Dobrindt, Minister for Transport and Digital Infrastructure | Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective. | 2 |
    | Alexander Dobrindt, Minister for Transport and Digital Infrastructure | This sentence right here: it is an example of a problem | 2 |

Here is an approach:

    library(qdap)

    # text is the document text
    # remove round brackets and the text between them
    a <- bracketX(text, "round")

    names <- c("President Dr. Norbert Lammert", "Alexander Dobrindt")
    searchString <- paste(names[1], names[2], sep = ".+")

    # Get string from names[1] till names[2] with the help of searchString
    string <- regmatches(a, regexpr(searchString, a))

    # remove names[2] from string
    string <- gsub(names[2], "", string)

This code can be put in a loop if there are more than two names.
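
As a rough sketch of how this could scale to any number of names: instead of looping over pairs, the variant below swaps in a different technique, building one alternation pattern from all names and slicing the text between consecutive matches with base R's gregexpr() and substring(). The sample string is made up, it assumes parenthetical remarks were already removed, and it assumes at least one name occurs in the text.

```r
# Made-up input, with parenthetical remarks already stripped
a <- paste(
  "President Dr. Norbert Lammert: I declare the session open.",
  "Alexander Dobrindt: Ladies and Gentlemen.",
  "President Dr. Norbert Lammert: Thank you."
)
names <- c("President Dr. Norbert Lammert", "Alexander Dobrindt")

# One alternation pattern covering every known name, followed by a colon
pattern <- paste0("(", paste(names, collapse = "|"), "):")

# Start position and length of every speaker label in the text
m      <- gregexpr(pattern, a)[[1]]
starts <- as.integer(m)
lens   <- attr(m, "match.length")

# Each speech runs from the end of its label to the start of the next label
ends <- c(starts[-1] - 1L, nchar(a))
speeches <- data.frame(
  speaker = substring(a, starts, starts + lens - 2L),  # drop the trailing colon
  text    = trimws(substring(a, starts + lens, ends)),
  stringsAsFactors = FALSE
)
speeches
```

Requiring the colon after the name keeps mentions of a speaker inside someone else's speech from being treated as a new turn. Note that characters such as "." in the names are interpreted as regex metacharacters here, which is harmless for these names but may need escaping for others.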


This seems to work:

    library(qdap)

    members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:",
                 "President Dr. Norbert Lammert:")
    members_r <- c("@Alexander Dobrindt, Minister for Transport and Digital Infrastructure:",
                   "@President Dr. Norbert Lammert:")

    testprok <- read.table("txt", header = FALSE, quote = "\"", comment.char = "", sep = "\t")

    testprok$V1 <- mgsub(members, members_r, testprok$V1)
    testprok$V2 <- ifelse(grepl("@[^\\@:]*:", testprok$V1), testprok$V1, NA)

    #### function from http://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value
    repeat.before = function(x) {   # repeats the last non NA value. Keeps leading NA
      ind = which(!is.na(x))        # get positions of nonmissing values
      if(is.na(x[1]))               # if it begins with a missing, add the
        ind = c(1, ind)             # first position to the indices
      rep(x[ind], times = diff(     # repeat the values at these indices
        c(ind, length(x) + 1) ))    # diffing the indices + length yields how often
    }                               # they need to be repeated

    testprok$V2 = repeat.before(testprok$V2)
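
To see the fill-down step in isolation, here is a small sketch on a made-up vector. The function body is the one quoted above; the last line, which reduces each filled value to the bare name between @ and the first colon, is an added assumption about what the final speaker column should contain.

```r
# Same helper as above: repeats the last non-NA value, keeps a leading NA
repeat.before <- function(x) {
  ind <- which(!is.na(x))
  if (is.na(x[1])) ind <- c(1, ind)
  rep(x[ind], times = diff(c(ind, length(x) + 1)))
}

# Made-up column: the marked line where a speaker starts, NA elsewhere
v2 <- c(
  NA,
  "@President Dr. Norbert Lammert: I declare the session open.",
  NA,
  NA,
  "@Alexander Dobrindt, Minister for Transport and Digital Infrastructure:",
  NA
)

# Carry each marked line down until the next one appears
filled <- repeat.before(v2)

# Keep only the name between "@" and the first colon (NA stays NA)
speaker <- sub("^@([^@:]*):.*$", "\\1", filled)
speaker
```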

Source: https://habr.com/ru/post/1013089/

