Here's an approach that relies heavily on dplyr.
First, I added a sentence to your sample text to illustrate why we cannot just use a colon to identify speaker names.
sampleText <- paste(
  "President Dr. Norbert Lammert: I declare the session open.",
  "I will now give the floor to Bundesminister Alexander Dobrindt.",
  "(Applause of CDU/CSU and delegates of the SPD)",
  "Alexander Dobrindt, Minister for Transport and Digital Infrastructure:",
  "Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.",
  "(Volker Kauder [CDU/CSU]: Genau!)",
  "(Applause of the CDU/CSU and the SPD)",
  "This sentence right here: it is an example of a problem",
  sep = "\n"
)
Then I split the text on newlines to mimic the format you seem to be reading in (this also puts each speech in its own element of the list).
splitText <- strsplit(sampleText, "\n")
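Then I pull out all the potential speakers (everything that precedes a colon):
library(dplyr)  # provides the %>% pipe used below

allSpeakers <- lapply(splitText, function(thisText) {
  grep(":", thisText, value = TRUE) %>%  # keep only lines that contain a colon
    gsub(":.*", "", .) %>%               # drop the colon and everything after it
    gsub("\\(", "", .)                   # strip the opening parenthesis of interjections
}) %>%
  unlist() %>%
  unique()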
This gives us:
[1] "President Dr. Norbert Lammert"
[2] "Alexander Dobrindt, Minister for Transport and Digital Infrastructure"
[3] "Volker Kauder [CDU/CSU]"
[4] "This sentence right here"
Obviously, the last one is not a legitimate name, so it should be excluded from our list of speakers:
legitSpeakers <- allSpeakers[-4]
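Dropping by position breaks as soon as the input changes, so here is a small sketch of excluding the false positive by value instead. The vector is repeated so the snippet runs on its own, and the name notSpeakers is hypothetical:

```r
# Candidate speakers as extracted above (repeated here so the snippet is self-contained)
allSpeakers <- c(
  "President Dr. Norbert Lammert",
  "Alexander Dobrindt, Minister for Transport and Digital Infrastructure",
  "Volker Kauder [CDU/CSU]",
  "This sentence right here"
)

# Hypothetical hand-maintained list of strings that are not speaker names
notSpeakers <- c("This sentence right here")

# setdiff() keeps the order of allSpeakers and drops the listed entries
legitSpeakers <- setdiff(allSpeakers, notSpeakers)
```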
Now we are ready to work through each speech. I have included step-by-step comments in the code below instead of describing the steps in the text here:
speechText <- lapply(splitText, function(thisText){
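A minimal sketch of how such a function could look, written as a standalone function so that it runs on its own. It is a reconstruction under assumptions, not necessarily the original code: it assumes that lines starting with "(" are stage directions or interjections to drop, and that a speech starts on a line beginning with a known speaker followed by a colon. The names findSpeaker, extractSpeeches, demoLines and demoSpeakers are hypothetical.

```r
library(dplyr)

# Hypothetical helper: return the known speaker a line starts with,
# or NA if the line does not open a new speech.
findSpeaker <- function(line, speakers) {
  hit <- speakers[startsWith(line, paste0(speakers, ":"))]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Hypothetical function: turn the lines of one session into one row per speech.
extractSpeeches <- function(thisText, speakers) {
  # Assumption: parenthesised lines are applause or interjections, so drop them
  spoken <- thisText[!grepl("^\\(", thisText)]

  data.frame(line = spoken, stringsAsFactors = FALSE) %>%
    mutate(
      # Speaker named at the start of the line, NA for continuation lines
      speaker = vapply(line, findSpeaker, character(1),
                       speakers = speakers, USE.NAMES = FALSE),
      # Every line that names a speaker starts a new group
      speakingGroup = cumsum(!is.na(speaker)),
      # Remove the "Name:" prefix from lines that open a speech
      text = ifelse(is.na(speaker), line,
                    trimws(substring(line, nchar(speaker) + 2)))
    ) %>%
    group_by(speakingGroup) %>%
    # Collapse each group into a single row, skipping empty lines
    summarise(
      speaker = speaker[1],
      fullText = paste(text[text != ""], collapse = "\n")
    )
}

# Small demo input, shortened from the sample text
demoLines <- c(
  "President Dr. Norbert Lammert: I declare the session open.",
  "I will now give the floor to Bundesminister Alexander Dobrindt.",
  "(Applause of CDU/CSU and delegates of the SPD)",
  "Alexander Dobrindt, Minister for Transport and Digital Infrastructure:",
  "Ladies and Gentleman.",
  "This sentence right here: it is an example of a problem"
)
demoSpeakers <- c(
  "President Dr. Norbert Lammert",
  "Alexander Dobrindt, Minister for Transport and Digital Infrastructure"
)
demoResult <- extractSpeeches(demoLines, demoSpeakers)
```

With the objects built earlier, the equivalent call would be lapply(splitText, extractSpeeches, speakers = legitSpeakers). Because only lines that begin with a known speaker open a new group, the "This sentence right here: ..." line stays inside the preceding speech.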
This gives:
[[1]]
speakingGroup | speaker | fullText
1 | President Dr. Norbert Lammert | I declare the session open.\nI will now give the floor to Bundesminister Alexander Dobrindt.
2 | Alexander Dobrindt, Minister for Transport and Digital Infrastructure | Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.\nThis sentence right here: it is an example of a problem
If you prefer to keep each line of text separate, replace the summarise at the end with mutate(speaker = speaker[1]) and you will get one row for each line of speech, for example:
speaker | text | speakingGroup
President Dr. Norbert Lammert | I declare the session open. | 1
President Dr. Norbert Lammert | I will now give the floor to Bundesminister Alexander Dobrindt. | 1
Alexander Dobrindt, Minister for Transport and Digital Infrastructure | | 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure | Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective. | 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure | This sentence right here: it is an example of a problem | 2