Row splitting and frequency table generation in R

I have a brand name column in an R data frame that looks something like this:

    "ABC Industries"
    "ABC Enterprises"
    "123 and 456 Corporation"
    "XYZ Company"

And so on. I am trying to create a frequency table of the words that appear in this column, for example:

    Industries   10
    Corporation  31
    Enterprise   40
    ABC          30
    XYZ          40

I'm relatively new to R, so I was wondering how to do this. Should I split the strings and put every single word in a new column? Is there a way to split a multi-word row into several rows with one word each?

3 answers

If you want, you can do it in one line:

    R> text <- c("ABC Industries", "ABC Enterprises",
    +           "123 and 456 Corporation", "XYZ Company")
    R> table(do.call(c, lapply(text, function(x) unlist(strsplit(x, " ")))))

            123         456         ABC         and     Company 
              1           1           2           1           1 
    Corporation Enterprises  Industries         XYZ 
              1           1           1           1 
    R> 

Here I use strsplit() to split each input element on spaces; each call returns a list, which unlist() flattens into a character vector, and lapply() collects those vectors in a list. I use do.call() to combine them all into a single vector, which is then summarized by table().
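
Since the question is about a data frame column, here is a minimal sketch of the same idea applied to one. The data frame name `brands` and its column `brand` are made up for illustration, and the sorting and the conversion to a data frame are additions, not part of the answer above:

```r
# Hypothetical data frame; `brands` and `brand` are illustrative names.
brands <- data.frame(
  brand = c("ABC Industries", "ABC Enterprises",
            "123 and 456 Corporation", "XYZ Company"),
  stringsAsFactors = FALSE
)

# strsplit() is vectorised: it returns one character vector per row,
# and unlist() flattens them into a single vector of words.
words <- unlist(strsplit(brands$brand, " ", fixed = TRUE))

# Count each word and sort from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)

# Convert to a two-column data frame if that is easier to work with.
freq_df <- as.data.frame(freq, stringsAsFactors = FALSE)
names(freq_df) <- c("word", "n")
```

Because strsplit() already accepts a whole character vector, the lapply()/do.call() wrapper is not strictly needed here; either form should give the same counts.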


Here is another one-liner. It uses paste() to combine all the column entries into one long text string, which is then split and tabulated:

    text <- c("ABC Industries", "ABC Enterprises",
              "123 and 456 Corporation", "XYZ Company")
    table(strsplit(paste(text, collapse=" "), " "))
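
One thing to watch for with this approach: splitting on a single literal space produces empty "words" wherever the data contains doubled or trailing spaces. The sketch below uses a made-up `messy` vector to show the effect, and one possible workaround (an assumption on my part, not something the answer above does) of trimming and splitting on runs of whitespace:

```r
# Hypothetical input with a doubled space and a trailing space.
messy <- c("ABC  Industries", "ABC Enterprises ", "XYZ Company")

# Splitting on a single literal space leaves empty strings behind.
naive <- strsplit(paste(messy, collapse = " "), " ")[[1]]

# trimws() drops leading/trailing whitespace; the regex "\\s+" treats
# any run of whitespace as a single separator.
clean <- strsplit(trimws(paste(messy, collapse = " ")), "\\s+")[[1]]

table(clean)
```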

You can use the tidytext and dplyr packages:

    set.seed(42)
    text <- c("ABC Industries", "ABC Enterprises",
              "123 and 456 Corporation", "XYZ Company")
    data <- data.frame(category = sample(text, 100, replace = TRUE),
                       stringsAsFactors = FALSE)

    library(tidytext)
    library(dplyr)

    data %>%
      unnest_tokens(word, category) %>%
      group_by(word) %>%
      count()
    #> # A tibble: 9 x 2
    #> # Groups:   word [9]
    #>   word            n
    #>   <chr>       <int>
    #> 1 123            29
    #> 2 456            29
    #> 3 abc            45
    #> 4 and            29
    #> 5 company        26
    #> 6 corporation    29
    #> 7 enterprises    21
    #> 8 industries     24
    #> 9 xyz            26
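
For comparison, the one-word-per-row reshaping the question asks about can also be sketched in base R. This is only a rough approximation of what unnest_tokens() does: it lowercases like the output above but splits on whitespace only, and the `id` column is a hypothetical addition so each word can be traced back to its row:

```r
text <- c("ABC Industries", "ABC Enterprises",
          "123 and 456 Corporation", "XYZ Company")
# `id` is an illustrative row identifier, not part of the original data.
data <- data.frame(id = seq_along(text), category = text,
                   stringsAsFactors = FALSE)

# Lowercase first, then split each entry on runs of whitespace.
tokens <- strsplit(tolower(data$category), "\\s+")

# Repeat each row id once per word so every word keeps a link to its row.
long <- data.frame(id   = rep(data$id, lengths(tokens)),
                   word = unlist(tokens),
                   stringsAsFactors = FALSE)

# Word frequencies computed from the long table.
counts <- as.data.frame(table(word = long$word), stringsAsFactors = FALSE)
```

Here `long` has one row per word, and `counts` has a `word` column and a `Freq` column with the number of occurrences.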

Source: https://habr.com/ru/post/1388549/

