Count the number of all words in a string

Question

Is there a function to count the number of words in a string? For example:

str1 <- "How many words are in this sentence"

should return the result 7.

+63

string r word-count

John Jan 19 '12

17 answers

Use the regular expression symbol \\W to match non-word characters, use + to indicate one or more in a row, and use gregexpr to find all matches in a string. The number of words is the number of word separators plus 1.

 lengths(gregexpr("\\W+", str1)) + 1

This fails for strings with leading or trailing whitespace, and when a "word" does not fit \\W's notion of a non-word character (you could work with other regular expressions, such as \\S+ or [[:alpha:]], but there will always be edge cases with a regex approach). It is probably more efficient than strsplit solutions, which allocate memory for each word. Regular expressions are described in ?regex .

Update: As noted in the comments and in another answer by @Andri, the approach fails with zero-word (empty) and one-word strings, and with trailing punctuation:

 str1 = c("", "x", "x y", "x y!" , "x y! z")
 lengths(gregexpr("[A-z]\\W+", str1)) + 1L
 # [1] 2 2 2 3 3

Many of the other answers also fail in these or similar cases (e.g., multiple spaces). I think the caveat about the "notion of a word" in my original answer covers the punctuation problems (solution: choose a different regular expression, e.g., [[:space:]]+ ), but the zero-word and one-word cases remain a problem; @Andri's solution cannot distinguish between zero words and one word. So, taking a "positive" approach to finding words, one might use

 sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))

Leading to

 sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0)) # [1] 0 1 2 2 3

Again, the regular expression can be refined for different concepts of the word.
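To illustrate how the choice of pattern changes what counts as a word, here is a minimal base-R sketch. The helper name count_matches and the sample vector are made up for this illustration and are not part of the answer above.

```r
# Hypothetical helper: count regex matches per string.
count_matches <- function(x, pattern) {
  # gregexpr() reports "no match" as -1, so count only positive positions
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

x <- c("", "x", "x y", "x y!", "one, two 4")

alpha_counts    <- count_matches(x, "[[:alpha:]]+")  # runs of letters only
nonspace_counts <- count_matches(x, "\\S+")          # runs of non-whitespace
wordchar_counts <- count_matches(x, "\\w+")          # letters, digits, underscore

alpha_counts     # 0 1 2 2 2
nonspace_counts  # 0 1 2 2 3
wordchar_counts  # 0 1 2 2 3
```

The patterns only disagree on the last string, where "4" is a word for \\S+ and \\w+ but not for [[:alpha:]]+.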

I like using gregexpr() because it is memory efficient. An alternative that uses strsplit() (like @user813966, but with a regular expression to delimit words) and keeps the original notion of word delimiters is

 lengths(strsplit(str1, "\\W+")) # [1] 0 1 2 2 3

This needs to allocate new memory for each word created, and for the intermediate list of words. That can be relatively expensive when the data is "big", but it is probably efficient and understandable enough for most purposes.
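One edge case of the strsplit() variant is worth checking: a separator at the very start of a string produces an empty first piece, which inflates the count. The sketch below shows this and one possible workaround (counting only non-empty pieces); the sample vector and variable names are illustrative assumptions.

```r
x <- c(" a b", "a b ", "", "a")

# A leading separator yields an empty first element, so " a b" counts as 3
raw_counts <- lengths(strsplit(x, "\\W+"))

# Possible workaround: count only the non-empty pieces
clean_counts <- vapply(strsplit(x, "\\W+"),
                       function(w) sum(nzchar(w)),
                       integer(1))

raw_counts    # 3 2 0 1
clean_counts  # 2 2 0 1
```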

+65

Martin Morgan Jan 19 '12 at 2:15

The simplest way:

 require(stringr)
 str_count("one, two three 4,,,, 5 6", "\\S+")

... counting all sequences of non-space characters ( \\S+ ).

But what about a small function that also lets us decide which kind of words to count, and that works on whole vectors?

 require(stringr)

 nwords <- function(string, pseudo = FALSE) {
   # pseudo = TRUE counts any run of non-space characters as a word
   pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
   str_count(string, pattern)
 }

 nwords("one, two three 4,,,, 5 6")
 # 3
 nwords("one, two three 4,,,, 5 6", pseudo = TRUE)
 # 6

+35

petermeissner Oct. 16 '14 at 14:22

I use the str_count function from the stringr library with the escape sequence \w, which represents:

any "word" character (letter, digit or underscore in the current locale: in UTF-8 mode only ASCII letters and digits are considered)

Example:

 > str_count("How many words are in this sentence", '\\w+')
 [1] 7

Of the 9 other answers that I was able to test, only two (by Vincent Zoonekynd and petermeissner) worked for all the inputs presented here so far, but they also require stringr .

But only this solution works with all the inputs presented so far, as well as inputs such as "foo+bar+baz~spam+eggs" or "Combien de mots sont dans cette phrase ?".

Benchmark:

 library(stringr)

 questions <- c(
   "", "x", "x y", "x y!", "x y! z",
   "foo+bar+baz~spam+eggs",
   "one, two three 4,,,, 5 6",
   "How many words are in this sentence",
   "How many words are in this sentence",
   "Combien de mots sont dans cette phrase ?",
   " Day after day, day after day, We stuck, nor breath nor motion; "
 )

 answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)

 # number of inputs for which f returns the expected word count
 score <- function(f) sum(unlist(lapply(questions, f)) == answers)

 funs <- c(
   function(s) sapply(gregexpr("\\W+", s), length) + 1,
   function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
   function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
   function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
   function(s) length(str_match_all(s, "\\S+")[[1]]),
   function(s) str_count(s, "\\S+"),
   function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
   function(s) length(unlist(strsplit(s," "))),
   function(s) sapply(strsplit(s, " "), length),
   function(s) str_count(s, '\\w+')
 )

 unlist(lapply(funs, score))

Output:

 6 10 10 8 9 9 7 6 6 11
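For readers who would rather avoid the stringr dependency, the same kind of count can be approximated in base R with gregexpr() and regmatches(). This is a sketch under that assumption, not the answer's own method; count_w is a made-up name.

```r
# Base-R stand-in for str_count(): extract all matches, then count them.
count_w <- function(x, pattern = "\\w+") {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

x <- c("foo+bar+baz~spam+eggs",
       "Combien de mots sont dans cette phrase ?",
       "")

word_counts  <- count_w(x)           # runs of word characters
token_counts <- count_w(x, "\\S+")   # runs of non-whitespace

word_counts   # 5 7 0
token_counts  # 1 8 0
```

With \\S+ the first string is a single token and the free-standing "?" in the second is counted as a word, which is why \\w+ does better on these inputs.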

+22

arekolek Jun 27 '16 at 15:37

 str2 <- gsub(' {2,}',' ',str1)
 length(strsplit(str2,' ')[[1]])

gsub(' {2,}',' ',str1) ensures that all words are separated by only one space, replacing all occurrences of two or more spaces with one space.

strsplit(str2,' ') splits the sentence at each space and returns the result in a list. [[1]] extracts the vector of words from that list. length counts how many words there are.

 > str1 <- "How many words are in this sentence"
 > str2 <- gsub(' {2,}',' ',str1)
 > str2
 [1] "How many words are in this sentence"
 > strsplit(str2,' ')
 [[1]]
 [1] "How" "many" "words" "are" "in" "this" "sentence"
 > strsplit(str2,' ')[[1]]
 [1] "How" "many" "words" "are" "in" "this" "sentence"
 > length(strsplit(str2,' ')[[1]])
 [1] 7
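The snippet above handles one string at a time because of the [[1]]. A possible vectorised variant, which also trims leading and trailing whitespace, might look like the following sketch. The function name count_by_space is hypothetical, and trimws() and lengths() assume R 3.2.0 or later.

```r
# Hypothetical vectorised variant of the gsub() + strsplit() idea.
count_by_space <- function(x) {
  # collapse runs of spaces to one space, then strip the ends
  cleaned <- trimws(gsub(" {2,}", " ", x))
  lengths(strsplit(cleaned, " ", fixed = TRUE))
}

space_counts <- count_by_space(c("How many  words", "  padded  text ", ""))
space_counts  # 3 2 0
```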

+15

mathematical.coffee Jan 19 '12 at 2:01

You can use str_match_all , with a regular expression that identifies your words. The following works with leading, trailing and duplicated spaces.

 library(stringr)
 s <- " Day after day, day after day, We stuck, nor breath nor motion; "
 m <- str_match_all( s, "\\S+" ) # Sequences of non-spaces
 length(m[[1]])

+13

Vincent Zoonekynd Jan 19 '12

Try this function from the stringi package:

 > require(stringi)
 > s <- c("Lorem ipsum dolor sit amet, consectetur adipisicing elit.",
 +        "nibh augue, suscipit a, scelerisque sed, lacinia in, mi.",
 +        "Cras vel lorem. Etiam pellentesque aliquet tellus.",
 +        "")
 > stri_stats_latex(s)
     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs
           133             0            30            24             0             0

+11

bartektartanus Mar 14 '14 at 9:54

You can use the wc function in the qdap library:

 > str1 <- "How many words are in this sentence"
 > wc(str1)
 [1] 7

+7

yuqian Mar 20 '16 at 15:51

You can remove repeated spaces and count the number of " " in the string to get the number of words. Use str_count from stringr and rm_white from qdapRegex:

 str_count(rm_white(s), " ") + 1

+6

Murali Menon Mar 03 '16 at 9:16

Try this:

 length(unlist(strsplit(str1," ")))

+5

Sangram Jul 04 '14 at 6:38

Solution 7 does not give the correct result when there is only one word. You should not just count the elements in the result of gregexpr (which is -1 if there is no match), but count the elements > 0.

Ergo:

 sapply(gregexpr("\\W+", str1), function(x) sum(x>0) ) + 1
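The reason length() misleads here can be seen by inspecting what gregexpr() returns when nothing matches. A small sketch with made-up inputs:

```r
m <- gregexpr("\\W+", c("x", "x y"))

no_match_value  <- as.integer(m[[1]])  # -1: no separator found in "x"
no_match_length <- length(m[[1]])      # 1, even though nothing matched
real_matches_1  <- sum(m[[1]] > 0)     # 0 separators in "x"
real_matches_2  <- sum(m[[2]] > 0)     # 1 separator in "x y"
```

Because the "no match" result is a length-one vector holding -1, counting positive positions is what separates zero separators from one.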

+4

Andri Dec 19 '12 at 9:14

There is also the straightforward stri_count_words function from the stringi package:

 stringi::stri_count_words(str1) #[1] 7

+1

Sotos Jun 29 '18 at 10:05

 require(stringr)

Define a very simple function:

 str_words <- function(sentence) { str_count(sentence, " ") + 1 }

Check:

 str_words("This is a sentence with six words")

+1

JDie Nov 30 '18 at 18:37

Use nchar

If the vector of strings is called x:

 (nchar(x) - nchar(gsub(' ','',x))) + 1

Find the number of spaces, then add one
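Since this counts separators rather than words, it is worth seeing how it behaves on repeated or leading spaces and on an empty string. A quick sketch, with the wrapper name count_spaces_plus_one and the inputs chosen purely for illustration:

```r
# Wrapper around the nchar() expression, for illustration only.
count_spaces_plus_one <- function(x) {
  nchar(x) - nchar(gsub(" ", "", x, fixed = TRUE)) + 1
}

x <- c("one two", "one  two", " one two", "")
nchar_counts <- count_spaces_plus_one(x)
nchar_counts  # 2 3 3 1
```

Only the first string gets the expected count; every extra space adds one, and the empty string is reported as one word.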

0

Jonny Jan 06 '15 at 16:50

 require(stringr)
 str_count(x,"\\w+")

This handles double or triple spaces between words.

All other answers have problems with more than one space between words.

0

CJunk Aug 25 '17 at 22:45

Using the stringr package, you can also write a simple script that processes a vector of strings, for example with a for loop.

Let's say

 df$text

contains a vector of strings that we are interested in analyzing. First, we add extra columns to the existing data frame df, as shown below:

 df$strings = as.integer(NA)
 df$characters = as.integer(NA)

Then we run a for loop over the vector of strings, as shown below:

 for (i in seq_len(nrow(df))) {
   df$strings[i] = str_count(df$text[i], '\\S+') # counts the strings
   df$characters[i] = str_count(df$text[i])      # counts the characters & spaces
 }

The resulting columns, strings and characters, will contain the word and character counts for every element of the vector of strings.
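The loop is not strictly needed, because the counting functions involved are vectorised. As a sketch of that idea in base R (swapping str_count for gregexpr() with regmatches(), and using a made-up data frame):

```r
# Illustrative data frame; the column names follow the answer above.
df <- data.frame(text = c("one two three", "four  five", ""),
                 stringsAsFactors = FALSE)

# Count runs of non-whitespace for every row at once
df$strings <- lengths(regmatches(df$text, gregexpr("\\S+", df$text)))

# nchar() is vectorised too and counts characters including spaces
df$characters <- nchar(df$text)

df$strings     # 3 2 0
df$characters  # 13 10 0
```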

0

Sadiaz Mar 12 '19 at 0:32

I found the following function and regular expression useful for word counts, especially when dealing with single versus double hyphens. A single hyphen should generally not count as a word break (e.g., well-known, hi-fi), whereas a double hyphen is a punctuation delimiter that is not bounded by whitespace, as used for parenthetical remarks.

 txt <- "Don't you think e-mail is one word--and not two!" # 10 words
 words <- function(txt) {
   length(attributes(gregexpr("(\\w|\\w\\-\\w|\\w\\'\\w)+", txt)[[1]])$match.length)
 }
 words(txt) # 10 words

stringi is a useful package, but in this example it over-counts the words because of the hyphen.

 stringi::stri_count_words(txt) #11 words
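The same idea can be written so that it works on a whole vector and returns 0 for empty strings. The sketch below uses a different, simplified pattern (word characters optionally joined by single hyphens or apostrophes) rather than the regex above; the function name and pattern are assumptions made for illustration.

```r
# Hypothetical vectorised counter: a word is a run of word characters,
# optionally continued by a single ' or - followed by more word characters.
count_words_hyphen <- function(x) {
  pattern <- "\\w+(['-]\\w+)*"
  vapply(gregexpr(pattern, x, perl = TRUE),
         function(m) sum(m > 0),
         integer(1))
}

txt <- "Don't you think e-mail is one word--and not two!"
hyphen_counts <- count_words_hyphen(c(txt, "hi-fi", ""))
hyphen_counts  # 10 1 0
```

A double hyphen still splits words here, because the optional group requires a word character directly after a single hyphen.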

0

Soren Mar 28 '19 at 11:15

AVSuresh · Accepted Answer · 2012-07-17 04:46

You can use strsplit and sapply

 sapply(strsplit(str1, " "), length)
