How to split a phrase into words using Regex in C#

I am trying to split a sentence / phrase into words using Regex.

 var phrase = "This isn't a test.";
 var words = Regex.Split(phrase, @"\W+").ToList();

The result contains "This", "isn", "t", "a", "test".

Obviously, it treats the apostrophe as a non-word character and splits on it. Can I change this behavior? It also needs to be multilingual, supporting various languages (Spanish, French, Russian, Korean, etc.).

I need to pass the words to a spell checker, specifically NHunspell.

 return (from word in words let correct = _engine[langId].Spell(word) where !correct select word).ToList(); 
+6
8 answers

If your goal is spell checking, this is a good solution:

 new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*") 

Basically, you can call Regex.Split with the regex above. It uses Unicode character classes, so it will work for many languages (though not for most Asian ones). And it will not break words that contain apostrophes or hyphens.
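
A minimal sketch of how this regex could be wired up, in case it helps. The class and method names (UnicodeWordSplitter, SplitWords) are made up for illustration. One thing to be aware of: the separator only matches around a Unicode space, so punctuation at the very start or end of the phrase (like the final period) stays attached to the word. The sketch trims that off in a second step, which is my own addition and not part of the regex itself.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UnicodeWordSplitter
{
    // Separator: one Unicode separator (\p{Z}) plus any adjacent non-letters.
    private static readonly Regex Separator =
        new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*");

    // Assumption: leading/trailing non-letters should not reach the spell checker.
    private static readonly Regex EdgeNonLetters =
        new Regex(@"^[^\p{L}]+|[^\p{L}]+$");

    public static List<string> SplitWords(string phrase)
    {
        return Separator.Split(phrase)
            .Select(w => EdgeNonLetters.Replace(w, ""))
            .Where(w => w.Length > 0) // drop empty entries
            .ToList();
    }
}
```

With this, SplitWords("This isn't a test.") should give "This", "isn't", "a", "test". Note that digits next to a space are swallowed by the separator, which is probably fine for spell checking but may not suit other uses.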

+7

Use Split().

 words = phrase.Split(' '); 

To drop punctuation as well:

 words = phrase.Split(new char[] {' ', ',', '.', ':', ';', '!', '?', '\t'}); 
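
One caveat with splitting on a character array: adjacent separators (for example a comma followed by a space, or a trailing period) produce empty strings in the result. A small sketch of one way around that, using StringSplitOptions.RemoveEmptyEntries; the class and method names are hypothetical:

```csharp
using System;

public static class SimpleSplitter
{
    private static readonly char[] Separators =
        { ' ', ',', '.', ':', ';', '!', '?', '\t' };

    public static string[] SplitWords(string phrase)
    {
        // RemoveEmptyEntries drops the empty strings produced by
        // consecutive separators and by a separator at the end.
        return phrase.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Since the apostrophe is not in the separator list, "isn't" stays in one piece. The list is ASCII-only, though, so punctuation from other scripts would have to be added by hand.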
+4

Because a number of languages use very complex rules for combining words into phrases and sentences, you cannot rely on a simple regular expression to get all the words from a piece of text. Even for a language as "simple" as English, you will run into a number of corner cases, such as:

  • How to handle words like you're and isn't, where two words are combined and some characters are replaced by an apostrophe.
  • How to handle abbreviations such as Mrs. and i.e.
  • Words joined with '-'.
  • Words hyphenated at the end of a line.

Chinese and Japanese (among others) are notoriously difficult to parse this way, since these languages do not use spaces between words, only between sentences.

You might want to read up on text segmentation and, if segmentation is important to you, invest in a spell checker that can parse a whole text, or in a text segmentation engine that can split your sentences into words according to the rules of the language.

Unfortunately, I could not find a .NET-based multilingual segmentation engine with a quick Google search.

+3

It doesn't really seem like you need a regular expression. You could just do:

 phrase.Split(' '); 
+1

What do you want to split on? Spaces? Punctuation? You need to decide what the stop characters are. A simple regular expression that uses whitespace and a few punctuation marks would be "[^.?!\s]+". This treats periods, question marks, exclamation marks, and any whitespace characters as word boundaries.
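
Because that pattern is a negated character class, it describes the words rather than the gaps between them, so it fits Regex.Matches better than Regex.Split. A rough sketch, with hypothetical class and method names:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenMatcher
{
    // A token is a run of characters that are not '.', '?', '!' or whitespace.
    private static readonly Regex Token = new Regex(@"[^.?!\s]+");

    public static List<string> FindWords(string phrase)
    {
        return Token.Matches(phrase)
            .Cast<Match>()          // MatchCollection is non-generic on older frameworks
            .Select(m => m.Value)
            .ToList();
    }
}
```

For "This isn't a test." this should return "This", "isn't", "a", "test". Any stop character not listed in the class (a comma, for instance) stays glued to its word, so the class needs extending to match whatever you decide the stop characters are.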

+1

You can try this if you want to split on spaces only.

 var words = Regex.Split(phrase, @" +").ToList(); 

Another approach is to keep the apostrophe by excluding it from the separator character class.

 var words = Regex.Split(phrase, @"[^\w']+").ToList(); 

Otherwise, is there a specific reason why you cannot use string.Split()? It would seem much simpler. You can also pass in other punctuation marks (i.e., split on periods as well as spaces).

 var words = phrase.Split(' ');
 // or, to split on periods as well:
 var words = phrase.Split(new char[] {' ', '.'});
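
For the regex route, here is a small sketch of one way to keep apostrophes inside words: split on runs of characters that are neither word characters nor apostrophes. The class and method names are invented for the example, and filtering out empty entries is my own addition (Regex.Split returns an empty string when the input ends with a separator).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ApostropheSplitter
{
    // Separator: anything that is neither a word character (\w) nor an apostrophe.
    private static readonly Regex Separator = new Regex(@"[^\w']+");

    public static List<string> SplitWords(string phrase)
    {
        return Separator.Split(phrase)
            .Where(w => w.Length > 0) // drop the empty entry after trailing punctuation
            .ToList();
    }
}
```

In .NET, \w is Unicode-aware by default, so this should also cope with Spanish, French, or Russian text. It does keep digits and underscores, and a quoted word like 'hello' keeps its quotes, so it is only a starting point.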
+1

I am not a Java person, but you could try to exclude punctuation while splitting on whitespace at the same time. Something like this, maybe.

These are raw regular expressions in extended (free-spacing) mode; the words are in capture group 1.
Do a global search.

Unicode (does not account for graphemes)

 [\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* ) 

ASCII

 [\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* ) 
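
A hedged sketch of how the Unicode variant might be adapted for .NET. Two changes are my assumptions about what is needed there: the short forms \pL, \pN and \pP are written out as \p{L}, \p{N} and \p{P}, and RegexOptions.IgnorePatternWhitespace stands in for extended mode. The POSIX [:punct:] class in the ASCII variant has no direct .NET equivalent, so it is not covered here. Names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GroupMatcher
{
    // Skips leading whitespace/punctuation, then captures a word in group 1.
    // Punctuation inside a word is kept only when more word text follows it.
    private static readonly Regex Word = new Regex(
        @"[\s\p{P}]* ( [\p{L}\p{N}_-] (?: [\p{L}\p{N}_-] | \p{P}(?=[\p{L}\p{N}\p{P}_-]) )* )",
        RegexOptions.IgnorePatternWhitespace);

    public static List<string> FindWords(string phrase)
    {
        // Regex.Matches performs the "global search" over the whole input.
        return Word.Matches(phrase)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }
}
```

FindWords("This isn't a test.") should give "This", "isn't", "a", "test": the apostrophe is kept because a letter follows it, while the final period is dropped because nothing does.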
0

This worked for me: [^(\d|\s|\W)]*

0

Source: https://habr.com/ru/post/913681/

