String Tokenization Using Regular Expressions in JavaScript

Suppose I have a long string containing newline and tab characters:

var x = "This is a long string.\n\t This is another one on next line."; 

So, how can we split this string into tokens using a regular expression?

I don't want to use .split(' ') because I want to learn Javascript Regex.

A more complex line might be:

 var y = "This @is a #long $string. Alright, lets split this."; 

Now I want to extract from this line only valid words without special characters and punctuation, that is, I want:

 var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"];
 var ywords = ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"];
6 answers

Here is a jsfiddle example of what you requested: http://jsfiddle.net/ayezutov/BjXw5/1/

Basically, the code is very simple:

 var y = "This @is a #long $string. Alright, lets split this.";
 // One or more non-whitespace characters; the g flag finds every match, not just the first
 var regex = /[^\s]+/g;
 var match = y.match(regex);
 for (var i = 0; i < match.length; i++) {
     document.write(match[i]);
     document.write('<br>');
 }

UPDATE: You can expand the list of separator characters: http://jsfiddle.net/ayezutov/BjXw5/2/

 var regex = /[^\s\.,!?]+/g; 

UPDATE 2: Only word characters (letters, digits, and underscores): http://jsfiddle.net/ayezutov/BjXw5/3/

 var regex = /\w+/g; 
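
Applied to both strings from the question, the \w+ pattern could be sketched as follows. The helper name `tokenize` is hypothetical and not part of the original answer. Because \w also matches digits and underscores, a stricter class such as [A-Za-z]+ could be swapped in if only letters are wanted.

```javascript
// Hypothetical helper (an assumption for illustration): returns every run of
// word characters, or an empty array when nothing matches.
function tokenize(text) {
  // With the g flag, match() returns null when there is no match at all.
  return text.match(/\w+/g) || [];
}

const x = "This is a long string.\n\t This is another one on next line.";
const y = "This @is a #long $string. Alright, lets split this.";

const xwords = tokenize(x);
const ywords = tokenize(y);

console.log(xwords);
console.log(ywords);
```

The `|| []` fallback keeps later code such as `xwords.length` safe when the input contains no word characters.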

Use \s+ to tokenize the string.
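
A minimal sketch of that suggestion, assuming the first string from the question. The variable name `tokensBySpace` is illustrative only. Splitting on whitespace alone leaves punctuation attached to the words, so this does not yet produce the word list the question asks for.

```javascript
const x = "This is a long string.\n\t This is another one on next line.";

// trim() first so leading or trailing whitespace cannot produce empty tokens;
// \s+ treats any run of spaces, tabs, and newlines as a single separator.
const tokensBySpace = x.trim().split(/\s+/);

console.log(tokensBySpace);
```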


exec can iterate over the matches, skipping the non-word characters (\W) between them.

 var A = [],
     str = "This @is a #long $string. Alright, let's split this.",
     rx = /\W*([a-zA-Z][a-zA-Z']*)(\W+|$)/g,
     words;
 // Each exec call resumes after the previous match; group 1 holds the word
 while ((words = rx.exec(str)) != null) {
     A.push(words[1]);
 }
 A.join(', ');
 /* returned value: (String)
 This, is, a, long, string, Alright, let's, split, this
 */
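
 // Split on runs of non-alphanumeric characters; filter(Boolean) drops the empty string left by the trailing "."
 var words = y.split(/[^A-Za-z0-9]+/).filter(Boolean);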

To extract only word characters, use the \w character class. Whether it matches Unicode characters depends on the implementation, so check the documentation for your language or library. In JavaScript, \w matches only ASCII letters, digits, and the underscore.
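
The difference can be seen in a short sketch. It assumes an engine with ES2018 Unicode property escapes (\p{L} together with the u flag), which the original answer does not mention; the variable names are illustrative.

```javascript
// normalize("NFC") makes sure accented letters are single precomposed code points.
const text = "Et en français ? Avec des caractères spéciaux".normalize("NFC");

// \w is ASCII-only in JavaScript, so accented letters break words apart.
const asciiWords = text.match(/\w+/g) || [];

// \p{L} matches any Unicode letter; it requires the u flag.
const unicodeWords = text.match(/\p{L}+/gu) || [];

console.log(asciiWords);
console.log(unicodeWords);
```

With \w+ the word "français" is split into "fran" and "ais", while \p{L}+ keeps it whole.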

See Alexander Yezutov’s answer (update 2) for how to apply this in an expression.


Here is a solution using regular expression groups to tokenize text using various types of tokens.

You can test the code here: https://jsfiddle.net/u3mvca6q/5/

 /*
 Basic Regex explanation:
 /        Regex start
 (\w+)    First group, words
          \w means ASCII letter; \w+ means 1 or more letters
 |        or
 (,|!)    Second group, punctuation
 |        or
 (\s)     Third group, white spaces
 /        Regex end
 g        "global", enables looping over the string to capture one element at a time

 Regex result:
 result[0] : default group : any match
 result[1] : group1 : words
 result[2] : group2 : punctuation , !
 result[3] : group3 : whitespace
 */
 var basicRegex = /(\w+)|(,|!)|(\s)/g;

 /*
 Advanced Regex explanation:
 [a-zA-Z\u0080-\u00FF] instead of \w
     Supports some Unicode letters instead of ASCII letters only.
     Find Unicode ranges here https://apps.timwhitlock.info/js/regex
 (\.\.\.|\.|,|!|\?)
     Identify ellipsis (...) and points as separate entities
 You can improve it by adding ranges for special punctuation and so on
 */
 var advancedRegex = /([a-zA-Z\u0080-\u00FF]+)|(\.\.\.|\.|,|!|\?)|(\s)/g;

 var basicString = "Hello, this is a random message!";
 var advancedString = "Et en français ? Avec des caractères spéciaux ... With one point at the end.";

 console.log("------------------");
 var result = null;
 do {
     // exec returns the next match, or null once the string is exhausted
     result = basicRegex.exec(basicString);
     console.log(result);
 } while (result != null);

 console.log("------------------");
 var result = null;
 do {
     result = advancedRegex.exec(advancedString);
     console.log(result);
 } while (result != null);

 /*
 Output (basic string):
 Array [ "Hello", "Hello", undefined, undefined ]
 Array [ ",", undefined, ",", undefined ]
 Array [ " ", undefined, undefined, " " ]
 Array [ "this", "this", undefined, undefined ]
 Array [ " ", undefined, undefined, " " ]
 Array [ "is", "is", undefined, undefined ]
 Array [ " ", undefined, undefined, " " ]
 Array [ "a", "a", undefined, undefined ]
 Array [ " ", undefined, undefined, " " ]
 Array [ "random", "random", undefined, undefined ]
 Array [ " ", undefined, undefined, " " ]
 Array [ "message", "message", undefined, undefined ]
 Array [ "!", undefined, "!", undefined ]
 null
 */
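
The same group-based idea could also be written with named capture groups and matchAll, so that each token carries its type instead of a group index. This is only a sketch under assumptions: it needs an engine with ES2018 named groups and ES2020 String.prototype.matchAll, and the names `tokenPattern`, `typedTokens`, `word`, `punct`, and `space` are hypothetical, not taken from the answer above.

```javascript
// Hypothetical pattern mirroring the basic regex, with a name per group.
const tokenPattern = /(?<word>\w+)|(?<punct>[,!])|(?<space>\s)/g;

function typedTokens(text) {
  const tokens = [];
  // matchAll iterates over every match without mutating the pattern's lastIndex.
  for (const m of text.matchAll(tokenPattern)) {
    // Exactly one named group is defined for each match; find it.
    const [type, value] = Object.entries(m.groups).find(([, v]) => v !== undefined);
    tokens.push({ type, value });
  }
  return tokens;
}

const typed = typedTokens("Hello, this!");
console.log(typed);
```

Characters that match none of the groups, such as "?" in this pattern, are simply skipped, just as in the exec loop above.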

Source: https://habr.com/ru/post/1385478/