How can I make a regex that takes into account accented characters?

Question

I have a JavaScript regular expression that basically finds two-letter words. The problem is that it interprets accented characters as word boundaries. Indeed, it seems that

The word boundary ("\b") is the spot between two characters that has a "\w" on one side of it and a "\W" on the other side (in any order), counting the imaginary characters off the beginning and end of the string as matching "\W". (source: "AS3 RegExp for matching words with boundary-type characters in them")

And since

\w matches any alphanumeric character, including underscore (short for [a-zA-Z0-9_]). \W matches any non-word character (short for [^a-zA-Z0-9_]). (source: http://www.javascriptkit.com/javatutors/redev2.shtml)

clearly accented characters are not taken into account. This becomes a problem with words like Montréal. If é is treated as a word boundary, then al is a two-letter word. I tried to write my own definition of a word boundary that would allow accented characters, but since a word boundary is not even a character, I don't know exactly how to match it.

Any help?

Here is the relevant JavaScript code, which searches userInput for two-letter words using the re_state regular expression:

 var re_state = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");
 var match_state = re_state.exec(userInput);
 document.getElementById("state").value = (match_state) ? match_state[1] : "";
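The behaviour described above can be reproduced with a minimal sketch. The sample string is made up for illustration, and the variable names are not from the original code:

```javascript
// Same pattern as re_state above; the input string is only an example.
const reState = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");

// \w is ASCII-only, so "é" counts as a non-word character and
// a word boundary appears on both sides of it.
const isWordChar = /\w/.test("é");
console.log(isWordChar); // false

// The "al" that follows "é" is therefore reported as a two-letter word.
const matchState = reState.exec("Montréal");
console.log(matchState ? matchState[1] : ""); // "al"
```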

javascript regex diacritics word-boundary

Shawn, Sep 12 '10 at 4:28

2 answers

Alan Moore · Answer 1 · 2010-09-12T07:27:22+0000

Although JavaScript regexes recognize non-ASCII characters in some cases (such as \s), they are hopelessly inadequate when it comes to \w and \b. If you want those to work with anything other than ASCII characters, you will have to either use a different language or install Steve Levithan's XRegExp library with the Unicode plugin.
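Since this answer was written, ECMAScript 2018 has added Unicode property escapes (\p{L} with the u flag) and lookbehind, which offer another route on engines that support them. The following is only a sketch under that assumption, not the approach the answer proposes; reStateUnicode and findTwoLetterWord are hypothetical names introduced for illustration:

```javascript
// Assumes an ES2018+ engine (Unicode property escapes and lookbehind).
// \p{L} matches any Unicode letter; the lookarounds stand in for \b by
// requiring that no letter sits directly before or after the two letters.
const reStateUnicode = /(?<!\p{L})(\p{L}{2})(?!\p{L})/u;

// Hypothetical helper: returns the first two-letter word, or "" if none.
function findTwoLetterWord(text) {
  // NFC composes "e" + combining accent into a single "é" letter.
  const match = reStateUnicode.exec(text.normalize("NFC"));
  return match ? match[1] : "";
}

console.log(findTwoLetterWord("Montréal, QC")); // "QC"
console.log(findTwoLetterWord("Montréal"));     // ""
```

Unlike \b, this version treats digits and underscores as non-letters, so it is not a drop-in replacement in every case.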

By the way, there is an error in your regular expression. You have \b after the optional comma, but it should come before it:

 "\\b([a-z]{2})\\b,?"

I also removed the square brackets; you would only need those if the comma had a special meaning in regular expressions, which it does not. But I suspect that you don't need to match a comma at all; \b should be enough to make sure you are at the end of a word. And if you don't need the comma, you also don't need a capture group:

 "\\b[a-z]{2}\\b"
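A short usage sketch of that last pattern, with a made-up input string and illustrative variable names. Without a capture group, the matched word is read from index 0 of the result instead of index 1:

```javascript
// Sample input is only an example; the "i" flag makes [a-z] case-insensitive.
const reTwoLetters = new RegExp("\\b[a-z]{2}\\b", "i");

// With no capture group, the whole match is found at index 0.
const found = reTwoLetters.exec("Albany, NY 12207");
console.log(found ? found[0] : ""); // "NY"
```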

Beel · Answer 2 · 2010-09-12T05:10:14+0000

Have you set up your JavaScript to use a non-ASCII encoding? Here is a page that suggests configuring JavaScript to use UTF-8: http://blogs.oracle.com/shankar/entry/how_to_handle_utf_8

It says:

add the charset attribute (charset="utf-8") to the script tags on the parent page:
 <script type="text/javascript" src="[path]/myscript.js" charset="utf-8"></script>
