How do dashes work in regular expressions?

I'm curious about the algorithm for determining which characters to include in a regular expression when using - ...

 Example: [a-zA-Z0-9] 

This matches any letter from a to z, in either case, or any digit from 0 to 9.

I initially thought the dashes worked like macros, for example that a-z expands to a, b, c, d, e, etc., but then I saw the following in an open source project:

 text.tr('A-Za-z1-90', 'Ⓐ-Ⓩⓐ-ⓩ①-⑨⓪') 

and my understanding of regular expressions changed completely, because these are not your typical characters. How the heck did this work properly, I wondered.

My theory is that - literally means

Any ASCII value between the left character and the right character. (e.g. a-z is [97-122])

Can anyone confirm whether my theory is correct? Are regex ranges actually computed from the character codes between any two characters?

Also, if this is correct, you could write a regex like

 A-z 

because A is 65 and z is 122, so theoretically it should also match all the characters between these values.

+4
2 answers

From MSDN, regex character classes (bold mine):

The syntax for specifying a range of characters is as follows:

 [firstCharacter-lastCharacter] 

where firstCharacter is the character that begins the range and lastCharacter is the character that ends the range. A character range is a contiguous series of characters, defined by specifying the first character in the series, a hyphen ( - ), and then the last character in the series. Two characters are contiguous if they have adjacent Unicode code points.

So your assumption is correct, but the effect is actually broader: the range is defined over Unicode code points, not just ASCII.
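To make the code point idea concrete, here is a minimal Ruby sketch (Ruby because the snippet in the question is Ruby). The helper expand_range is a hypothetical name introduced only for illustration; it is not how any regex engine is actually implemented, it just lists the characters that a first-last range would cover.

```ruby
# Hypothetical helper: list every character whose code point lies
# between the code points of `first` and `last`, inclusive.
def expand_range(first, last)
  (first.ord..last.ord).map { |cp| cp.chr(Encoding::UTF_8) }
end

lowercase = expand_range('a', 'z')
puts lowercase.join   # abcdefghijklmnopqrstuvwxyz
puts 'a'.ord          # 97
puts 'z'.ord          # 122

# Every expanded character matches [a-z]; the neighbours just outside
# the interval (code points 96 and 123) do not.
all_inside   = lowercase.all? { |ch| ch.match?(/\A[a-z]\z/) }
none_outside = ['`', '{'].none? { |ch| ch.match?(/\A[a-z]\z/) }

# The same works outside ASCII: circled capitals run from U+24B6 to U+24CF.
circled_caps = expand_range("\u24B6", "\u24CF")
puts circled_caps.length   # 26
```

Nothing here depends on the characters being letters; the only thing that matters is the numeric order of the two endpoints.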

+4

Both of your assumptions are correct. (So technically you could write [#-~], and it would still be valid, matching uppercase letters, lowercase letters, digits, and certain symbols.)

ASCII table
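The [#-~] claim can be checked against the ASCII table with a short Ruby sketch. The variable names are illustrative only, and the check is limited to printable ASCII (code points 32 through 126).

```ruby
# Printable ASCII: space (32) through tilde (126).
printable = (32..126).map(&:chr)

# Keep only the single characters that the class [#-~] accepts.
matched = printable.grep(/\A[#-~]\z/)
missing = printable - matched

puts matched.length    # 92, i.e. code points 35 ('#') through 126 ('~')
puts missing.inspect   # [" ", "!", "\""]
```

So the class covers every printable ASCII character except the three that sit before # in the table: space, ! and ".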

You can also do this with Unicode, for example [\u0000-\u1000] .
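The tr call from the question is a good illustration of ranges outside ASCII. Strictly speaking String#tr is not a regex, but it accepts the same first-last range notation. The sketch below uses a made-up input string; only the two tr arguments come from the question.

```ruby
# 'A-Z' pairs with U+24B6..U+24CF, 'a-z' with U+24D0..U+24E9,
# and '1-9' with U+2460..U+2468; '0' is listed separately and pairs with U+24EA.
text = 'Regex 2019'
circled = text.tr('A-Za-z1-90', 'Ⓐ-Ⓩⓐ-ⓩ①-⑨⓪')
puts circled   # Ⓡⓔⓖⓔⓧ ②⓪①⑨

# The circled zero is not adjacent to the circled digits 1..9,
# which is why the digits are written as '1-90' rather than '0-9'.
puts "\u2468".ord.to_s(16)   # 2468 (circled 9)
puts "\u24EA".ord.to_s(16)   # 24ea (circled 0)
```

Each range on the left is matched position by position with the range on the right, so the nth letter of A-Z maps to the nth code point after Ⓐ, and characters not listed (such as the space) pass through unchanged.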

However, you should not use [A-z], because there are non-letter characters between the uppercase and lowercase letters (specifically [, \, ], ^, _, ` ).
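A small Ruby sketch of that gap, using only code point arithmetic and two character classes; the variable names are illustrative.

```ruby
# Code points strictly between 'Z' (90) and 'a' (97).
between = (('Z'.ord + 1)..('a'.ord - 1)).map(&:chr)
puts between.join(' ')   # [ \ ] ^ _ `

# [A-z] accepts all six of them, while [A-Za-z] accepts none.
wide_accepts_all    = between.all?  { |ch| ch.match?(/\A[A-z]\z/) }
strict_accepts_none = between.none? { |ch| ch.match?(/\A[A-Za-z]\z/) }
puts wide_accepts_all      # true
puts strict_accepts_none   # true

# [A-z] spans 58 code points, but only 52 of them are letters.
puts 'z'.ord - 'A'.ord + 1   # 58
```

Writing the two ranges separately, as in [A-Za-z], is what keeps those six characters out.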

+4

Source: https://habr.com/ru/post/1487555/

