How do I allow only a subset of Unicode code points when validating input?

I am building a service that may expand into international markets where English is not spoken. I don't want to restrict usernames to the ASCII character range; I would like users to be able to choose their "natural" username. Fine, so I use Unicode (with, say, UTF-8 as the text encoding for the names).

But! I don't want users to create nonsense usernames made up of symbol characters. For example, I don't want to allow a username like √√√√√øøøøøøø.

Is there a list of Unicode code points that I can check against (possibly with a regex) to accept or reject such a username?

Thanks!

2 answers

Unicode assigns every character a general category, so you can easily exclude characters by category. Exactly how you do this depends on the language you use: some regex engines have built-in support for categories, and some do not.
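To get a feel for what the categories look like, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe` is made up for illustration; it just lists the general category of each character in a string.

```python
import unicodedata

def describe(text):
    """Return a list of (character, general category) pairs (hypothetical helper)."""
    return [(ch, unicodedata.category(ch)) for ch in text]

# Characters taken from the question, plus a plain letter and a digit.
for ch, cat in describe("a√ø7"):
    print(ch, cat)
# a  -> Ll (lowercase letter)
# √  -> Sm (math symbol)
# ø  -> Ll (lowercase letter)
# 7  -> Nd (decimal digit)
```

Note that ø is an ordinary lowercase letter as far as Unicode is concerned, so a category check will reject √ but not ø.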


In Python (per "Introductory checking Unicode text in free form in Python"):

import unicodedata

def only_letters(s):
    """
    Returns True if the input text consists of letters and ideographs only, False otherwise.
    """
    for c in s:
        cat = unicodedata.category(c)
        # Ll = lowercase letter, Lu = uppercase letter,
        # Lo = other letter (caseless scripts, including ideographs)
        if cat not in ('Ll', 'Lu', 'Lo'):
            return False
    return True

>>> only_letters('Bzdrężyło')
True
>>> only_letters('He7lo')  # we don't allow digits here
False
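A strict letters-only check also rejects names in scripts that rely on combining marks (categories Mn and Mc), such as Devanagari. The following is a rough sketch of a more permissive variant, not a definitive policy. The function name `is_acceptable_username`, the 32-character limit, the decision to allow decimal digits, and the rule that a name must start with a letter are all assumptions made for illustration.

```python
import unicodedata

# Assumption: letters (L*) and combining marks (M*) are allowed anywhere
# after the first character, and so are decimal digits (Nd).
ALLOWED_MAJOR = {"L", "M"}
ALLOWED_EXACT = {"Nd"}

def is_acceptable_username(name, max_len=32):
    """Hypothetical validator: True if name passes the category whitelist."""
    # Normalize first so that "e" + combining accent becomes a single letter.
    name = unicodedata.normalize("NFC", name)
    if not name or len(name) > max_len:
        return False
    # Assumption: the first character must be a letter.
    if not unicodedata.category(name[0]).startswith("L"):
        return False
    for ch in name:
        cat = unicodedata.category(ch)
        if cat[0] not in ALLOWED_MAJOR and cat not in ALLOWED_EXACT:
            return False
    return True

print(is_acceptable_username("Bzdrężyło"))      # True
print(is_acceptable_username("हिन्दी"))           # True (contains Mn and Mc marks)
print(is_acceptable_username("√√√√√øøøøøøø"))   # False (√ is a math symbol, Sm)
print(is_acceptable_username("øøøø"))           # True (ø is a lowercase letter)
```

As the last line shows, category filtering only removes symbols and punctuation. A name made entirely of valid letters still passes, so catching those would need a separate rule.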

Source: https://habr.com/ru/post/1719408/

