How to convert a complex binary Perl regular expression to C# or PowerShell?

This Perl binary regular expression, found at http://www.w3.org/International/questions/qa-forms-utf-8.en.php, matches UTF-8 documents that do not start with a UTF-8 byte order mark (BOM):

$field =~
m/\A(
 [\x09\x0A\x0D\x20-\x7E]            # ASCII
 | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
 |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
 | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
 |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
 |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
 | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
 |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*\z/x;

I need this because I'm working on a PowerShell equivalent of 'grep -I', and part of that involves text encoding detection.

But how do I rewrite this in C# or PowerShell? Or, in other words, in the .NET Regex syntax?

Related thread on the MSDN Regex forum: http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/6a81be63-e6da-4156-a5bf-8b9782a1ac40

+3

Something like this (tried in LINQPad):

new Regex(@"
    ^(
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$", RegexOptions.IgnorePatternWhitespace)

ASCII StreamReader; , . ( , )
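
A .NET Regex matches UTF-16 chars rather than bytes, so the pattern only behaves like the Perl original if every char of the input string carries exactly one raw byte value. A minimal sketch of one way to arrange that, assuming the whole file is already in a byte array: decode the bytes as ISO-8859-1, which maps each byte to the char with the same code point. The class and method names here (Utf8RegexCheck, LooksLikeUtf8) are made up for illustration. Encoding.ASCII would not serve for this step, since it decodes every byte above 0x7F to '?'.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class Utf8RegexCheck
{
    // Same alternatives as the pattern above, anchored with \A and \z
    // as in the Perl original.
    static readonly Regex Utf8Bytes = new Regex(@"
        \A(
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*\z", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper (not from the original answer): maps each byte to
    // the char with the same code point, so the regex sees raw byte values.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        string raw = Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        return Utf8Bytes.IsMatch(raw);
    }
}
```

Usage would be along the lines of Utf8RegexCheck.LooksLikeUtf8(File.ReadAllBytes(path)). For large files this is probably slow, because the pattern consumes one alternative per character, so checking only a leading chunk of the file may be more practical.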

+1

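Another option is to try decoding the bytes as UTF-8 and, if that fails, treat the input as not being UTF-8. The detection could also be made an explicit option of the command (for example, mycommand -autodetect).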

    string result = String.Empty;
    // GetEncoding expects a registered name such as "utf-8" (WebName);
    // EncodingName is the display name "Unicode (UTF-8)" and is not accepted.
    Encoding ae = Encoding.GetEncoding(
        Encoding.UTF8.WebName,
        new EncoderExceptionFallback(),
        new DecoderExceptionFallback());
    try
    {
        // Throws DecoderFallbackException on any invalid UTF-8 byte sequence.
        result = ae.GetString(mybytes);
    }
    catch (DecoderFallbackException)
    {
        // Revert to some sensible default. Maybe the ANSI code page for this environment?
        // This will use the substitution fallback mechanism, which usually replaces unknown characters with question marks.
        result = Encoding.Default.GetString(mybytes);
    }

If you can interact with unmanaged code, explore the MLANG dll that ships with IE. It has alternative encoding auto-detection methods that may be more useful.

+1

What exactly are you trying to do?

You should use the System.Text.Encoding class.
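
As a rough sketch of what that could look like for the 'grep -I' case: a UTF8Encoding constructed with throwOnInvalidBytes set to true throws on malformed input instead of substituting replacement characters, so it can double as a validity check. The wrapper names (StrictUtf8, TryDecode) are hypothetical.

```csharp
using System;
using System.Text;

static class StrictUtf8
{
    // Hypothetical helper: returns true and the decoded text when the bytes
    // are well-formed UTF-8, false otherwise.
    public static bool TryDecode(byte[] bytes, out string text)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently replacing invalid sequences.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            text = strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = null;
            return false;
        }
    }
}
```

Unlike the regular expression in the question, this accepts any well-formed UTF-8, including control characters such as NUL, so a binary-file check for 'grep -I' would probably still need a separate test for those.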

0

Source: https://habr.com/ru/post/1712144/

