TFile.ReadAllText with TEncoding.UTF8 skips the first 3 characters

I have a UTF-8 text file that starts with this line:

<HEAD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body> 

When I read this file with TFile.ReadAllText, passing TEncoding.UTF8:

 MyStr := TFile.ReadAllText(ThisFileNamePath, TEncoding.UTF8); 

then the first 3 characters of the text file are omitted, so MyStr contains:

 'AD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body>...' 

However, when I read this file with TFile.ReadAllText without TEncoding.UTF8:

 MyStr := TFile.ReadAllText(ThisFileNamePath); 

then the file is read fully and correctly:

 <HEAD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body>... 

Is this a bug in TFile.ReadAllText?

1 answer

The first three bytes are skipped because the RTL code assumes that the file starts with a UTF-8 BOM (byte order mark). Evidently your file does not.

The TUTF8Encoding class implements the GetPreamble method, which returns the UTF-8 BOM. And ReadAllText skips the preamble reported by the encoding that you pass, without checking whether the file actually begins with it.
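To see what the preamble actually is, a small console program can print it. This is only a sketch, assuming a Delphi version with unit scope names (System.SysUtils); the program name is made up for illustration.

```delphi
program ShowUtf8Preamble;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Preamble: TBytes;
  B: Byte;
begin
  // GetPreamble returns the byte sequence the encoding treats as its BOM.
  Preamble := TEncoding.UTF8.GetPreamble;
  Writeln('Preamble length: ', Length(Preamble));
  // Print each preamble byte as two hex digits.
  for B in Preamble do
    Write(IntToHex(B, 2), ' ');
  Writeln;
end.
```

For TEncoding.UTF8 this should report a length of 3 and the bytes EF BB BF, which matches the three characters missing from the start of the string in the question.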

One simple solution would be to read the file into a byte array, and then use TEncoding.UTF8.GetString to decode it to a string.

    var
      Bytes: TBytes;
      Str: string;
    ....
    Bytes := TFile.ReadAllBytes(FileName);
    Str := TEncoding.UTF8.GetString(Bytes);
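If some of your files might carry a BOM and others not, the same byte-array approach can be extended to skip the BOM only when it is really there. The following is a minimal sketch, not RTL code: ReadUtf8Text and the program name are hypothetical, and it assumes a Delphi version with unit scope names.

```delphi
program ReadUtf8Demo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

// Hypothetical helper: decodes a file as UTF-8 and skips the BOM
// only if the file actually starts with it.
function ReadUtf8Text(const FileName: string): string;
var
  Bytes, Preamble: TBytes;
  Offset: Integer;
begin
  Bytes := TFile.ReadAllBytes(FileName);
  Preamble := TEncoding.UTF8.GetPreamble;
  Offset := 0;
  // Compare the leading bytes of the file with the preamble.
  if (Length(Preamble) > 0) and (Length(Bytes) >= Length(Preamble)) and
     CompareMem(@Bytes[0], @Preamble[0], Length(Preamble)) then
    Offset := Length(Preamble);
  // Nothing left to decode (empty file, or a file holding only a BOM).
  if Length(Bytes) = Offset then
    Exit('');
  Result := TEncoding.UTF8.GetString(Bytes, Offset, Length(Bytes) - Offset);
end;

begin
  // Usage: pass the file name as the first command-line argument.
  if ParamCount >= 1 then
    Writeln(Length(ReadUtf8Text(ParamStr(1))));
end.
```

With the file from the question, which has no BOM, Offset stays 0 and the string keeps its leading '<HE'.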

A more comprehensive alternative would be to define a TEncoding descendant that reports no UTF-8 BOM.

    type
      TUTF8EncodingWithoutBOM = class(TUTF8Encoding)
      public
        function Clone: TEncoding; override;
        function GetPreamble: TBytes; override;
      end;

    function TUTF8EncodingWithoutBOM.Clone: TEncoding;
    begin
      Result := TUTF8EncodingWithoutBOM.Create;
    end;

    function TUTF8EncodingWithoutBOM.GetPreamble: TBytes;
    begin
      // An empty preamble means there is nothing for the reader to skip.
      Result := nil;
    end;

Instantiate it once (a single instance per process is enough) and pass that instance to TFile.ReadAllText.

The advantage of using a single instance of TUTF8EncodingWithoutBOM is that you can use it anywhere TEncoding is expected.
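One way to wire up such a shared instance is a small unit that creates it at startup and frees it at shutdown. This is a sketch under assumptions: the unit name Utf8NoBom, the accessor UTF8WithoutBOM and the variable GInstance are all invented here, and the class is repeated only so that the unit compiles on its own.

```delphi
unit Utf8NoBom;

interface

uses
  System.SysUtils;

type
  TUTF8EncodingWithoutBOM = class(TUTF8Encoding)
  public
    function Clone: TEncoding; override;
    function GetPreamble: TBytes; override;
  end;

// Returns the shared instance. Callers must not free it.
function UTF8WithoutBOM: TEncoding;

implementation

var
  GInstance: TEncoding; // hypothetical holder for the single instance

function TUTF8EncodingWithoutBOM.Clone: TEncoding;
begin
  Result := TUTF8EncodingWithoutBOM.Create;
end;

function TUTF8EncodingWithoutBOM.GetPreamble: TBytes;
begin
  Result := nil;
end;

function UTF8WithoutBOM: TEncoding;
begin
  Result := GInstance;
end;

initialization
  GInstance := TUTF8EncodingWithoutBOM.Create;

finalization
  GInstance.Free;

end.
```

The call from the question would then become MyStr := TFile.ReadAllText(ThisFileNamePath, UTF8WithoutBOM). Because this encoding reports no preamble, a file that does begin with a BOM would presumably come back with a leading U+FEFF character, so this suits files known to have no BOM.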


Source: https://habr.com/ru/post/947401/

