Character encoding of data submitted from a PDF AcroForm

When I create a PDF form (for example, with Acrobat) that contains text fields in AcroForm format (PDF dictionaries, not XFA) and submit its data to a server, how can I specify or detect the encoding that is used?

For instance, when I send the Chinese characters 测试 (test), I get the following headers and body on the server side:

accept: application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
content-type: application/x-www-form-urlencoded
content-length: 23
acrobat-version: 10.1.4
user-agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MDDC; .NET4.0C; AskTbCLA/5.15.1.22229)
accept-encoding: gzip, deflate
connection: Keep-Alive

Song=%b2%e2%ca%d4&Test=

The only content type information is application/x-www-form-urlencoded; no charset is given. The two characters arrive as four bytes: B2 E2 CA D4. After some research I found that B2E2 is the GBK code for the first character and CAD4 is the GBK code for the second, but nothing in the request headers tells me that.
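The byte-level observation can be checked with a short script. This is a minimal sketch, not part of the original question: the helper names percentDecodeToBytes and decodeAs are hypothetical, and it assumes a runtime with a global TextDecoder (such as a recent Node.js). Decoding with the "gbk" label only works when the runtime ships the legacy encodings (full ICU in Node.js), so the helper returns null when it cannot decode.

```javascript
// Percent-decode a form value into raw bytes without assuming any charset.
function percentDecodeToBytes(value) {
  const bytes = [];
  for (let i = 0; i < value.length; i++) {
    if (value[i] === "%") {
      bytes.push(parseInt(value.slice(i + 1, i + 3), 16));
      i += 2; // skip the two hex digits just consumed
    } else if (value[i] === "+") {
      bytes.push(0x20); // '+' stands for a space in form data
    } else {
      bytes.push(value.charCodeAt(i));
    }
  }
  return Uint8Array.from(bytes);
}

// Strictly decode bytes with the given label.
// Returns null if the label is unsupported or the bytes are invalid for it.
function decodeAs(bytes, label) {
  try {
    return new TextDecoder(label, { fatal: true }).decode(bytes);
  } catch (e) {
    return null;
  }
}

const raw = percentDecodeToBytes("%b2%e2%ca%d4");
const hex = Array.from(raw, (b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(hex); // b2 e2 ca d4

// The bytes are not valid UTF-8, so strict decoding fails.
const utf8Text = decodeAs(raw, "utf-8");
console.log(utf8Text); // null

// 测试 when the runtime supports the "gbk" label, otherwise null.
const gbkText = decodeAs(raw, "gbk");
console.log(gbkText);

// For comparison, the same two characters percent-encoded as UTF-8:
console.log(encodeURIComponent("测试")); // %E6%B5%8B%E8%AF%95
```

A UTF-8 submission of the same value would therefore be six bytes long rather than four, which is one way to tell the two encodings apart on the server when no charset is declared.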

Is it always GBK? I would like to change the submission encoding by setting a specific key in a PDF dictionary, but there does not seem to be one. For example, I would like to make sure the PDF always sends Unicode characters instead of GBK.

Note that I have already experimented with changing the default font (and its encoding) of the text field. I also searched ISO 32000-1 for field encodings, but all I found was a way to define non-Latin characters for checkboxes and some information about the encoding of FDF files. Neither answered my question.

1 answer

I just found the answer to my main question. I did not find anything in ISO 32000-1 or in the ISO 32000-2 draft, but while studying the Acrobat JavaScript reference I found the cCharset parameter of the submitForm() method. The reference describes this parameter as:

The encoding for the values submitted. String values are utf-8, utf-16, Shift-JIS, BigFive, GBK, and UHC. If not passed, the current Acrobat behavior applies. For XML-based formats, utf-8 is used. For other formats, Acrobat tries to find the best host encoding for the values being submitted. XFDF submission ignores this value and always uses utf-8.

In other words: in my case GBK was used because Acrobat judged it the best host encoding for the submitted Chinese characters. However, you can force UTF-8 by submitting the form through the submitForm() JavaScript method with cCharset set to utf-8.
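As a rough illustration, the call could look like the sketch below. The parameter names cURL, cSubmitAs and cCharset come from the Acrobat JavaScript reference; the helper buildSubmitArgs and the URL are hypothetical placeholders, and the actual submitForm() call is left as a comment because it only exists inside Acrobat, where `this` is the document object.

```javascript
// Hypothetical helper: builds the argument object for Doc.submitForm().
function buildSubmitArgs(url, charset) {
  return {
    cURL: url,          // where the form data is posted (placeholder URL below)
    cSubmitAs: "HTML",  // submit as application/x-www-form-urlencoded
    cCharset: charset,  // one of the values listed in the reference, e.g. "utf-8"
  };
}

const args = buildSubmitArgs("https://example.com/submit", "utf-8");
console.log(JSON.stringify(args));

// Inside Acrobat, for example in a submit button's Mouse Up action:
// this.submitForm(args);
```

Building the argument object separately keeps the encoding choice in one place, so several submit buttons in the same form can share it.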

Based on this question, I asked the ISO committee to address this problem in ISO 32000-2. As a result, an additional optional entry was added to the table "Additional entries specific to a submit-form action" in section 12.7.6.2:

CharSet : string

(Optional; inherited) Possible values: utf-8, utf-16, Shift-JIS, BigFive, GBK or UHC.

Starting with PDF 2.0, this problem will no longer exist.


Source: https://habr.com/ru/post/1436395/

