What strategies exist for escaping characters?

We do natural language processing on a range of English-language documents (mostly scientific) and run into problems carrying non-ANSI characters through the various components. The documents may be "ASCII", Unicode, PDF or HTML. At this stage we cannot predict which tools will be in our chain, or whether they will allow character encodings other than ANSI. Even ISO Latin characters expressed in Unicode cause problems (for example, displaying incorrectly in browsers). We are likely to encounter a range of symbols, including mathematical and Greek ones. We would like to "flatten" them into a text string that survives multi-step processing (including XML and regular expression tools) and then, possibly, reconstitute the original characters in the last step (although we are concerned with semantics rather than typography, so this is a minor issue).

I appreciate that there is no absolute answer (any escaping can clash in some cases), but I am looking for something along the lines of XML's <![CDATA[ ...]]> that will survive most non-recursive XML operations. Characters such as [ are bad because they are common in regular expressions. So I wonder whether there is a generally accepted approach, rather than inventing our own.

A typical example is the degree symbol:

HTML Entity (decimal)               &#176;
HTML Entity (hex)                   &#xb0;
HTML Entity (named)                 &deg;
How to type in Microsoft Windows    Alt +00B0
                                    Alt 0176
                                    Alt 248
UTF-8 (hex)                         0xC2 0xB0 (c2b0)
UTF-8 (binary)                      11000010:10110000
UTF-16 (hex)                        0x00B0 (00b0)
UTF-16 (decimal)                    176
UTF-32 (hex)                        0x000000B0 (00b0)
UTF-32 (decimal)                    176
C/C++/Java source code              "\u00B0"
Python source code                  u"\u00B0"

We are also likely to encounter TeX:

$10\,^{\circ}{\rm C}$

or

\degree

so backslashes, curly braces and dollar signs are a poor choice of escape character.

We could, for example, use markup like:

__deg__
__#176__

and it will probably work, but I would be grateful for advice from anyone who has had similar problems.
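As a rough illustration of the `__#176__` form suggested above, here is a minimal Python sketch. The function names `flatten` and `restore` and the exact token pattern are assumptions made for this example, not an established convention. It replaces every non-ASCII character with a decimal code point token and reverses that in a final step.

```python
import re

# Hypothetical token format taken from the question: __#176__ for code point 176.
TOKEN = re.compile(r"__#(\d+)__")

def flatten(text):
    """Replace every non-ASCII character with a __#<decimal code point>__ token."""
    return "".join(ch if ord(ch) < 128 else "__#%d__" % ord(ch) for ch in text)

def restore(text):
    """Turn __#<decimal>__ tokens back into the characters they stand for."""
    return TOKEN.sub(lambda m: chr(int(m.group(1))), text)

flat = flatten("10 \u00b0C")
print(flat)                          # 10 __#176__C
print(restore(flat) == "10 \u00b0C")  # True
```

One weakness of this sketch: input text that already contains something shaped like `__#176__` would be converted on restore, so the token form has to be one that cannot occur in the documents themselves.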

update @MichaelB, UTF-8 . , , , . , - .


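If you need an escape that is very unlikely to be mangled along the way, you could encode the problematic text with something like base32 and wrap it in a marker.

For example, a sentence might become: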

the value of the temperature was 18 cd48d8c50d7f40aeb6a164181b17feee EZSGKZY= cd48d8c50d7f40aeb6a164181b17feee

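The delimiter cd48d8c50d7f40aeb6a164181b17feee is a uuid, and the payload is base32. Both use only letters, digits and (for base32 padding) the equals sign, so they contain nothing that XML or regular expression tools treat specially.

The two copies of the uuid mark where the escaped text starts and ends, with the base32 in between.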
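To make the example line concrete: `EZSGKZY=` is the base32 encoding of the four bytes `&deg`. The sketch below shows one way such a wrapped token could be produced in Python. The helper name `wrap` and the choice of `uuid.uuid4().hex` for the marker are assumptions for illustration only.

```python
import base64
import uuid

def wrap(payload, marker):
    """Return '<marker> <base32 of payload> <marker>' as plain ASCII."""
    encoded = base64.b32encode(payload.encode("utf-8")).decode("ascii")
    return "%s %s %s" % (marker, encoded, marker)

# Reproduce the example line with its fixed marker.
print(wrap("&deg", "cd48d8c50d7f40aeb6a164181b17feee"))
# cd48d8c50d7f40aeb6a164181b17feee EZSGKZY= cd48d8c50d7f40aeb6a164181b17feee

# In practice a fresh marker could be generated per document (assumption).
marker = uuid.uuid4().hex   # 32 hex digits, no hyphens
print("the value of the temperature was 18 " + wrap("&deg", marker))
```

Because the payload is encoded from UTF-8 bytes, the same helper would also accept the raw character (for example "\u00b0") instead of an entity name.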

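The marker is easy to find again with a regular expression that matches a uuid and then a backreference to it. For example, with a hyphenated uuid:

>>> import re
>>> # s is an assumed input; the original session did not show how it was built
>>> s = "the value of the temperature was 18 6d378205-1265-44e4-80b8-a47d1ceaad51 EZSGKZY= 6d378205-1265-44e4-80b8-a47d1ceaad51"
>>> re.search(r"(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})(.*?)(\1)", s)
<_sre.SRE_Match object at 0x1003d31f8>
>>> _.groups()
('6d378205-1265-44e4-80b8-a47d1ceaad51', ' EZSGKZY= ', '6d378205-1265-44e4-80b8-a47d1ceaad51')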
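The last step of the chain would reverse the wrapping. The following sketch reuses the shape of the pattern above (a hyphenated uuid, a payload, the same uuid again) and decodes the base32 payload. The name `unwrap` and the single whitespace character on each side of the payload are assumptions for this example.

```python
import base64
import re

# A hyphenated uuid, whitespace, the payload, whitespace, the same uuid again.
WRAPPED = re.compile(r"(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})\s(.*?)\s\1")

def unwrap(text):
    """Replace each '<uuid> <base32> <uuid>' run with the decoded payload."""
    def decode(match):
        return base64.b32decode(match.group(2)).decode("utf-8")
    return WRAPPED.sub(decode, text)

s = ("the value of the temperature was 18 "
     "6d378205-1265-44e4-80b8-a47d1ceaad51 EZSGKZY= "
     "6d378205-1265-44e4-80b8-a47d1ceaad51")
print(unwrap(s))   # the value of the temperature was 18 &deg
```

Text without a matching pair of uuids passes through unchanged, so the function can be applied to whole documents.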

For a more recognizable marker, you can use uuid1 with a fixed node:

>>> import uuid
>>> uuid.uuid1(node=0x1234567890)
UUID('bdcce554-e95d-11de-bd0f-001234567890')
>>> uuid.uuid1(node=0x1234567890)
UUID('c4c57a91-e95d-11de-90ca-001234567890')

With a fixed node, every uuid you generate ends in the same 12 hex digits, so your own markers are easy to recognize.
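A small sketch of how the fixed node could be used to tell your own markers apart from other uuids that happen to appear in the text. The helper names `new_marker` and `is_our_marker` are hypothetical; `NODE` is the value from the session above.

```python
import uuid

NODE = 0x1234567890   # the node value used in the session above

def new_marker():
    """Generate a time-based uuid whose last 12 hex digits encode NODE."""
    return str(uuid.uuid1(node=NODE))

def is_our_marker(token):
    """True if token parses as a uuid and carries our node value."""
    try:
        return uuid.UUID(token).node == NODE
    except ValueError:
        return False

m = new_marker()
print(m.endswith("001234567890"))   # True
print(is_our_marker(m))             # True
print(is_our_marker("6d378205-1265-44e4-80b8-a47d1ceaad51"))   # False
```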


Source: https://habr.com/ru/post/1725696/

