Converting encoding with iconv on Linux

I usually convert encodings with iconv, but today I ran into something new to me.
I put together a test to make my question clear:

The goal is to convert &#1575;&#1604;&#1581;&#1604;&#1602;&#1577; &#1575;&#1604;&#1579;&#1575;&#1604;&#1579;&#1577; to its UTF-8 version: الحلقة الثالثة

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title> this text is from arabic language   </title>
</head>
<body>
<p><span> &#1575;&#1604;&#1581;&#1604;&#1602;&#1577; &#1575;&#1604;&#1579;&#1575;&#1604;&#1579;&#1577;</span></p>
</body>
</html>

I tried treating the source encoding as ASCII, LATIN1, and windows-1252, but no luck. How can I tell what kind of encoding this is so that I can convert it? And how were both Google Translate and the Stack Overflow editor able to detect and convert it?

Another example: this site, http://kanjidict.stc.cx/recode.php, was able to convert the encoding correctly if I checked "Assume HTML (default: handle as plain text)".

What am I missing, and what did these three websites do to convert it correctly?

+3

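You can use ascii2uni for this.

Install it (on Debian/Ubuntu it ships in the uni2ascii package): sudo apt-get install uni2ascii

Then convert the references to Unicode: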

ascii2uni -a D source.html > target.html
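
As background on why iconv with ASCII, LATIN1, or windows-1252 changes nothing here: the &#NNNN; sequences are HTML numeric character references written in plain ASCII, so there is no legacy encoding left to detect. A minimal Python sketch of that observation, using the sample string from the question (the variable names are only illustrative):

```python
# The sample from the question: decimal numeric character references.
sample = (
    "&#1575;&#1604;&#1581;&#1604;&#1602;&#1577; "
    "&#1575;&#1604;&#1579;&#1575;&#1604;&#1579;&#1577;"
)

# Every character in the markup itself is 7-bit ASCII.
print(sample.isascii())  # True (str.isascii needs Python 3.7+)

raw = sample.encode("ascii")

# ASCII-compatible encodings all decode these bytes to the same string,
# so changing the source encoding passed to iconv cannot make a difference.
decoded = {enc: raw.decode(enc) for enc in ("ascii", "latin-1", "cp1252", "utf-8")}
print(all(text == sample for text in decoded.values()))  # True
```

The Arabic text only appears once something interprets the references themselves, which is what the -a D option (decimal HTML references) asks ascii2uni to do.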

+5

You can also do this in Python 3.

For decimal references:

>>> import re
>>> s = r'&#65;&#223;&#254;'
>>> r = re.compile(r'&#(\d+);')
>>> r.sub(lambda m:chr(int(m.group(1))), s)
'Aßþ'

To handle hexadecimal references as well:

>>> import re
>>> s = r'&#x41;&#223;&#xFE;'
>>> r = re.compile(r'&#(x?)(\w+);')
>>> r.sub(lambda m:chr(int(m.group(2), 10 if not m.group(1) else 16)), s)
'Aßþ'
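
As an alternative to a hand-written regex, the standard library's html.unescape (Python 3.4+) decodes decimal, hexadecimal, and named references in one call. A minimal sketch, assuming the input is a text fragment rather than a whole HTML page; the variable names are only illustrative:

```python
import html

# Mixed hexadecimal and decimal references, as in the example above.
latin = html.unescape("&#x41;&#223;&#xFE;")
print(latin)  # Aßþ

# The first word of the Arabic sample from the question.
arabic = html.unescape("&#1575;&#1604;&#1581;&#1604;&#1602;&#1577;")
print(arabic == "".join(map(chr, [1575, 1604, 1581, 1604, 1602, 1577])))  # True
```

Note that html.unescape also turns &lt; and &amp; into < and &, so running it over a complete page can change the markup.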
+2
recode html..utf8
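
For comparison, a file-level conversion could be sketched in Python roughly as follows. This is not a reimplementation of recode: it only decodes numeric references above the ASCII range and leaves everything else untouched. The function names, the file names in the usage note, and the cp1252 default for the input encoding are assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches decimal (&#1575;) and hexadecimal (&#x627;) numeric references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_non_ascii_refs(text: str) -> str:
    """Replace numeric references to non-ASCII code points with the characters."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep ASCII references (e.g. &#60; for "<") and invalid code points as-is,
        # so that markup-significant characters stay escaped.
        if code < 128 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)

    return NUMERIC_REF.sub(replace, text)


def convert_file(source: str, target: str, source_encoding: str = "cp1252") -> None:
    """Read `source`, decode its numeric references, and write `target` as UTF-8."""
    text = Path(source).read_text(encoding=source_encoding)
    Path(target).write_text(decode_non_ascii_refs(text), encoding="utf-8")
```

Calling convert_file("source.html", "target.html") would write the page as UTF-8. The charset=windows-1252 declaration in the <meta> tag would still need to be changed to utf-8 by hand, otherwise a browser may misread the new file.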

, PLS , , .

+1

Source: https://habr.com/ru/post/1784519/

