Parse ½ as 0.5 in Python 2.7

I scraped this link using BeautifulSoup4

I parse the HTML page like this:

 page = BeautifulSoup(page.replace('ISO-8859-1', 'utf-8'),"html5lib") 

The page contains values like these: -4 -115 (separated by - )

I want both values in a list, so I use this regex:

 value = re.findall(r'[+-]?\d+', value) 

It works fine, but not for these values +2½ -102 , I only get [-102]

To handle this, I also tried the following:

 value = value.replace("½","0.5")
 value = re.findall(r'[+-]?\d+', value)

but this gives me an encoding error saying that I need to set the encoding of my file.

I also tried adding an encoding declaration (utf-8) at the top of the file, but I still get the same error.

How do I convert ½ to 0.5?

3 answers

To embed non-ASCII characters such as ½ in string literals in your Python 2 script, you need a special comment at the top of the script that tells the interpreter which encoding the source file uses. If you want to use UTF-8, you also need to tell your editor to save the file as UTF-8. And if you want to print Unicode text, make sure your terminal is configured to use UTF-8 as well.

Here is a short example tested on Python 2.6.6

 # -*- coding: utf-8 -*-
 value = "a string with fractions like 2½ in it"
 value = value.replace("½",".5")
 print(value)

Output

 a string with fractions like 2.5 in it 

Note that I use ".5" as the replacement string; using "0.5" would convert "2½" to "20.5", which is wrong.


Actually, these strings should be marked as Unicode strings, for example:

 # -*- coding: utf-8 -*-
 value = u"a string with fractions like 2½ in it"
 value = value.replace(u"½", u".5")
 print(value)

For more information about using Unicode in Python, see Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
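A common habit in this area is to decode bytes to Unicode once, at the boundary, and to work only with Unicode text afterwards. Here is a minimal sketch, assuming the raw input arrives as UTF-8 bytes; the `raw` value is made up for illustration, and escapes are used instead of literal characters so the snippet does not depend on how the file itself is saved:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical UTF-8 encoded input; b"\xc2\xbd" is the UTF-8 byte sequence for U+00BD.
raw = b"+2\xc2\xbd -102"

# Decode once, at the boundary, then operate on Unicode text only.
text = raw.decode("utf-8")
cleaned = text.replace(u"\u00bd", u".5")

print(cleaned)
```

Mixing byte strings and Unicode strings in one `replace` call is a typical source of encoding errors in Python 2, which is why both arguments are Unicode here.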


I should also mention that you need to change your regular expression so that it accepts a decimal point in numbers. For instance:

 # -*- coding: utf-8 -*-
 from __future__ import print_function
 import re
 pat = re.compile(r'[-+]?(?:\d*?[.])?\d+', re.U)
 data = u"+2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114"
 print(data)
 print(pat.findall(data.replace(u"½", u".5")))

Output

 +2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114
 [u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-102', u'-2.5', u'-114']
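Since the goal is numeric values rather than strings, the matches can be passed to `float`. The following is only a sketch built on the same pattern; the helper name `to_numbers` and the sample input are assumptions made for illustration:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Same pattern as above: optional sign, optional "digits plus dot", then digits.
pat = re.compile(r'[-+]?(?:\d*?[.])?\d+', re.U)

def to_numbers(text):
    """Hypothetical helper: replace the half sign and return the numbers as floats."""
    normalized = text.replace(u"\u00bd", u".5")
    return [float(token) for token in pat.findall(normalized)]

values = to_numbers(u"+2\u00bd -102 -\u00bd -114")
print(values)
```

`float` accepts strings such as "+2.5" and "-.5", so a bare "-½" is handled as well.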

Unicode has more vulgar fractions than just ½. Here is some code that parses all of them:

 # coding=utf8
 import re
 # curl -s "http://www.unicode.org/Public/UNIDATA/extracted/DerivedNumericValues.txt" | grep "VULGAR FRACTION"
 fractions = {
     0x2189: 0.0,         # 0    VULGAR FRACTION ZERO THIRDS
     0x2152: 0.1,         # 1/10 VULGAR FRACTION ONE TENTH
     0x2151: 0.11111111,  # 1/9  VULGAR FRACTION ONE NINTH
     0x215B: 0.125,       # 1/8  VULGAR FRACTION ONE EIGHTH
     0x2150: 0.14285714,  # 1/7  VULGAR FRACTION ONE SEVENTH
     0x2159: 0.16666667,  # 1/6  VULGAR FRACTION ONE SIXTH
     0x2155: 0.2,         # 1/5  VULGAR FRACTION ONE FIFTH
     0x00BC: 0.25,        # 1/4  VULGAR FRACTION ONE QUARTER
     0x2153: 0.33333333,  # 1/3  VULGAR FRACTION ONE THIRD
     0x215C: 0.375,       # 3/8  VULGAR FRACTION THREE EIGHTHS
     0x2156: 0.4,         # 2/5  VULGAR FRACTION TWO FIFTHS
     0x00BD: 0.5,         # 1/2  VULGAR FRACTION ONE HALF
     0x2157: 0.6,         # 3/5  VULGAR FRACTION THREE FIFTHS
     0x215D: 0.625,       # 5/8  VULGAR FRACTION FIVE EIGHTHS
     0x2154: 0.66666667,  # 2/3  VULGAR FRACTION TWO THIRDS
     0x00BE: 0.75,        # 3/4  VULGAR FRACTION THREE QUARTERS
     0x2158: 0.8,         # 4/5  VULGAR FRACTION FOUR FIFTHS
     0x215A: 0.83333333,  # 5/6  VULGAR FRACTION FIVE SIXTHS
     0x215E: 0.875,       # 7/8  VULGAR FRACTION SEVEN EIGHTHS
 }
 rx = r'(?u)([+-])?(\d*)(%s)' % '|'.join(map(unichr, fractions))
 test = u'15⅑ and ¼ and +212½ and -⅜'
 for sign, d, f in re.findall(rx, test):
     sign = -1 if sign == '-' else 1
     d = int(d) if d else 0
     number = sign * (d + fractions[ord(f)])
     print 'found', number
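As an alternative to a hand-copied table, the standard `unicodedata` module can supply the same values. The sketch below is not the answer's code: it assumes that scanning code points U+00A0 to U+21FF is enough to cover the vulgar fractions listed above, and the helper name `parse_mixed` is made up. It is written to run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re
import unicodedata

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3

# Collect every character whose Unicode name starts with "VULGAR FRACTION".
# Assumption: the scanned range covers U+00BC..U+00BE, U+2150..U+215E and U+2189.
fractions = {}
for codepoint in range(0x00A0, 0x2200):
    char = unichr(codepoint)
    if unicodedata.name(char, '').startswith('VULGAR FRACTION'):
        fractions[char] = unicodedata.numeric(char)

# Optional sign, optional integer part, then exactly one fraction character.
rx = re.compile(r'([+-])?(\d*)([%s])' % ''.join(fractions), re.U)

def parse_mixed(text):
    """Hypothetical helper: return numbers such as +212 1/2 as floats."""
    found = []
    for sign, digits, frac in rx.findall(text):
        whole = int(digits) if digits else 0
        value = whole + fractions[frac]
        found.append(-value if sign == '-' else value)
    return found

result = parse_mixed('15\u2151 and \u00bc and +212\u00bd and -\u215c')
print(result)
```

This version only finds numbers that contain a fraction character; plain integers such as -115 still need the ordinary pattern from the question.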

If you really want to use a regex, you can refer to the character by its Unicode escape, as shown below. The Unicode name of this character is "VULGAR FRACTION ONE HALF" (U+00BD); for more details see here.

 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 import re
 txt = u'-½ -103+½ -113-½ -105+½ -115-½ -105+½ -115 My test for Fraction -1½ -115'
 print ''.join(re.findall(u'[+-]?[\d+]?\u00BD?',txt))
 # for replacing
 print re.sub(ur'\u00BD',ur'.5',txt)

Output -

 -½-103-113-½-105-115-½-105-115-1½-115
 -.5 -103+.5 -113-.5 -105+.5 -115-.5 -105+.5 -115 My test for Fraction -1.5 -115

NB: You can change the script as you like, but for other fractions you will need to change the VULGAR FRACTION code point; the code points are listed at the link given above.
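If the link is not at hand, the character names and values can also be looked up with the standard `unicodedata` module. This is a small sketch, not part of the original answer; the variable names are arbitrary:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import unicodedata

half = '\u00bd'
print(unicodedata.name(half))     # official character name
print(unicodedata.numeric(half))  # numeric value from the Unicode database

# Going the other way: find a character from its official name.
quarter = unicodedata.lookup('VULGAR FRACTION ONE QUARTER')
print(repr(quarter), unicodedata.numeric(quarter))
```

The same calls work for any other fraction character, so its escape sequence can be substituted into the regex above.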


Source: https://habr.com/ru/post/1241481/

