Issue
I am scraping this link with BeautifulSoup4
I am parsing the page HTML like this:
page = BeautifulSoup(page.replace('ISO-8859-1', 'utf-8'),"html5lib")
You can see values like these: -4 -115 (separated by -).
I want both values in a list, so I am using this regex:
value = re.findall(r'[+-]?\d+', value)
It works perfectly, but not for values like +2½ -102, for which I only get [-102].
To tackle this, I tried this too:
value = value.replace("½","0.5")
value = re.findall(r'[+-]?\d+', value)
but this gives me an error about encoding, saying I have to set the encoding of my file.
I also tried setting encoding=utf-8 at the top of the file, but it still gives the same error.
How do I convert ½ to 0.5?
Solution
To embed Unicode literals like ½ in your Python 2 script you need to use a special comment at the top of your script that lets the interpreter know how the Unicode has been encoded. If you want to use UTF-8 you will also need to tell your editor to save the file as UTF-8. And if you want to print Unicode text make sure your terminal is set to use UTF-8, too.
Here's a short example, tested on Python 2.6.6
# -*- coding: utf-8 -*-
value = "a string with fractions like 2½ in it"
value = value.replace("½",".5")
print(value)
output
a string with fractions like 2.5 in it
Note that I'm using ".5" as the replacement string; using "0.5" would convert "2½" to "20.5", which would not be correct.
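To make the difference concrete, here is a minimal sketch that runs both replacements on one of the values from the question. It uses Unicode literals so that it behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

value = u"+2½ -102"

# ".5" is appended to the leading digit, so the integer part is preserved.
good = value.replace(u"½", u".5")
print(good)  # +2.5 -102

# "0.5" glues an extra zero onto the integer part.
bad = value.replace(u"½", u"0.5")
print(bad)   # +20.5 -102
```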
Actually, those strings should be marked as Unicode strings, like this:
# -*- coding: utf-8 -*-
value = u"a string with fractions like 2½ in it"
value = value.replace(u"½", u".5")
print(value)
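The replace call above handles only ½. If the page can also contain other fractions such as ¼ or ¾ (an assumption; the question shows only ½), one possible approach is to let the standard unicodedata module supply the numeric value of the fraction character. The names frac_pat and expand_fraction below are made up for this sketch.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re
import unicodedata

# Optional sign, optional integer part, then one vulgar-fraction character.
frac_pat = re.compile(u"([-+]?)(\\d*)([¼½¾])", re.U)

def expand_fraction(match):
    """Rewrite a match such as u'+2½' as the decimal string u'+2.5'."""
    sign, whole, frac = match.groups()
    # unicodedata.numeric(u"½") is 0.5; a missing integer part counts as 0.
    total = int(whole or 0) + unicodedata.numeric(frac)
    return sign + repr(total)

converted = frac_pat.sub(expand_fraction, u"+2½ -102 -¾")
print(converted)  # +2.5 -102 -0.75
```

Plain integers such as -102 are left untouched because the pattern requires a fraction character.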
For further information on using Unicode in Python, please see Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
I should also mention that you will need to change your regex pattern so that it allows a decimal point in numbers, e.g.:
# -*- coding: utf-8 -*-
from __future__ import print_function
import re
pat = re.compile(r'[-+]?(?:\d*?[.])?\d+', re.U)
data = u"+2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114"
print(data)
print(pat.findall(data.replace(u"½", u".5")))
output
+2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114
[u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-102', u'-2.5', u'-114']
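The matches above are still strings. If you need actual numbers, a small helper along these lines should do it. This is only a sketch: parse_values is a name I've made up here, and it assumes ½ is the only fraction that appears in the data.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

pat = re.compile(r'[-+]?(?:\d*?[.])?\d+', re.U)

def parse_values(text):
    """Return the numbers found in text as floats, treating ½ as .5."""
    cleaned = text.replace(u"½", u".5")
    return [float(s) for s in pat.findall(cleaned)]

print(parse_values(u"+2½ -102"))  # [2.5, -102.0]
print(parse_values(u"-4 -115"))   # [-4.0, -115.0]
```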
Answered By - PM 2Ring