Issue
The following unicode string and byte string can each exist on their own if defined explicitly:
>>> value_str='Andr\xc3\xa9'
>>> value_uni=u'Andr\xc3\xa9'
If I only have u'Andr\xc3\xa9' assigned to a variable like above, how do I convert it to 'Andr\xc3\xa9' in Python 2.5 or 2.6?
EDIT:
I did the following:
>>> value_uni.encode('latin-1')
'Andr\xc3\xa9'
which fixes my issue. Can someone explain to me what exactly is happening?
Solution
You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9', which is equivalent to 'André'.
But what you have seems to be UTF-8 encoded data that has been incorrectly decoded. You can fix it by converting the unicode string back to an ordinary byte string. I'm not sure what the best way is, but this seems to work:
>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
'Andr\xc3\xa9'
Then decode it correctly:
>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
u'Andr\xe9'
Now it is in the correct format.
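The two steps above can also be written with the codecs directly. Latin-1 maps every code point below 256 to the byte with the same value, which is why the encode('latin-1') call in the question's edit produces the same bytes as the chr(ord(c)) join. The following is a minimal sketch rather than a definitive fix; the helper name repair_mojibake is hypothetical, and the code is written so that it should run on Python 2.6+ as well as Python 3:

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded one byte per character."""
    # Latin-1 maps code points 0-255 to identical byte values,
    # so encoding with it recovers the original raw bytes.
    raw_bytes = text.encode('latin-1')
    # Decode the recovered bytes with the codec they were written in.
    return raw_bytes.decode('utf-8')


mis_decoded = u'Andr\xc3\xa9'
fixed = repair_mojibake(mis_decoded)
print(repr(fixed))
```

Note that encode('latin-1') raises UnicodeEncodeError if the string contains any code point above 255, so this only applies to strings that really are byte values in disguise.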
However, instead of doing this, you should if possible try to work out why the data was incorrectly decoded in the first place, and fix the problem there.
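As an illustration of how such a value can arise, the sketch below decodes the same UTF-8 bytes with two different codecs. That the data in the question was produced by a Latin-1 decode is an assumption; it is simply one common cause that yields exactly this result. The variable names are illustrative only:

```python
# UTF-8 bytes as they might arrive from a file, socket or database.
raw_bytes = b'Andr\xc3\xa9'

# Wrong codec: each byte becomes its own character (6 characters).
wrong = raw_bytes.decode('latin-1')

# Right codec: the byte pair C3 A9 becomes the single character U+00E9.
right = raw_bytes.decode('utf-8')

print(repr(wrong))
print(repr(right))
```

If the wrong decode happens in your own code, changing the codec at that point removes the need for any later repair step.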
Answered By - Mark Byers