Python - Converting a string which contains both UTF-8 encoded bytestrings and codepoints to a UTF-8 encoded string


I'm getting a JSON response from an API that looks like this:

{"excerpt":"...where we\u00e2\u0080\u0099ll have wait , see, i\u00e2\u0080\u0099m sure official announcements start flowing in coming months \u2013special\u2013..."} 

This is the raw JSON response returned by the API call. As you can see, there are codepoints in the JSON document, as there should be when transferring Unicode data. However, the API response is returning the wrong codepoints, because the 'excerpt' starts with "... we’ll ..." in the original source that this excerpt belongs to. The \u00e2\u0080\u0099 sequence is used to represent the ’ (right single quotation mark) character, but that character's codepoint is actually \u2019, and its equivalent bytestring encoded in UTF-8 is \xe2\x80\x99. So the API is returning the respective bytestring instead of the codepoint. The other problem is that the response also contains correct codepoints, like \u2013 (dash character) in the response above, and that makes my code unable to handle both situations.
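The relationship between \u2019, \xe2\x80\x99 and \u00e2\u0080\u0099 can be checked directly. The sketch below assumes (the API's internals are not known) that the server treated each UTF-8 byte as a separate Latin-1 character before escaping it. The variable names are illustrative only.

```python
# The character in the original article: RIGHT SINGLE QUOTATION MARK.
quote = u'\u2019'

# Its UTF-8 encoding is the three bytes E2 80 99.
utf8_bytes = quote.encode('utf-8')

# Reading each byte as its own Latin-1 character yields three
# code points: U+00E2, U+0080, U+0099 (what the API sends).
mangled = utf8_bytes.decode('latin-1')

# Reversing the two steps recovers the intended character.
repaired = mangled.encode('latin-1').decode('utf-8')

print(repaired == quote)  # True
```

If that assumption holds, the damage is reversible for every affected character, which is what makes a repair on the client side feasible.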

I have to fetch some fields from this response (probably using json.loads, which converts \u00e2\u0080\u0099 to \xe2\x80\x99 but does nothing to \u2013), concatenate these fields, and send the result to another library, which eventually uses urllib.urlencode to encode the result as a valid UTF-8 URL parameter for sending to another API.
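To make the mixed result concrete, here is a small sketch of what json.loads produces for a shortened version of the response. The `raw` string is a trimmed, hypothetical stand-in for the real API payload.

```python
import json

# Shortened stand-in for the API response (backslashes doubled so the
# \uXXXX escapes reach the JSON parser untouched).
raw = '{"excerpt": "we\\u00e2\\u0080\\u0099ll \\u2013special\\u2013"}'

data = json.loads(raw)
excerpt = data['excerpt']

# List every non-ASCII code point in the decoded text.
code_points = [hex(ord(ch)) for ch in excerpt if ord(ch) > 127]
print(code_points)
# ['0xe2', '0x80', '0x99', '0x2013', '0x2013']
```

The first three code points are the UTF-8 bytes of ’ stored as separate characters, while the two dashes are already correct, so encoding or decoding the whole string in one step cannot fix both.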

So here is my question: is there a way to convert a string that contains both UTF-8 bytestrings and Unicode codepoints (this is the result of doing json.loads) into another string that contains only codepoints or only UTF-8 bytestrings, so that I can use it in urllib.urlencode? Or might there be a solution that works before doing json.loads? Note: I'm using Python 2.6.1.

I have already contacted the API owners and informed them that they should use valid codepoints instead of bytestrings, but I'm not sure when they will get back to me, so I'm trying to come up with a solution for the current situation.

Any help would be appreciated.

You could use a regular expression to identify "UTF-8-like" Unicode sequences and convert them into the correct Unicode character:

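    # Python 2 code (the question targets Python 2.6.1)
    import re

    d = {"excerpt":"...where we\u00e2\u0080\u0099ll have wait , see, i\u00e2\u0080\u0099m sure official announcements start flowing in coming months \u2013special\u2013..."}
    s = d['excerpt']
    print s

    # Turn the literal \uXXXX escapes in the byte string into Unicode characters.
    s = s.decode('unicode-escape')
    print s

    # Find a UTF-8 lead byte followed by continuation bytes (stored as Latin-1
    # characters), turn the run back into bytes, and decode it as UTF-8.
    print re.sub(ur'[\xc2-\xf4][\x80-\xbf]+',lambda m: m.group(0).encode('latin1').decode('utf8'),s)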

Output:

    ...where we\u00e2\u0080\u0099ll have wait , see, i\u00e2\u0080\u0099m sure official announcements start flowing in coming months \u2013special\u2013...
    ...where weâll have wait , see, iâm sure official announcements start flowing in coming months –special–...
    ...where we’ll have wait , see, i’m sure official announcements start flowing in coming months –special–...

Update:

From your comment, the dictionary already contains Unicode strings, so the \u2013 characters print correctly (see the first print output below) and decode('unicode-escape') can be skipped. The re.sub statement still works:

    import re

    d = {u'excerpt':u'...where we\xe2\x80\x99ll have wait , see, i\xe2\x80\x99m sure official announcements start flowing in coming months \u2013special\u2013...'}
    s = d[u'excerpt']
    print s
    print re.sub(ur'[\xc2-\xf4][\x80-\xbf]+',lambda m: m.group(0).encode('latin1').decode('utf8'),s)

Output:

    ...where weâll have wait , see, iâm sure official announcements start flowing in coming months –special–...
    ...where we’ll have wait , see, i’m sure official announcements start flowing in coming months –special–...
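The substitution above can also be wrapped in a helper and followed by the urlencode step mentioned in the question. This is only a sketch: `fix_mojibake`, `_repair` and the `q` parameter name are hypothetical, and the try/except around the decode is an addition for matched runs that turn out not to be valid UTF-8. The pattern is written with \u escapes rather than ur'' so that it also runs on Python 3.

```python
import re

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# A UTF-8 lead byte followed by continuation bytes, each stored as a
# single Latin-1 character.
UTF8_LIKE = re.compile(u'[\u00c2-\u00f4][\u0080-\u00bf]+')


def _repair(match):
    chunk = match.group(0)
    try:
        # Back to raw bytes, then decode those bytes as UTF-8.
        return chunk.encode('latin-1').decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8 after all: leave the text unchanged.
        return chunk


def fix_mojibake(text):
    """Replace UTF-8-like character runs with the characters they encode."""
    return UTF8_LIKE.sub(_repair, text)


fixed = fix_mojibake(u'we\u00e2\u0080\u0099ll \u2013special\u2013')

# Encode to UTF-8 bytes first; Python 2's urlencode cannot handle
# non-ASCII unicode objects directly.
query = urlencode({'q': fixed.encode('utf-8')})
print(query)
# q=we%E2%80%99ll+%E2%80%93special%E2%80%93
```

One caveat: text that legitimately contains a character in the U+00C2 to U+00F4 range followed by a C1 control character would also be rewritten, although that combination is rare in ordinary prose.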
