Thursday, March 13, 2008

 

python and unicode

Python is pretty good with unicode. But you need to clearly understand how it works first!

[Two worlds:]
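The world is divided in two: 'unicode' strings on one side, and byte strings in some encoding ('ascii', 'iso-8859-1', 'utf-8', 'utf-16'...) on the other.
A unicode string is a sequence of abstract characters (code points); an encoding is one way of writing those characters down as bytes. The two are completely different things.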
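
To make the split concrete, here is a small sketch (my own illustration, not taken from the sources below) that writes the same unicode character down in three encodings. The variable names are made up, and I use 'utf-16-le' rather than plain 'utf-16' only to keep the byte order mark out of the picture.

```python
# One unicode character: "a with grave accent", code point U+00E0.
text = u"\xe0"

# The same character written down as bytes in three different encodings.
encodings = ["iso-8859-1", "utf-8", "utf-16-le"]
encoded = dict((name, text.encode(name)) for name in encodings)

for name in encodings:
    # The byte length changes with the encoding; the character does not.
    print("%s -> %r (%d bytes)" % (name, encoded[name], len(encoded[name])))
```

The character is the same each time; only the bytes differ.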

[Your default encoding:]
- To see your current default encoding, run "import sys" then "sys.getdefaultencoding()" from the python shell
- you cannot change it from your own script at run time: "sys.setdefaultencoding" is removed from the sys module once Python has started, so you need to change it in a configuration script.
If you want to set 'utf-8' as the default encoding, create a new file called 'sitecustomize.py' and save it under '/usr/lib/python2.5/site-packages/'. The file name matters: Python imports a module with this exact name automatically at startup.
In the file, write: "import sys" and "sys.setdefaultencoding('utf-8')"
Restart Python
Now "sys.getdefaultencoding()" should return "utf-8"
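
Put together, the sitecustomize.py described above could look like the sketch below. The hasattr() guard is my own addition, an assumption to keep the file harmless on interpreters that do not provide sys.setdefaultencoding at all.

```python
# sitecustomize.py
import sys

# sys.setdefaultencoding is only reachable while Python is starting up.
# The hasattr() guard is an added precaution (assumption): where the
# function does not exist, the file simply does nothing.
if hasattr(sys, "setdefaultencoding"):
    sys.setdefaultencoding("utf-8")
```

After restarting, calling sys.getdefaultencoding() in the shell tells you whether the change was picked up.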

[Encoding of script files:]
- Each script file can have its own encoding.
To do so, put "coding: XXX" in a single-line comment on either the first or second line of your script file
ex: # this file encoding: utf-8
With this declaration, Python automatically reads the characters of the file as utf-8

more explanation:
- let's say your default encoding is 'ascii'
- your script file is saved in UTF-8 but you do not specify the encoding at the top of the file
- in your script file, you have the following string: "my name is rémy"
If you import this script and run the function that uses this string, Python cannot interpret the bytes of "é" as ascii and you will end up with an error (a UnicodeDecodeError, or on Python 2.5 a SyntaxError complaining about a non-ASCII character with no encoding declared)

- now if you specify the encoding in one of the first two lines (ex: # encoding: utf8), the error disappears.
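
The underlying failure can be reproduced without any script file. This sketch assumes that raw holds the bytes a UTF-8 editor would save for that string; the variable names are mine.

```python
# Bytes as a UTF-8 editor would store "my name is r<e acute>my" on disk.
raw = b"my name is r\xc3\xa9my"

# Reading those bytes as ascii fails on the first non-ascii byte.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Reading them as utf-8 succeeds: the two bytes become one character.
decoded = raw.decode("utf-8")

print("ascii decoding worked: %s" % ascii_ok)
print("%d bytes, %d characters" % (len(raw), len(decoded)))
```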

[Create unicode strings:]
You have three ways to create unicode strings:
- prepending "u" to your string
ex: u"hello", u"my name is rémy"
- using the unicode() function
This function takes an optional second argument, the encoding of the input text, to use when it differs from the current default encoding
ex: unicode("hello"), unicode("my name is rémy"), unicode('\xff\xfe\xe9\x00', 'utf-16') # returns u'\xe9', i.e. "é"
- using the decode() method of any string
ex: "hello".decode(), "my name is rémy".decode(), '\xff\xfe\xe9\x00'.decode('utf-16')
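
Here are the three ways side by side, as a sketch. The try/except shim at the top is an assumption of mine so that the snippet also runs on interpreters where the unicode() builtin is missing; on Python 2 it changes nothing.

```python
# Portability shim (assumption): fall back to str where unicode() is absent.
try:
    unicode
except NameError:
    unicode = str

# The utf-16 bytes for "e acute": byte order mark '\xff\xfe', then the character.
utf16_bytes = b"\xff\xfe\xe9\x00"

from_literal = u"\xe9"                          # 1. u"" prefix
from_function = unicode(utf16_bytes, "utf-16")  # 2. unicode() with an encoding
from_decode = utf16_bytes.decode("utf-16")      # 3. decode() method

print(from_literal == from_function == from_decode)
```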

[Unicode usage:]
Once you have a unicode string, you can do everything you do with other strings, including comparing it with a non-unicode string! Python converts the non-unicode string to unicode with the default encoding before comparing, so everything is transparent.
ex (my default encoding is utf-8 so "à" is an allowed char):
>> a = "à" # utf-8 string
>> b = a.decode() # unicode string
>> a == b
True
>> a
'\xc3\xa0' # corresponds to the utf-8 bytes for "à"
>> b
u'\xe0' # corresponds to the unicode code point for "à"

Python compared the two strings by their characters, independently of their respective encodings, which is the behavior we expected.
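
The session above works because my default encoding is utf-8. A sketch that does not depend on the default encoding names it explicitly in decode(); the extra variables are only there to show what each side contains.

```python
a = b"\xc3\xa0"        # utf-8 bytes for "a grave"
b = a.decode("utf-8")  # unicode string, encoding named explicitly

byte_count = len(a)             # the encoded form takes two bytes
char_count = len(b)             # the unicode string holds one character
code_point = ord(b)             # its code point
same_character = (b == u"\xe0")

print("%d bytes, %d character, code point %s" % (byte_count, char_count, hex(code_point)))
print(same_character)
```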

[From unicode string to encoded string:]
At some point, you may need to convert back from unicode to a specific encoding (to dump content into a file, a stream...).
Simply use the .encode() method available on all strings, and specify the encoding you want; otherwise the default encoding is used.
>> a = "à" # utf8 encoding
>> b = a.decode() #unicode
>> c = b.encode() # back to utf8
>> d = b.encode('utf-16')
>> c
'\xc3\xa0' # corresponds to the utf-8 bytes for "à", same output as ">> a"
>> d
'\xff\xfe\xe0\x00' # utf-16 bytes for "à" (the leading '\xff\xfe' is the byte order mark)

equality check:
>> a == b
True # OK, as said before, python transparently converts to unicode so equality is correct
>> a == c
True # OK, same character in same encoding
>> c == d
False # Interesting!! The encoding is different: c is in utf-8, d is in utf-16, so equality fails! Neither of them is unicode, so python does no conversion and simply compares the raw bytes
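
If you do need to know whether two encoded strings hold the same text, one option is to decode both to unicode first. A minimal sketch, assuming you know which encoding each side uses:

```python
text = u"\xe0"            # unicode "a grave"
c = text.encode("utf-8")
d = text.encode("utf-16")

# Comparing the encoded forms compares raw bytes, which differ.
bytes_equal = (c == d)

# Decoding each side with its own encoding compares the characters.
text_equal = (c.decode("utf-8") == d.decode("utf-16"))

print("bytes equal: %s, text equal: %s" % (bytes_equal, text_equal))
```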

[To remember:]
str.decode([encoding]) = goes from the specified/default encoding to unicode
unicode.encode([encoding]) = goes from unicode to the specified/default encoding
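
Chaining the two gives a simple way to convert from one encoding to another, always passing through unicode. The recode() helper below is hypothetical, written only for this illustration.

```python
def recode(data, source_encoding, target_encoding):
    """Convert encoded bytes to another encoding, going through unicode."""
    return data.decode(source_encoding).encode(target_encoding)

# "r<e acute>my" stored as iso-8859-1: the accented letter is a single byte.
latin1_bytes = b"r\xe9my"
utf8_bytes = recode(latin1_bytes, "iso-8859-1", "utf-8")

print("%d bytes in iso-8859-1, %d bytes in utf-8" % (len(latin1_bytes), len(utf8_bytes)))
```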

[sources:]
src: http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html
book diveintopython, section 9.4 (freely available at www.diveintopython.org)
http://www.reportlab.com/i18n/python_unicode_tutorial.html
http://www.python.org/dev/peps/pep-0263/
