Cixar

The Sourcerer

by Kris Kowal.

Thu, 21 Feb 2008

Unicode

What I've learned today—

In order for a program or library to operate in a Unicode-compatible fashion, all of its internal strings must be Unicode. Input must be decoded into Unicode as early as possible, and output must be encoded out of Unicode at the very last possible moment. This is because, outside of Unicode strings, the encoding is not part of a string's type, and encoding information does not generally survive crossing API boundaries. On top of that, regular expressions don't play well against byte strings in variable-width encodings, where a single character can span several bytes.
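Here is a rough sketch of that idea, written for Python 3, where str is the Unicode type and bytes is the byte-string type. The helper names (read_text, write_text, shout) are made up for illustration, and UTF-8 is just an assumed encoding for both input and output.

```python
import re


def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Input boundary: decode bytes into a Unicode string right away."""
    return raw.decode(encoding)


def write_text(text: str, encoding: str = "utf-8") -> bytes:
    """Output boundary: encode to bytes at the last possible moment."""
    return text.encode(encoding)


def shout(text: str) -> str:
    """Hypothetical transform; it only ever sees Unicode strings."""
    return text.upper()


# Pretend these bytes arrived from a file or socket ("caf\u00e9" is "cafe" with an acute accent).
raw_in = "caf\u00e9".encode("utf-8")

text = read_text(raw_in)            # decode on the way in
raw_out = write_text(shout(text))   # encode on the way out

# The same pattern behaves differently on characters and on bytes:
# "." matches one character in a str, but only one byte in a bytes object,
# and the accented letter takes two bytes in UTF-8.
char_match = re.match(r"caf.$", text)
byte_match = re.match(rb"caf.$", raw_in)

print(len(text), len(raw_in))
print(char_match is not None, byte_match is not None)
```

Everything between read_text and write_text can ignore encodings entirely, which is the whole point of pushing them out to the edges.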

Django works entirely in Unicode and encodes a string to UTF-8 only at the last moment. I needed Python's Textile library to transform text, but it only dealt with byte strings. It turned out to be quick work to change all of its strings to Unicode and not bother with encoding and decoding inside the library.

this entry was posted on Thu, 21 Feb 2008 at 20:10 in program