Links
- Unicode in Python
- stackoverflow.com: How to use list of python objects whose representation is unicode
- How to Use UTF-8 with Python
- Unicode for python identifiers
- Supported in python 3.
- See PEP 3131: Supporting Non-ASCII Identifiers
- Allowed characters – not exhaustive
- More About Unicode in Python 2 and 3 | Armin Ronacher
- Bugs
- bug 4947 and bugfix
- sys.stdout fails to use default encoding as advertised.
- Fixed in Python 2.7 but not backported to Python 2.6.
- The bug:
print >>my_file, my_unicode # <- is encoded with my_file.encoding
my_file.write(my_unicode) # <- is encoded with my_file.encoding
print my_unicode # <- works; is encoded with sys.stdout.encoding
sys.stdout.write(my_unicode) # <- is encoded with sys.getdefaultencoding()
- sys.stdout:
- Even if your terminal is UTF-8 and things magically appear to work, they may break when you're piping the output.
- Under Python 2, treat stdin and stdout as byte streams.
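The advice to treat stdout as a byte stream can be sketched as follows. This is a minimal illustration written for Python 3, where `io.BytesIO` stands in for a piped stdout; the helper name `write_encoded` is hypothetical and does not come from the bug report.

```python
import io


def write_encoded(byte_stream, text, encoding="utf-8"):
    """Encode text explicitly, then write the resulting bytes."""
    data = text.encode(encoding)
    byte_stream.write(data)
    return data


# io.BytesIO stands in for a pipe (an assumption made for this demo).
pipe = io.BytesIO()
written = write_encoded(pipe, u"\u0394\n")

# Falling back to ASCII, as the buggy sys.stdout.write() effectively did,
# raises instead of producing output.
try:
    u"\u0394".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

In Python 3 the byte-oriented counterpart of `sys.stdout` is `sys.stdout.buffer`, so the same helper could be pointed at that instead of the stand-in.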
Notes
Specify the encoding of files
Ref: PEP 263: Defining Python Source Code Encodings
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
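One way to check which encoding Python would pick up from such a header is `tokenize.detect_encoding` from the Python 3 standard library. The sketch below feeds it the two header lines as raw bytes; the `source` contents are made up for the demo.

```python
import io
import tokenize

# The header lines as raw bytes, the way they would sit on disk.
source = b"#!/usr/bin/env python\n# -*- coding: UTF-8 -*-\nprint('hi')\n"

# detect_encoding() looks at no more than the first two lines for a coding cookie.
encoding, header_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
```

The returned name is normalized, so a `UTF-8` cookie comes back as `utf-8`.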
Specifying unicode strings
>>> u"\N{GREEK CAPITAL LETTER DELTA}" # Using the character name
'\u0394'
>>> u"\u0394" # Using a 16-bit hex value
'\u0394'
>>> u"\U00000394" # Using a 32-bit hex value
'\u0394'
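As a quick check that the three spellings denote the same character, here is a small Python 3 sketch. The variable names are illustrative only; `unicodedata.name` is used to recover the character name.

```python
import unicodedata

by_name = "\N{GREEK CAPITAL LETTER DELTA}"  # character name
by_16bit = "\u0394"                         # 16-bit hex escape
by_32bit = "\U00000394"                     # 32-bit hex escape

# All three literals produce the same one-character string.
same = by_name == by_16bit == by_32bit
name = unicodedata.name(by_16bit)
code_point = hex(ord(by_name))
```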
Convert a byte string that has somehow ended up typed as unicode back to str.
# `ISO-8859-1` aka `Latin-1` is the only encoding whose
# 256 characters are identical to the 256 first characters of
# Unicode.
import codecs
# latin_1_encode returns a (bytes, length) tuple; keep only the bytes.
str_string = codecs.latin_1_encode(unicode_string_which_is_actually_binary)[0]
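The Latin-1 property that this trick relies on can be verified directly. The following Python 3 sketch decodes every possible byte value and checks that each one maps to the code point with the same number, so the round trip is lossless.

```python
# Every possible byte value, 0x00 through 0xFF.
raw = bytes(bytearray(range(256)))

# Decoding as Latin-1 maps byte N to code point N.
as_text = raw.decode("latin-1")
code_points_match = [ord(ch) for ch in as_text] == list(range(256))

# Encoding back recovers the original bytes exactly.
round_tripped = as_text.encode("latin-1")
```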
The various encodings
locale.getpreferredencoding()
sys.getfilesystemencoding()
sys.stdin.encoding / sys.stdout.encoding / sys.stderr.encoding
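To see how these values differ on a given machine, they can be collected in one place. This is a rough sketch; the function name `report_encodings` and the dictionary keys are made up for illustration, and a stream's encoding may be `None` when the stream is missing or not a terminal.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports for this process."""
    report = {
        "preferred": locale.getpreferredencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "default": sys.getdefaultencoding(),
    }
    streams = {"stdin": sys.stdin, "stdout": sys.stdout, "stderr": sys.stderr}
    for name, stream in streams.items():
        # A stream can be None or lack the attribute, so fall back to None.
        report[name] = getattr(stream, "encoding", None)
    return report


encodings = report_encodings()
for key in sorted(encodings):
    print("%-10s %s" % (key, encodings[key]))
```

Running it once in a terminal and once with the output piped is a simple way to observe the difference the notes warn about.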
Opening files.
import io
infile = io.open('UTF-8.txt', 'rt', encoding='UTF-8')

import codecs
codecs.open('UTF-8.txt', 'rt', encoding='UTF-8')

# python 3
open(filename, 'r', encoding='UTF-8')
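A short round trip shows what the `encoding` argument does. The sketch below writes a temporary file with `io.open`, reads it back as text, and then reads the raw bytes. The temporary directory and sample text are assumptions of the demo.

```python
import io
import os
import tempfile

text = u"\u0394 is delta\n"

# A temporary directory keeps the demo from touching real files.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "UTF-8.txt")

# Text mode with an explicit encoding: str in, UTF-8 bytes on disk.
with io.open(path, "wt", encoding="UTF-8") as outfile:
    outfile.write(text)

# Reading with the same encoding gives the original text back.
with io.open(path, "rt", encoding="UTF-8") as infile:
    read_back = infile.read()

# Binary mode shows the encoded bytes without any decoding.
with io.open(path, "rb") as rawfile:
    raw_bytes = rawfile.read()

os.remove(path)
os.rmdir(tmp_dir)
```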
sys.stdout's encoding
import codecs
import sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
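What the `codecs.getwriter` wrapper does can be seen without touching the real stdout. In this Python 3 sketch an `io.BytesIO` object stands in for a byte-oriented stdout (an assumption of the demo), and the wrapper encodes text on its way in.

```python
import codecs
import io

# Stand-in for a byte-oriented stdout.
byte_sink = io.BytesIO()

# The writer encodes text to UTF-8 before passing it to the wrapped stream.
writer = codecs.getwriter("UTF-8")(byte_sink)
writer.write(u"\u0394\n")

encoded = byte_sink.getvalue()
```

In Python 3, `sys.stdout` is already a text stream, so the equivalent would wrap `sys.stdout.buffer`, or, on Python 3.7 and later, call `sys.stdout.reconfigure(encoding='utf-8')`.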
sys.stdin's encoding
For Python 3, refer: sys.stdin docs
Refer:
- Python 2: Why do I have to do:
sys.stdin = codecs.getreader(sys.stdin.encoding)(sys.stdin)
- Or
sys.stdin = codecs.getreader(locale.getpreferredencoding())(sys.stdin)
- This is because reads do not perform conversion – you get bytes (the encoding attribute doesn't do anything – it only affects writes).
- Python 3: How to specify stdin encoding
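The difference between a plain read and a read through `codecs.getreader` can be illustrated with a stand-in stream. The sketch below runs on Python 3 and uses `io.BytesIO` in place of a Python 2 style byte-oriented stdin, which is an assumption of the demo.

```python
import codecs
import io

# Stand-in for a byte-oriented stdin holding UTF-8 encoded input.
byte_source = io.BytesIO(b"\xce\x94\n")

# A plain read performs no conversion: the result is bytes.
plain_read = byte_source.read()

# Wrapping the stream in a reader decodes the bytes on the way out.
byte_source.seek(0)
reader = codecs.getreader("UTF-8")(byte_source)
decoded = reader.read()
```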
Python 3
Python 3 does not expect ASCII from sys.stdin. It'll open
stdin in text mode and make an educated guess as to what
encoding is used. That guess may come down to ASCII, but
that is not a given. See the sys.stdin
documentation.
Like other file objects opened in text mode, the sys.stdin
object derives from the io.TextIOBase base class; it has a
.buffer attribute pointing to the underlying buffered IO
instance (which in turn has a .raw attribute).
Wrap the sys.stdin.buffer attribute in a new io.TextIOWrapper() instance to specify a different encoding:
import io
import sys
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
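The same wrapping can be tried without real input. In this sketch `io.BytesIO` plays the role of `sys.stdin.buffer` (an assumption of the demo), and the bytes are deliberately Latin-1 rather than UTF-8 to show that the wrapper's `encoding` argument decides how they are decoded.

```python
import io

# Stand-in for sys.stdin.buffer; 0xE9 is "e with acute accent" in Latin-1.
fake_stdin_buffer = io.BytesIO(b"caf\xe9\n")

# The wrapper decodes with the encoding it is given, not a guessed one.
input_stream = io.TextIOWrapper(fake_stdin_buffer, encoding="latin-1")
line = input_stream.readline()
```

Decoding the same bytes as UTF-8 would raise `UnicodeDecodeError`, because a lone `0xE9` byte is not valid UTF-8.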