unicode 




Notes

Specify the encoding of files

Ref:  PEP 263: Defining Python Source Code Encodings

#!/usr/bin/env python  
# -*- coding: UTF-8 -*-

Specifying unicode strings

>>> u"\N{GREEK CAPITAL LETTER DELTA}"  # by character name; result: u'\u0394'
>>> u"\u0394"                          # by 16-bit hex value; result: u'\u0394'
>>> u"\U00000394"                      # by 32-bit hex value; result: u'\u0394'
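A quick check that the three spellings denote the same single code point. This is a minimal sketch; the variable names are illustrative, and the `u` prefix is accepted by Python 2 and by Python 3.3+.

```python
import unicodedata

by_name = u"\N{GREEK CAPITAL LETTER DELTA}"
by_16bit = u"\u0394"
by_32bit = u"\U00000394"

# All three literals produce the same one-character string.
same = by_name == by_16bit == by_32bit

# unicodedata maps the character back to its official name.
name = unicodedata.name(by_name)
```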

Convert a unicode string that actually holds binary data back to a byte string (str in Python 2).

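# `ISO-8859-1` (aka `Latin-1`) is the only encoding whose
# 256 characters are identical to the first 256 characters of
# Unicode, so it maps code points 0-255 back to the same byte values.
import codecs
# latin_1_encode() returns a (bytes, length_consumed) tuple; keep only the bytes.
str_string, _ = codecs.latin_1_encode(unicode_string_which_is_actually_binary)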
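To see why Latin-1 works for this, the sketch below round-trips every possible byte value. The names `raw`, `mistyped` and `recovered` are illustrative only, and the plain `.encode('latin-1')` method is used here as an equivalent of the codecs function.

```python
# Every byte value 0..255, as a byte string.
raw = bytes(bytearray(range(256)))

# Simulate "binary data typed as unicode": each byte becomes the
# code point with the same number.
mistyped = raw.decode('latin-1')

# Encoding with Latin-1 maps each code point back to the same byte.
recovered = mistyped.encode('latin-1')
```

Any other single-byte encoding would either fail on some byte values or map them to different code points, so the round trip would not be lossless.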

The various encodings

locale.getpreferredencoding()
sys.getfilesystemencoding()
sys.stdin.encoding / sys.stdout.encoding / sys.stderr.encoding
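A small sketch that collects these values in one place, which can help when debugging encoding problems on an unfamiliar machine. The dictionary keys are arbitrary labels; `getattr` is used because a stream may be missing or replaced by an object without an `encoding` attribute.

```python
import locale
import sys

encodings = {
    'preferred': locale.getpreferredencoding(),
    'filesystem': sys.getfilesystemencoding(),
    'stdin': getattr(sys.stdin, 'encoding', None),
    'stdout': getattr(sys.stdout, 'encoding', None),
    'stderr': getattr(sys.stderr, 'encoding', None),
}

# Print one "label value" pair per line, in a stable order.
for label, value in sorted(encodings.items()):
    print('%-10s %s' % (label, value))
```

The values depend on the platform, the locale settings and whether the streams are attached to a terminal, so they can differ between runs.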

Opening files.

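# Python 2.6+ and Python 3 (in Python 3, io.open is the built-in open)
import io
infile = io.open('UTF-8.txt', 'rt', encoding='UTF-8')

# Older alternative. codecs.open() always opens the underlying file in
# binary mode, so pass 'r' rather than 'rt'.
import codecs
infile = codecs.open('UTF-8.txt', 'r', encoding='UTF-8')

# Python 3
infile = open('UTF-8.txt', 'r', encoding='UTF-8')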
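A round-trip sketch, assuming a throwaway temporary file rather than a real `UTF-8.txt`: text is written with an explicit encoding, read back as text, and read once more as raw bytes to show what is actually on disk.

```python
import io
import os
import tempfile

text = u"\u0394 is GREEK CAPITAL LETTER DELTA"

# A temporary path, so the sketch does not touch real files.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with io.open(path, 'w', encoding='UTF-8') as outfile:
        outfile.write(text)
    with io.open(path, 'r', encoding='UTF-8') as infile:
        decoded = infile.read()      # text, decoded from UTF-8
    with io.open(path, 'rb') as rawfile:
        raw = rawfile.read()         # the encoded bytes on disk
finally:
    os.remove(path)
```

Here `decoded` equals `text`, while `raw` starts with the two bytes `CE 94`, the UTF-8 encoding of U+0394.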

sys.stdout's encoding

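import codecs
import sys
# Python 2: sys.stdout accepts byte strings, so wrap it directly.
# (On Python 3, wrap sys.stdout.buffer instead.)
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)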
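What the writer does can be seen without touching the real stdout. In this sketch an `io.BytesIO` object (named `fake_stdout` here, purely as a stand-in) plays the role of the byte stream; on Python 3 the real byte stream would be `sys.stdout.buffer`, since `sys.stdout` itself is already a text stream.

```python
import codecs
import io

# Stand-in for a byte stream such as sys.stdout.buffer on Python 3.
fake_stdout = io.BytesIO()

# The writer encodes unicode text to UTF-8 before passing it on.
writer = codecs.getwriter('UTF-8')(fake_stdout)
writer.write(u"\u0394\n")

encoded = fake_stdout.getvalue()
```

On Python 3.7 and later, `sys.stdout.reconfigure(encoding='utf-8')` is another option that avoids replacing the stream object.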

sys.stdin's encoding

Python 3

Python 3 does not expect ASCII from sys.stdin.  It'll open stdin in text mode and make an educated guess as to what encoding is used.  That guess may come down to ASCII, but that is not a given. See the sys.stdin documentation.

Like other file objects opened in text mode, the sys.stdin object derives from the io.TextIOBase base class; it has a .buffer attribute pointing to the underlying buffered IO instance (which in turn has a .raw attribute).

Wrap the sys.stdin.buffer attribute in a new io.TextIOWrapper() instance to specify a different encoding:

import io
import sys

input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
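The effect of the chosen encoding can be tried without real input. In the sketch below an `io.BytesIO` object stands in for `sys.stdin.buffer` (an assumption made so the example is repeatable), and the same bytes are decoded once as UTF-8 and once as Latin-1.

```python
import io

data = u"\u0394x = 1\n".encode('utf-8')

# Stand-in for sys.stdin.buffer, decoded with the intended encoding.
input_stream = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8')
line = input_stream.readline()

# The same bytes read with the wrong encoding: no error, just mojibake.
wrong_stream = io.TextIOWrapper(io.BytesIO(data), encoding='latin-1')
wrong_line = wrong_stream.readline()
```

Decoding with Latin-1 never raises, because every byte is a valid Latin-1 character, so a wrong guess shows up only as garbled text.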