


Specify the encoding of files

Python source files

#!/usr/bin/env python  
# -*- coding: UTF-8 -*-
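
To check which encoding Python will actually pick up from such a header, the standard library's tokenize.detect_encoding can be pointed at the raw bytes. The sketch below is only an illustration; source_encoding is a hypothetical helper name, not something from the original snippet.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python would use to decode this source text."""
    # detect_encoding reads at most the first two lines, looking for a BOM
    # or a "coding:" comment, and falls back to utf-8 otherwise.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


declared = b"#!/usr/bin/env python\n# -*- coding: UTF-8 -*-\nprint('hi')\n"
print(source_encoding(declared))          # utf-8
print(source_encoding(b"print('hi')\n"))  # utf-8 (the Python 3 default)
```

The declaration has to sit on the first or second line, which is why it can follow the shebang line.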

XML source files

If nothing else is specified, the default encoding is UTF-8.

If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM): a UTF-16 BOM means UTF-16, otherwise the document is assumed to be UTF-8.

The BOM is a special Unicode marker placed at the very start of the file that indicates its encoding. The BOM is optional for UTF-8. To use any other encoding, declare it explicitly in the XML declaration:

<?xml version="1.0" encoding="iso-8859-1"?>
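
As a rough sketch of the BOM rule described above, the following Python function guesses the encoding of an XML document that carries no encoding declaration. The name guess_xml_encoding is hypothetical, and the sketch only distinguishes UTF-8 from UTF-16; it does not try to parse an XML declaration or handle other encodings.

```python
import codecs


def guess_xml_encoding(data):
    """Guess the encoding of XML bytes that have no encoding declaration."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Either byte order of the UTF-16 BOM signals UTF-16.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM at all: fall back to the XML default.
    return "utf-8"


print(guess_xml_encoding(b"<a/>"))                 # utf-8
print(guess_xml_encoding("<a/>".encode("utf-16"))) # utf-16
```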

HTML source files

When HTML documents are served there are three ways to tell the browser what specific character encoding is to be used for display to the reader. First, HTTP headers can be sent by the web server along with each web page (HTML document). A typical HTTP header looks like this:

Content-Type: text/html; charset=ISO-8859-1
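
On the receiving side, the charset parameter can be pulled out of such a header value with Python's standard email.message module. This is a minimal sketch; charset_from_content_type is a hypothetical helper name.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # None
```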

Second, for HTML (not usually XHTML), the document itself can include this information at its top, inside the HEAD element:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Third, XHTML documents can express the character encoding in the XML declaration, for example:

<?xml version="1.0" encoding="ISO-8859-1"?>

The encoding given in the HTTP header overrides any HTML (or XHTML) meta tag declaration, which can be a problem if the header is incorrect and one does not have the access or the knowledge to change it.

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<html lang="zh-CN">
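
The precedence rule (HTTP header first, meta tag second) can be sketched in Python as below. MetaCharsetFinder and pick_encoding are hypothetical names, header_charset is assumed to be the already-extracted charset from the HTTP header (or None), and only the http-equiv form of the meta tag is recognized.

```python
import re
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remembers the charset of the first <meta http-equiv="Content-Type"> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "content-type":
            return
        match = re.search(r"charset\s*=\s*([\w.:-]+)", attrs.get("content") or "", re.I)
        if match:
            self.charset = match.group(1).lower()


def pick_encoding(header_charset, html_text):
    """HTTP header charset wins; otherwise fall back to the meta tag, else None."""
    if header_charset:
        return header_charset.lower()
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    finder.close()
    return finder.charset


page = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>\n<html lang="zh-CN">'
print(pick_encoding("ISO-8859-1", page))  # iso-8859-1 (header overrides meta)
print(pick_encoding(None, page))          # utf-8
```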

You can also skip the encoding declaration, keep the file 7-bit ASCII, and write non-ASCII characters as &#NNNN; numeric character references, where NNNN is the decimal Unicode code point. Python's encode functions accept errors="xmlcharrefreplace" to do this instead of sending UTF-8. However, this is not a great approach: editors can't show you a readable representation of such a file, and it takes more space. Since browsers have had good UTF-8 support since IE 5.0, declaring UTF-8 is usually the better choice.
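
A short illustration of the xmlcharrefreplace error handler, with an arbitrary sample string chosen for this sketch. It also shows the size penalty compared with plain UTF-8.

```python
import html

text = "caf\u00e9 \u4e2d\u6587"  # sample text with three non-ASCII characters

# Characters that don't fit in ASCII become &#NNNN; references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'caf&#233; &#20013;&#25991;'

# The same text as UTF-8 is much shorter.
print(len(ascii_bytes), len(text.encode("utf-8")))  # 26 12

# html.unescape turns the references back into characters.
restored = html.unescape(ascii_bytes.decode("ascii"))
print(restored == text)  # True
```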

CSS source files

The encoding of a CSS file is determined according to the following rules, in order of priority:

  1. If the file is served over HTTP: by the charset parameter of the Content-Type header field.
  2. By the value of the @charset rule at the very top of the CSS file.
  3. By the declaration mechanism of the referencing document, if one exists. For example in XHTML: the charset attribute of the <link> element.

For example, to specify iso-8859-1 (Latin-1) encoding:

@charset "iso-8859-1";
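
The priority rules above can be sketched as a small Python function. The name css_encoding and its parameters are hypothetical; the caller is assumed to have already extracted the HTTP charset and the <link> charset, and the final UTF-8 fallback is an assumption of this sketch.

```python
import re

# @charset is only honored as the very first bytes of the stylesheet.
_CHARSET_RULE = re.compile(rb'^@charset "([^"]*)";')


def css_encoding(http_charset=None, css_bytes=b"", link_charset=None, default="utf-8"):
    """Pick a stylesheet encoding: HTTP header, then @charset, then <link charset>."""
    if http_charset:
        return http_charset.lower()
    match = _CHARSET_RULE.match(css_bytes)
    if match:
        return match.group(1).decode("ascii", "replace").lower()
    if link_charset:
        return link_charset.lower()
    return default


sheet = b'@charset "iso-8859-1";\nbody { margin: 0 }'
print(css_encoding(css_bytes=sheet))                        # iso-8859-1
print(css_encoding(http_charset="UTF-8", css_bytes=sheet))  # utf-8
```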

Lua source files

See http://lua-users.org/wiki/LuaUnicode

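Basically, Lua source code can contain Unicode text (for example UTF-8) inside string literals and comments. However, you can't use Unicode in Lua identifiers, since Lua uses isalpha and similar C functions to recognize them. The string library also provides no Unicode-aware operations: it treats strings as plain byte sequences and is mostly blind to their encoding. Lua 5.3 and later add only a small utf8 library with basic helpers.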
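
To show what "blind to it" means in practice, here is a Python analogy (not Lua code): operating on the raw UTF-8 bytes gives roughly the view a byte-oriented string library has, while operating on the decoded text gives the Unicode-aware view. The sample word is arbitrary.

```python
text = "na\u00efve"               # five characters, one of them non-ASCII
utf8_bytes = text.encode("utf-8")

print(len(text))        # 5 (characters)
print(len(utf8_bytes))  # 6 (bytes, which is what a byte-oriented length reports)

# Byte-level upper-casing only touches ASCII letters and leaves the
# two-byte sequence for the accented letter unchanged.
print(utf8_bytes.upper())  # b'NA\xc3\xafVE'
print(text.upper() == utf8_bytes.upper().decode("utf-8"))  # False
```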