- W3C I18n Site Index
- Language Codes: ISO 639 (wikipedia)
- Composed of 6 parts
- ISO 639-1: Alpha-2 code (official)
- ISO 639-2: Alpha-3 code (official)
- ISO 639-3: Alpha-3 code for comprehensive coverage (official)
- ISO 639-4: Implementation guidelines and general principles for language coding
- ISO 639-5: Alpha-3 code for language families and groups (official)
- ISO 639-6: Alpha-4 representation for comprehensive coverage of language variants (official)
- Capitalization/Casing Recommendations
- ISO639-1 recommends that language codes be written in lowercase. ('mn' Mongolian).
- ISO15924 recommends that script codes use lowercase with the initial letter capitalized. ('Cyrl' Cyrillic).
- ISO3166-1 recommends that region/country codes be capitalized. ('MN' Mongolia).
- All other subtags: prefer lowercase.
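The casing recommendations above can be applied mechanically. The following is a minimal sketch, assuming the positional rule from RFC 5646 section 2.1.1 (two-letter and four-letter subtags get special casing unless they start the tag or follow a single-character subtag); the function name `format_case` is hypothetical and the input is assumed to be already well-formed.

```python
def format_case(tag: str) -> str:
    """Apply the recommended casing to a well-formed language tag.

    Assumption: position is decided purely by subtag length, as in
    RFC 5646 section 2.1.1; no registry lookup is performed.
    """
    formatted = []
    after_singleton = False  # True once an extension/private-use singleton is seen
    for index, subtag in enumerate(tag.split("-")):
        is_letters = subtag.isascii() and subtag.isalpha()
        special_position = index > 0 and not after_singleton
        if special_position and is_letters and len(subtag) == 4:
            formatted.append(subtag.capitalize())  # script, e.g. 'Cyrl'
        elif special_position and is_letters and len(subtag) == 2:
            formatted.append(subtag.upper())       # region, e.g. 'MN'
        else:
            formatted.append(subtag.lower())       # everything else
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(formatted)


print(format_case("MN-cYRL-mn"))
```

Casing carries no meaning in a tag, so this is only for presentation; comparisons should still ignore case.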
- Language, Script, Region
- Country Codes: ISO 3166-1 (wikipedia)
- Country or Region codes: UN M.49
- Countries or areas, codes and abbreviations
- Composition of macro geographical (continental) regions, geographical sub-regions, and selected economic and other groupings
199→ least developed countries,
432→ landlocked developing countries,
722→ small island developing countries, etc.
- Country or area numerical codes added or changed since 1982
- Script codes (for writing): ISO 15924: Codes for the representation of names of scripts (wikipedia)
- Each script is given both a four-letter code and a numeric one.
- Script is defined as "set of graphic characters used for the written form of one or more languages".
- One could differentiate, for example, between Serbian written in the Cyrillic (sr-Cyrl) or Latin (sr-Latn) script, or mark romanized text as such.
- ISO 15924: Code Lists
- Additions and Changes to ISO 15924 Codes
- BCP47 (BCP = Best Current Practices) introduces
- unicode.org: A General Method for Rendering Combining Marks
- unicode.org: Unicode Line Breaking Algorithm
- ICU Home
- Language Plural Rules
- Slides: IUC 36: Plural & Gender & More in Translated Messages
- Library of Congress >> Standards >> ISO 639.2 >> Codes for the representation of names of languages
- CLDR - Unicode Common Locale Data Repository
- Google Developers: Internationalization
- Why flags should not be used to indicate language choice
- Closure JS Library
- OS X
- Interesting Languages for test cases
- Catalan, Romanian, Russian
- Catalan is interesting for gender
- Russian is interesting for pluralization
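To illustrate why Russian makes a useful pluralization test case, here is a minimal sketch of the plural categories for Russian, restricted to non-negative integers. The rules follow the CLDR integer rules for Russian as I understand them; the function name `plural_category_ru` and the sample word forms are assumptions for illustration, and a real application would use ICU or CLDR data instead.

```python
def plural_category_ru(n: int) -> str:
    """Return the plural category for a non-negative integer in Russian.

    Assumption: integers only; fractional numbers use a separate
    category ('other') that this sketch does not cover.
    """
    last_digit = n % 10
    last_two = n % 100
    if last_digit == 1 and last_two != 11:
        return "one"
    if 2 <= last_digit <= 4 and not 12 <= last_two <= 14:
        return "few"
    return "many"


# Hypothetical message forms for the word "file".
FORMS = {"one": "файл", "few": "файла", "many": "файлов"}

for count in (1, 2, 5, 11, 21, 22):
    print(count, FORMS[plural_category_ru(count)])
```

An English-style "1 vs. everything else" check gets 21, 22, and 11 wrong here, which is what makes the language a good test.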
Update: Looks like this is all nicely explained at Language tags in HTML and XML making most of the stuff below unnecessary. Why, oh why, is it so easy to find all those old docs and references but not something like this – the first useful and real reference to RFC 5646! Sigh.
- "Tag" refers to a complete language tag, such as "sr-Latn-RS" or "az-Arab-IR".
- "Subtag" refers to a specific section of a tag, delimited by a hyphen, such as the subtags 'zh', 'Hant', and 'CN' in the tag "zh-Hant-CN".
- "Code" refers to values defined in external standards (and that are used as subtags in this document).
- For example, 'Hant' is an ISO15924 script code that was used to define the 'Hant' script subtag for use in a language tag. Examples of codes in this document are enclosed in single quotes ('en', 'Hant').
Language tags are designed so that each subtag type has unique length and content restrictions. These make identification of the subtag's type possible, even if the content of the subtag itself is unrecognized. This allows tags to be parsed and processed without reference to the latest version of the underlying standards or the IANA registry and makes the associated exception handling when parsing tags simpler.
The general formats are listed below. (Refer to the RFC for the ABNF grammar, but note that it is not very helpful on its own. For example, the grammar for the extended language subtag allows 1-3 alpha-3 codes separated by hyphens, and the text later mentions that only 1 is legal; the other variations are permanently reserved and specifying them will always be invalid.)
- <PrimaryLanguage> is always the first subtag and is required.
- It can contain a hyphen. It's composed of <MacroLanguage>-<ExtendedLanguage> where at least one of them must be specified.
- <MacroLanguage> is either an alpha-2 code or, where an alpha-2 code doesn't exist, an alpha-3 code. Its value comes from the official IANA assignments list (look only for
- <ExtendedLanguage> can only be an alpha-3 code. Again, look at the official IANA assignments list and filter for records with
- Note that the IANA records for an extended language are always required to correspond to exactly one MacroLanguage. This is implicit when parsed.
- Parsing: How do you know when you're done parsing the PrimaryLanguage subtag (which can be one
- ExtendedLanguage can only be 2 or 3 characters and always ASCII letters. Specifically, they can't be numbers.
- If the next hyphenated piece isn't exactly 2 or 3 ASCII letters (letters, not digits), it's not part of the language subtag. Per the RFC's ABNF, a piece after the first is an extended language only if it is exactly 3 letters; a 2-letter piece in that position is a Region.
- This is because the following piece can only be one of these:
- Script is always 4 characters when present.
- Region is either 2 letters or 3 digits.
- Variant subtags that begin with a letter are at least 5 characters long, and those that begin with a digit are at least 4 characters long.
- Extension subtags are always preceded by a single-character subtag, so seeing such a subtag also means the end of parsing something "useful" (not just when parsing the Language subtag). You don't need this additional rule since it is covered by the first two rules.
- Script is optional. When present, it's always a 4-letter code defined in ISO 15924: List of four-letter script codes. Filter the official IANA assignments list for records with
- Region is optional. It identifies a country/region/geographic area. When present, it's always a 2-letter code or a 3-digit code. Filter the official IANA assignments list for records with Type: region. These are typically ISO 3166-1 alpha-2 codes and numeric codes from UN M.49.
- Variant subtags are used to indicate additional, well-recognized variations that define a language or its dialects and that are not covered by other available subtags. You can have 0 or more of these.
- Those that begin with a letter are at least 5 in length and those beginning with a digit are at least 4 in length. Further, they must be distinct (ignoring case).
- Filter the official IANA assignments for records with
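The length and content rules above are enough to classify subtags without consulting the registry. Here is a minimal sketch of such a shape-based parser. It is not a validator: the names `LanguageTag` and `parse_tag` are hypothetical, the primary language is assumed to be 2 or 3 letters, grandfathered and private-use-only tags are not handled, and anything from the first single-character subtag onward is returned unparsed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LanguageTag:
    language: str
    extlang: Optional[str] = None
    script: Optional[str] = None
    region: Optional[str] = None
    variants: List[str] = field(default_factory=list)
    rest: List[str] = field(default_factory=list)  # extensions / private use


def _letters(s: str, length: int) -> bool:
    return len(s) == length and s.isascii() and s.isalpha()


def _digits(s: str, length: int) -> bool:
    return len(s) == length and s.isascii() and s.isdigit()


def _is_variant(s: str) -> bool:
    if not (s.isascii() and s.isalnum()):
        return False
    # 5-8 alphanumerics, or exactly 4 when the first character is a digit.
    return 5 <= len(s) <= 8 or (len(s) == 4 and s[0].isdigit())


def parse_tag(tag: str) -> LanguageTag:
    parts = tag.split("-")
    if not (_letters(parts[0], 2) or _letters(parts[0], 3)):
        raise ValueError(f"unsupported primary language subtag: {parts[0]!r}")
    result = LanguageTag(language=parts[0].lower())
    i = 1
    if i < len(parts) and _letters(parts[i], 3):   # extended language
        result.extlang = parts[i].lower()
        i += 1
    if i < len(parts) and _letters(parts[i], 4):   # script
        result.script = parts[i].capitalize()
        i += 1
    if i < len(parts) and (_letters(parts[i], 2) or _digits(parts[i], 3)):
        result.region = parts[i].upper()           # region
        i += 1
    while i < len(parts) and _is_variant(parts[i]):
        variant = parts[i].lower()
        if variant in result.variants:             # must be distinct, ignoring case
            raise ValueError(f"duplicate variant: {variant!r}")
        result.variants.append(variant)
        i += 1
    result.rest = parts[i:]
    if result.rest and len(result.rest[0]) != 1:
        raise ValueError(f"unexpected subtag: {result.rest[0]!r}")
    return result


print(parse_tag("zh-cmn-Hant"))
print(parse_tag("de-CH-1996"))
```

Because each subtag type has a distinct shape, the parser never needs to look ahead more than one piece; checking that the values are actually registered is a separate step against the IANA list.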
Ref: ISO 639-3 and Macro
(Note that that page is a bit outdated and talks about stuff that has already happened. Also, the macro language is not required and instead, the variations such as zh-cmn-Hant are marked redundant and the preferred tags are instead just respectively.) (Aside: The redundant variations, along with optional preferred versions, can be obtained by filtering records in the official IANA assignments with Type: redundant. For example, many sign languages.)
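Filtering the registry for redundant tags can be sketched as follows. The registry is a plain-text file of records separated by `%%` lines, each record a set of `Field: value` lines. The sample below is an abbreviated, hand-written excerpt in that layout, not a copy of the live file; the helper names `parse_records` and `preferred_values` are hypothetical, and the field values shown should be checked against the current registry.

```python
import re
from typing import Dict, List

# Abbreviated sample in the registry's record layout (an assumption for
# illustration; the real file has more fields per record).
SAMPLE_REGISTRY = """\
%%
Type: redundant
Tag: sgn-US
Preferred-Value: ase
%%
Type: redundant
Tag: zh-cmn-Hant
Preferred-Value: cmn-Hant
%%
Type: language
Subtag: zh
Description: Chinese
Scope: macrolanguage
"""


def parse_records(text: str) -> List[Dict[str, str]]:
    """Split the text on '%%' lines and parse 'Field: value' pairs."""
    records = []
    for chunk in re.split(r"^%%\s*$", text, flags=re.MULTILINE):
        record: Dict[str, str] = {}
        key = None
        for line in chunk.splitlines():
            if not line.strip():
                continue
            if line[0].isspace() and key is not None:
                # Folded continuation of the previous field.
                record[key] += " " + line.strip()
                continue
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            # Repeated fields (e.g. Description) are joined with "; ".
            record[key] = f"{record[key]}; {value}" if key in record else value
        if record:
            records.append(record)
    return records


def preferred_values(records: List[Dict[str, str]]) -> Dict[str, str]:
    """Map each redundant tag (lowercased) to its preferred replacement."""
    return {
        r["Tag"].lower(): r["Preferred-Value"]
        for r in records
        if r.get("Type") == "redundant" and "Preferred-Value" in r
    }


print(preferred_values(parse_records(SAMPLE_REGISTRY)))
```

Not every redundant record carries a preferred value, which is why the lookup skips records without that field.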
ISO 639 has occasionally assigned codes to "macro-languages", which are language families that contain a number of recognizably related (but not necessarily mutually intelligible) languages. A good example of this is Chinese.
The ISO 639-1 code 'zh' identifies "Chinese", but the concept of Chinese encloses a number of distinct languages or dialects that share certain traits. While these languages are written very similarly, spoken content is very different indeed. The available regional options are poor proxies for the spoken dialects (many of which are confined to mainland China).
Mandarin Chinese (a spoken variation) is identified by the ISO 639-3 code
More from RFC 5646
- Some of the subtags in the IANA registry do not come from an underlying standard. These can only appear in specific positions in a tag—they can only occur as primary language subtags or as variant subtags.
- Sequences of private use and extension subtags MUST occur at the end of the sequence of subtags and MUST NOT be interspersed with subtags defined elsewhere in this document. These sequences are introduced by single-character subtags, which are reserved as follows:
- The single-letter subtag 'x' introduces a sequence of private use subtags.
- The single-letter subtag 'i' is used by some grandfathered tags, such as "i-default", where it always appears in the first position and cannot be confused with an extension.
- All other single-letter and single-digit subtags are reserved to introduce standardized extension subtag sequences as described in Section 3.7: Extensions and the Extensions Registry
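The singleton rules above can be sketched as a small splitter that separates the language part of a tag from its trailing extension and private-use sequences. This is a minimal sketch under stated assumptions: the function name `split_trailing_sequences` is hypothetical, everything is lowercased for simplicity, and a leading 'i' is treated as marking a grandfathered tag that is returned whole.

```python
from typing import Dict, List, Tuple


def split_trailing_sequences(
    tag: str,
) -> Tuple[List[str], Dict[str, List[str]], List[str]]:
    """Return (core subtags, extensions by singleton, private-use subtags)."""
    subtags = tag.lower().split("-")
    if subtags[0] == "i":
        # Grandfathered tag such as "i-default": 'i' is not an extension.
        return subtags, {}, []

    # The core runs up to the first single-character subtag.
    i = 0
    while i < len(subtags) and len(subtags[i]) != 1:
        i += 1
    core = subtags[:i]

    extensions: Dict[str, List[str]] = {}
    private_use: List[str] = []
    current = ""
    while i < len(subtags):
        subtag = subtags[i]
        if subtag == "x":
            # Everything after 'x' is private use, to the end of the tag.
            private_use = subtags[i + 1:]
            break
        if len(subtag) == 1:
            if subtag in extensions:
                raise ValueError(f"duplicate extension singleton: {subtag!r}")
            current = subtag
            extensions[current] = []
        else:
            extensions[current].append(subtag)
        i += 1
    return core, extensions, private_use


print(split_trailing_sequences("en-US-u-ca-gregory-x-internal"))
```

Since these sequences can only appear at the end, a single left-to-right pass is enough, and the core part can then be handed to whatever parses language, script, region, and variants.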
- Extended Language Subtags
- Extended language subtags consist solely of three-letter subtags.
- All extended language subtag records defined in the registry were defined according to the assignments found in ISO639-3.
- Language collections and groupings, such as defined in ISO639-5, are specifically excluded from being extended language subtags.