RFC 5646

Ref: RFC 5646: Tags for Identifying Languages (obsoletes RFC 4646)

Update:  Looks like this is all nicely explained at Language tags in HTML and XML, making most of the stuff below unnecessary.  Why, oh why, is it so easy to find all those old docs and references but not something like this: the first useful and real reference to RFC 5646!  Sigh.

2.2. Language Subtag Sources and Interpretation

Language tags are designed so that each subtag type has unique length and content restrictions.  These make identification of the subtag's type possible, even if the content of the subtag itself is unrecognized.  This allows tags to be parsed and processed without reference to the latest version of the underlying standards or the IANA registry and makes the associated exception handling when parsing tags simpler.
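To make the "unique length and content" point concrete, here is a minimal sketch that guesses each subtag's type from its shape alone, without consulting the registry. The function name classify_subtags and the rule table are my own; the shapes are paraphrased from the RFC 5646 ABNF. It does not validate subtag order or handle grandfathered tags such as i-klingon.

```python
import re

# Shape rules for subtags that follow the primary language subtag,
# paraphrased from the RFC 5646 ABNF (matched against lowercased text).
_SHAPES = [
    ("extlang", re.compile(r"[a-z]{3}")),
    ("script", re.compile(r"[a-z]{4}")),
    ("region", re.compile(r"[a-z]{2}|[0-9]{3}")),
    ("variant", re.compile(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}")),
]


def classify_subtags(tag):
    """Return (subtag, type) pairs, with the type guessed from shape only."""
    result = []
    mode = None  # None, "extension", or "privateuse"
    for index, subtag in enumerate(tag.split("-")):
        lowered = subtag.lower()
        if mode == "privateuse":
            # Everything after the 'x' singleton is private use.
            result.append((subtag, "privateuse"))
        elif len(lowered) == 1:
            # A single character starts an extension, or private use for 'x'.
            mode = "privateuse" if lowered == "x" else "extension"
            result.append((subtag, "singleton"))
        elif mode == "extension":
            result.append((subtag, "extension"))
        elif index == 0:
            kind = "language" if re.fullmatch(r"[a-z]{2,8}", lowered) else "unknown"
            result.append((subtag, kind))
        else:
            kind = next(
                (name for name, pattern in _SHAPES if pattern.fullmatch(lowered)),
                "unknown",
            )
            result.append((subtag, kind))
    return result


if __name__ == "__main__":
    for example in ("zh-cmn-Hant-TW", "de-CH-1996", "en-u-ca-gregory"):
        print(example, classify_subtags(example))
```

Note that none of the four shape patterns overlap, which is exactly what lets a parser label a subtag it has never seen before.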

The general formats are (refer to the RFC for the ABNF grammar, but note that it is not very helpful on its own. For example, the extended language subtag production allows one to three 3-letter codes separated by hyphens, but the text later states that only one is legal; the other variations are permanently reserved and specifying them will always be invalid):


Macro Languages

Ref:  ISO 639-3 and Macro Languages (Note that the page is a bit outdated and talks about changes that have already happened.  Also, the macro language prefix is not required; instead, variations such as zh-cmn and zh-cmn-Hant are marked redundant, and the preferred tags are just cmn and cmn-Hant respectively.)  (Aside: The redundant variations, along with their optional preferred values, can be obtained by filtering the records in the official IANA assignments list for Type: redundant.  Many sign language tags are examples.)
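The filtering mentioned in the aside could look roughly like the sketch below. It assumes the registry's plain-text record format (records separated by %% lines, "Field: value" pairs, folded continuation lines starting with whitespace). The SAMPLE text is a hand-typed, abbreviated excerpt for illustration, so check the values against the real registry; parse_records and redundant_tags are names I made up.

```python
# Hand-typed, abbreviated excerpt in the registry's record format.
# Real records carry more fields (Added, Deprecated, ...); values here
# are illustrative and should be checked against the actual registry.
SAMPLE = """\
%%
Type: language
Subtag: cmn
Description: Mandarin Chinese
Macrolanguage: zh
%%
Type: redundant
Tag: zh-cmn
Description: Mandarin Chinese
Preferred-Value: cmn
%%
Type: redundant
Tag: zh-cmn-Hant
Description: Mandarin Chinese (Traditional)
Preferred-Value: cmn-Hant
%%
Type: redundant
Tag: zh-Hans
Description: simplified Chinese
"""


def parse_records(text):
    """Split registry text into records mapping field name -> list of values."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # Record separator: store what we have and start a new record.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[:1].isspace() and last_key:
            # Folded continuation line: append to the previous field value.
            current[last_key][-1] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            # Fields such as Description may repeat, so keep a list.
            current.setdefault(last_key, []).append(value.strip())
    if current:
        records.append(current)
    return records


def redundant_tags(records):
    """Map each redundant tag to its preferred value, or None if it has none."""
    return {
        record["Tag"][0]: record.get("Preferred-Value", [None])[0]
        for record in records
        if record.get("Type") == ["redundant"]
    }


if __name__ == "__main__":
    for tag, preferred in redundant_tags(parse_records(SAMPLE)).items():
        print(tag, "->", preferred)
```

To run this against the full registry, replace SAMPLE with the contents of a downloaded copy of the registry file; the parsing logic should not need to change if the format assumption holds.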

ISO 639 has occasionally assigned codes to "macro-languages", which are language families that contain a number of recognizably related (but not necessarily mutually intelligible) languages.  A good example of this is Chinese.

The ISO 639-1 code 'zh' identifies "Chinese", but the concept of Chinese encloses a number of distinct languages or dialects that share certain traits.  While these languages are written very similarly, spoken content is very different indeed. The available regional options are poor proxies for the spoken dialects (many of which are confined to mainland China).

Mandarin Chinese (a spoken variation) is identified by the ISO 639-3 code cmn.

