UTF-8

UTF-8 stands for UCS Transformation Format, 8-bit. It is an upward-compatible and portable way to encode the characters of all languages on this planet.

A remark on control characters:
Of the control characters (range 0x00...0x1F and 0x7F), only the following are allowed:

0x0A: Line feed (Unix Way)
0x0C: Form feed (with intrinsic line feed)
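
The rule above can be sketched as a small byte filter. This is only an illustration: the function names is_allowed_byte and first_forbidden_control are hypothetical, and the sketch assumes that every control character other than 0x0A and 0x0C is to be rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if byte c passes the rule above: control characters
   (0x00..0x1F and 0x7F) are rejected, except line feed (0x0A) and
   form feed (0x0C). All other bytes pass. */
int is_allowed_byte(unsigned char c)
{
    if (c == 0x0A || c == 0x0C)
        return 1;
    if (c <= 0x1F || c == 0x7F)
        return 0;
    return 1;
}

/* Returns the index of the first disallowed control character in
   buf[0..len), or len if there is none. */
size_t first_forbidden_control(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        if (!is_allowed_byte(buf[i]))
            break;
    return i;
}
```

Because every byte of a multibyte UTF-8 sequence has the most significant bit set, this check can run on the raw bytes without decoding them first.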

UTF-8 has the following properties:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility).
This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00 to 0x7F) can appear as part of any other character.
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are at most three bytes long.
The sorting order of big-endian UCS-4 byte strings is preserved.
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
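
The first-byte and resynchronization properties can be illustrated with a short sketch. It classifies bytes purely by their bit pattern under the six-byte scheme described here; the names utf8_seq_len and utf8_resync are hypothetical, and the sketch does not check for overlong forms.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the sequence introduced by first byte b, or 0 if b cannot
   start a character (continuation bytes 0x80..0xBF and the unused
   bytes 0xFE and 0xFF). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: ASCII */
    if (b < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF8) return 4;   /* 11110xxx */
    if (b < 0xFC) return 5;   /* 111110xx */
    if (b < 0xFE) return 6;   /* 1111110x */
    return 0;                 /* 0xFE, 0xFF: never used */
}

/* Resynchronization: starting at pos, skip continuation bytes until a
   byte that is not a continuation byte. Returns len if none is found. */
size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL || len == 0);
    while (pos < len && (buf[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}
```

A reader that lands in the middle of a character (for example after a lost byte) only has to skip at most five continuation bytes to find the next character boundary; no earlier state is needed.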

Used in APE Tag Items

Weblinks:

Unicode.org
Glyph tables
Markus G. Kuhn's Unicode Page (University of Cambridge, UK)
UTF-8 sampler (Web browser test)
Codepages used by OS/2 and Windows

Windows API

MultiByteToWideChar - convert a multibyte string to a wide character string

    int
    MultiByteToWideChar ( UINT    CodePage,        // code page
                          DWORD   dwFlags,         // character-type options
                          LPCSTR  lpMultiByteStr,  // address of string to map
                          int     cchMultiByte,    // number of bytes in string
                          LPWSTR  lpWideCharStr,   // address of wide-character buffer
                          int     cchWideChar );   // size of buffer

Convert the string from the current locale's multibyte encoding to Unicode (WideChar), then encode the result to UTF-8 using the simple generic scheme below. The behaviour of the function depends on the locale settings of LOCAL_MACHINE/CURRENT_USER.

ISO API

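mbstowcs - convert a multibyte string to a wide character string
mbsrtowcs - convert a multibyte string to a wide character string (restartable; the conversion state is kept in ps)
mbsnrtowcs - like mbsrtowcs, but reads at most nms bytes from the source (a POSIX extension, not part of ISO C)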

    #include <stdlib.h>

    size_t  mbstowcs ( wchar_t* dst, const char* src, size_t maxlen );

    #include <wchar.h>

    size_t  mbsrtowcs  ( wchar_t* dst, const char** src,             size_t maxlen, mbstate_t* ps );
    size_t  mbsnrtowcs ( wchar_t* dst, const char** src, size_t nms, size_t maxlen, mbstate_t* ps );

The interface is very similar to the Windows API, but more cryptic and more difficult to understand.
Convert the string from the current locale's multibyte encoding to Unicode (wide character), then encode the result to UTF-8 using the simple generic scheme below. The behaviour of the functions depends on the locale setting in the environment variable $LC_CTYPE.
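
The first step (multibyte to wide character) might look like the sketch below. The wrapper name locale_to_wide and its buffer convention are hypothetical; only mbstowcs itself comes from the standard library. Whether the resulting wchar_t values are Unicode code points depends on the platform.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Converts the multibyte string src (in the encoding of the current
   LC_CTYPE locale) to wide characters. At most cap - 1 characters are
   stored and dst is always null-terminated. Returns the number of wide
   characters written (excluding the terminator), or (size_t)-1 if src
   contains an invalid multibyte sequence. */
size_t locale_to_wide(const char *src, wchar_t *dst, size_t cap)
{
    size_t n;

    assert(src != NULL && dst != NULL && cap > 0);
    n = mbstowcs(dst, src, cap - 1);
    if (n == (size_t)-1) {
        dst[0] = L'\0';   /* leave a defined, empty result on error */
        return n;
    }
    dst[n] = L'\0';       /* mbstowcs does not terminate a full buffer */
    return n;
}
```

A program would normally call setlocale(LC_CTYPE, "") from <locale.h> once at startup so that $LC_CTYPE takes effect; without that call the "C" locale is active and only its basic character set is guaranteed to convert.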

The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:

Unicode Glyph	Binary Representation of Glyph in Unicode	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
U-00000000 . . . U-0000007F	00000000 00000000 00000000 0xxxxxxx	0xxxxxxx
U-00000080 . . . U-000007FF	00000000 00000000 00000xxx xxyyyyyy	110xxxxx	10yyyyyy
U-00000800 . . . U-0000FFFF	00000000 00000000 xxxxyyyy yyzzzzzz	1110xxxx	10yyyyyy	10zzzzzz
U-00010000 . . . U-001FFFFF	00000000 000xxxyy yyyyzzzz zzuuuuuu	11110xxx	10yyyyyy	10zzzzzz	10uuuuuu
U-00200000 . . . U-03FFFFFF	000000xx yyyyyyzz zzzzuuuu uuvvvvvv	111110xx	10yyyyyy	10zzzzzz	10uuuuuu	10vvvvvv
U-04000000 . . . U-7FFFFFFF	0xyyyyyy zzzzzzuu uuuuvvvv vvssssss	1111110x	10yyyyyy	10zzzzzz	10uuuuuu	10vvvvvv	10ssssss

Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.

Examples: The Unicode character U+00A9 = 00010 101001 (copyright sign) is encoded in UTF-8 as

11000010 10101001 = 0xC2 0xA9

and character U+2260 = 0010 001001 100000 (not equal to) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0
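
The two examples can be checked in the other direction with a small decoder. This is a sketch under the same six-byte scheme; the name utf8_decode is hypothetical. It also rejects sequences that are longer than necessary, in line with the shortest-sequence rule above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one character from buf[0..len). On success stores the code
   point in *cp and returns the number of bytes consumed (1..6). Returns
   0 for a truncated, malformed, or overlong (non-shortest) sequence. */
int utf8_decode(const unsigned char *buf, size_t len, uint32_t *cp)
{
    /* Smallest code point that needs n bytes, indexed by n. */
    static const uint32_t min_cp[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    uint32_t v;
    int n, i;

    assert(buf != NULL && cp != NULL);
    if (len == 0)
        return 0;

    if      (buf[0] < 0x80) { *cp = buf[0]; return 1; }
    else if (buf[0] < 0xC0) return 0;                  /* stray continuation */
    else if (buf[0] < 0xE0) { n = 2; v = buf[0] & 0x1Fu; }
    else if (buf[0] < 0xF0) { n = 3; v = buf[0] & 0x0Fu; }
    else if (buf[0] < 0xF8) { n = 4; v = buf[0] & 0x07u; }
    else if (buf[0] < 0xFC) { n = 5; v = buf[0] & 0x03u; }
    else if (buf[0] < 0xFE) { n = 6; v = buf[0] & 0x01u; }
    else return 0;                                     /* 0xFE, 0xFF */

    if (len < (size_t)n)
        return 0;                                      /* truncated */
    for (i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return 0;                                  /* not 10xxxxxx */
        v = (v << 6) | (buf[i] & 0x3Fu);
    }
    if (v < min_cp[n])
        return 0;                                      /* overlong form */
    *cp = v;
    return n;
}
```

Fed the bytes 0xC2 0xA9 it yields U+00A9, and fed 0xE2 0x89 0xA0 it yields U+2260, while an overlong pair such as 0xC0 0x80 is refused.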

Code Charts (PDF version)

The charts in this list are arranged in code point order.
For an alphabetical index of characters and blocks, use the Unicode Character Names Index.

Basic Latin	Optical Character Recognition
Latin-1 Supplement	Enclosed Alphanumerics
Latin Extended-A	Box Drawing
Latin Extended-B	Block Elements
IPA Extensions	Geometric Shapes
Spacing Modifier Letters	Miscellaneous Symbols
Combining Diacritical Marks	Dingbats
Greek	Miscellaneous Mathematical Symbols-A
Cyrillic	Supplemental Arrows-A
Cyrillic Supplement	Braille Patterns
Armenian	Supplemental Arrows-B
Hebrew	Miscellaneous Mathematical Symbols-B
Arabic	Supplemental Mathematical Operators
Syriac	CJK Radicals Supplement
Thaana	Kangxi Radicals
Devanagari	Ideographic Description Characters
Bengali	CJK Symbols and Punctuation
Gurmukhi	Hiragana
Gujarati	Katakana
Oriya	Bopomofo
Tamil	Hangul Compatibility Jamo
Telugu	Kanbun
Kannada	Bopomofo Extended
Malayalam	Enclosed CJK Letters and Months
Sinhala	CJK Compatibility
Thai	CJK Unified Ideographs Extension A (1.5MB)
Lao	CJK Unified Ideographs (5MB)
Tibetan	Yi Syllables
Myanmar	Yi Radicals
Georgian	Hangul Syllables (7MB)
Hangul Jamo	High Surrogates
Ethiopic	Low Surrogates
Cherokee	Private Use Area
Unified Canadian Aboriginal Syllabic	CJK Compatibility Ideographs
Ogham	Alphabetic Presentation Forms
Runic	Arabic Presentation Forms-A
Tagalog	Variation Selectors
Hanunoo	Combining Half Marks
Buhid	CJK Compatibility Forms
Tagbanwa	Small Form Variants
Khmer	Arabic Presentation Forms-B
Mongolian	Halfwidth and Fullwidth Forms
Latin Extended Additional	Specials
Greek Extended	Old Italic
General Punctuation	Gothic
Superscripts and Subscripts	Deseret
Currency Symbols	Byzantine Musical Symbols
Combining Marks for Symbols	Musical Symbols
Letterlike Symbols	Mathematical Alphanumeric Symbols
Number Forms	CJK Unified Ideographs Extension B (13MB)
Arrows	CJK Compatibility Ideographs Supplement
Mathematical Operators	Tags
Miscellaneous Technical	Supplementary Private Use Area-A
Control Pictures	Supplementary Private Use Area-B

Microsoft Codepages (Windows 2000)

CP-0	ANSI code page (`CP_ACP`). Here mapped to CP-1252
CP-1	OEM code page (`CP_OEMCP`). Here mapped to CP-850
CP-2	Mac code page (`CP_MACCP`). Here mapped to CP-10000
CP-3	Thread code page (`CP_THREAD_ACP`). Here mapped to CP-1252
CP-37	???
CP-42	Symbols (`CP_SYMBOL`)
CP-437	MS-DOS: US
CP-500	MS-DOS: IBM EBCDIC International (do you know Tyrannus Rex of the Char Sets?)
CP-850	MS-DOS: Latin 1
CP-852	MS-DOS: Latin 2
CP-855	MS-DOS: Cyrillic
CP-857	MS-DOS: Turkish
CP-860	MS-DOS: Portugal
CP-861	MS-DOS: Iceland
CP-862	MS-DOS: Israel
CP-863	MS-DOS: Canadian French
CP-864	MS-DOS: Arabic
CP-865	MS-DOS: Nordic
CP-866	MS-DOS: Cyrillic - Russian
CP-869	MS-DOS: Greek
CP-874	Thai	874.gif
CP-932	Japanese	932gif.zip - 0.9 MB
CP-936	PRC GBK (XGB) (Simplified Chinese)	936gif.zip - 2.5 MB
CP-949	Korean Extended Wansung	949gif.zip - 2.1 MB
CP-950	Chinese (Taiwan, Hong Kong)	950gif.zip - 1.7 MB
CP-1250	Latin 2	1250.gif
CP-1251	Cyrillic	1251.gif
CP-1252	Latin 1	1252.gif
CP-1253	Greek	1253.gif
CP-1254	Latin 5	1254.gif
CP-1255	Hebrew	1255.gif
CP-1256	Arabic	1256.gif
CP-1257	Baltic	1257.gif
CP-1258	Vietnam	1258.gif
CP-10000	???
CP-10079	???
