UTF-8

UTF-8 stands for UCS Transformation Format, 8 bit. It is an upward-compatible (with ASCII) way to portably encode all languages on this planet.

Another remark:
The following control characters (Range: 0x00...0x1F, 0x7F) are allowed:



UTF-8 has the following properties:

Used in APE Tag Items


Weblinks:


Windows API

MultiByteToWideChar - convert a multibyte string to a wide-character string
    int
    MultiByteToWideChar ( UINT    CodePage,        // code page
                          DWORD   dwFlags,         // character-type options
                          LPCSTR  lpMultiByteStr,  // address of string to map
                          int     cchMultiByte,    // number of bytes in string
                          LPWSTR  lpWideCharStr,   // address of wide-character buffer
                          int     cchWideChar );   // size of buffer
Convert from the current locale (multibyte) to Unicode (wide character), then encode to UTF-8 using the simple generic scheme below. The behaviour of the function depends on the locale settings of LOCAL_MACHINE/CURRENT_USER.


ISO API

mbstowcs - convert a multibyte string to a wide character string
mbsrtowcs - convert a multibyte string to a wide character string (restartable)
mbsnrtowcs - like mbsrtowcs, but reads at most nms bytes of input (POSIX, not ISO C)
    #include <stdlib.h>

    size_t  mbstowcs ( wchar_t* dst, const char* src, size_t maxlen );

    #include <wchar.h>

    size_t  mbsrtowcs  ( wchar_t* dst, const char** src,             size_t maxlen, mbstate_t* ps );
    size_t  mbsnrtowcs ( wchar_t* dst, const char** src, size_t nms, size_t maxlen, mbstate_t* ps );

Interface is very similar to Windows API, but mr crptc t b mr dffclt t ndrstnd.
Convert from the current locale (multibyte) to Unicode (wide character), then encode to UTF-8 using the simple generic scheme below. The behaviour of the functions depends on the locale setting in the environment variable $LC_CTYPE.
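As a rough illustration of the ISO interface, the sketch below wraps mbsrtowcs in a hypothetical helper, mb_to_wide (the name and the allocate-and-return design are assumptions, not part of any standard). It first calls mbsrtowcs with a null destination to count the wide characters needed, then converts into a buffer of that size. Whether the resulting wchar_t values are Unicode code numbers is implementation-defined, so the later UTF-8 encoding step is only valid on platforms where they are.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated multibyte string in the
 * current LC_CTYPE encoding into a newly allocated wide string.
 * Returns NULL on an invalid sequence or if allocation fails.
 * The caller frees the result with free(). */
wchar_t *mb_to_wide(const char *src)
{
    mbstate_t st;
    const char *p = src;
    wchar_t *dst;
    size_t n;

    assert(src != NULL);

    /* Pass 1: with a null destination, mbsrtowcs only counts. */
    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1)
        return NULL;

    dst = malloc((n + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Pass 2: convert for real, including the terminating L'\0'. */
    p = src;
    memset(&st, 0, sizeof st);
    if (mbsrtowcs(dst, &p, n + 1, &st) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

A program would normally call setlocale(LC_CTYPE, "") once at startup so that $LC_CTYPE takes effect; without that call the conversion runs in the "C" locale.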


The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:

Unicode Glyph              Binary Representation of Glyph in Unicode  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
U-00000000 ... U-0000007F  00000000 00000000 00000000 0xxxxxxx        0xxxxxxx
U-00000080 ... U-000007FF  00000000 00000000 00000xxx xxyyyyyy        110xxxxx  10yyyyyy
U-00000800 ... U-0000FFFF  00000000 00000000 xxxxyyyy yyzzzzzz        1110xxxx  10yyyyyy  10zzzzzz
U-00010000 ... U-001FFFFF  00000000 000xxxyy yyyyzzzz zzuuuuuu        11110xxx  10yyyyyy  10zzzzzz  10uuuuuu
U-00200000 ... U-03FFFFFF  000000xx yyyyyyzz zzzzuuuu uuvvvvvv        111110xx  10yyyyyy  10zzzzzz  10uuuuuu  10vvvvvv
U-04000000 ... U-7FFFFFFF  0xyyyyyy zzzzzzuu uuuuvvvv vvssssss        1111110x  10yyyyyy  10zzzzzz  10uuuuuu  10vvvvvv  10ssssss
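The table above can be turned into code fairly directly. The following is a minimal sketch, assuming a 32-bit code number and a caller-supplied 6-byte buffer; the function name utf8_encode is made up for this example. It follows the six-row table as given here. Note that RFC 3629 later restricted UTF-8 to sequences of at most four bytes (code numbers up to U+10FFFF), which this sketch does not enforce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code number into buf (room for 6 bytes).
 * Returns the number of bytes written, or 0 if cp needs more than 31 bits. */
size_t utf8_encode(uint32_t cp, unsigned char buf[6])
{
    /* Marker bits of the first byte for a sequence of n bytes (index n). */
    static const unsigned char lead[7] =
        { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t n, i;

    assert(buf != NULL);

    /* Pick the shortest sequence that can hold cp (one table row each). */
    if      (cp <= 0x0000007F) n = 1;
    else if (cp <= 0x000007FF) n = 2;
    else if (cp <= 0x0000FFFF) n = 3;
    else if (cp <= 0x001FFFFF) n = 4;
    else if (cp <= 0x03FFFFFF) n = 5;
    else if (cp <= 0x7FFFFFFF) n = 6;
    else return 0;

    /* Fill the continuation bytes from the end: 10xxxxxx, 6 bits each. */
    for (i = n - 1; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }

    /* The remaining high bits go into the first byte below its marker. */
    buf[0] = (unsigned char)(lead[n] | cp);
    return n;
}
```

Choosing the row by comparing against the upper bound of each range is what guarantees the shortest possible sequence.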

Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
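Both rules in the paragraph above matter mostly when decoding. The sketch below is one possible way to apply them, not a reference decoder: utf8_seq_len derives the sequence length from the leading 1 bits of the first byte, and utf8_decode rejects any sequence that is longer than necessary for its code number. Both function names and the return conventions are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by the first byte, found by counting its leading
 * 1 bits. Returns 1 for 0xxxxxxx, 2..6 for lead bytes, and 0 for bytes that
 * cannot start a sequence (10xxxxxx, 0xFE, 0xFF). */
size_t utf8_seq_len(unsigned char first)
{
    size_t ones = 0;

    while (ones < 8 && (first & (0x80u >> ones)))
        ones++;
    if (ones == 0)
        return 1;
    if (ones == 1 || ones > 6)
        return 0;
    return ones;
}

/* Decode one sequence from s (len bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is truncated,
 * malformed, or not the shortest possible form. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code number that really needs n bytes (index n). */
    static const uint32_t min_cp[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t n, i;
    uint32_t v;

    assert(s != NULL && cp != NULL);

    if (len == 0)
        return 0;
    n = utf8_seq_len(s[0]);
    if (n == 0 || n > len)
        return 0;

    /* Payload bits of the first byte: everything below the marker. */
    v = (n == 1) ? s[0] : (uint32_t)(s[0] & (0x7Fu >> n));

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)      /* must look like 10xxxxxx */
            return 0;
        v = (v << 6) | (s[i] & 0x3Fu);
    }

    if (v < min_cp[n])                  /* overlong: a shorter form exists */
        return 0;

    *cp = v;
    return n;
}
```

With this check the two-byte sequence 0xC0 0x80, an overlong spelling of U+0000, is rejected rather than decoded.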

Examples: The Unicode character U+00A9 = 00010 101001 (copyright sign) is encoded in UTF-8 as

    11000010 10101001 = 0xC2 0xA9

and character U+2260 = 0010 001001 100000 (not equal to) is encoded as:

    11100010 10001001 10100000 = 0xE2 0x89 0xA0


Code Charts (PDF version)

The charts in this list are arranged in code point order.
For an alphabetical index of characters and blocks, use the Unicode Character Names Index.


Starts at  Block
0000       Basic Latin
0080       Latin-1 Supplement
0100       Latin Extended-A
0180       Latin Extended-B
0250       IPA Extensions
02B0       Spacing Modifier Letters
0300       Combining Diacritical Marks
0370       Greek
0400       Cyrillic
0500       Cyrillic Supplement
0530       Armenian
0590       Hebrew
0600       Arabic
0700       Syriac
0780       Thaana
0900       Devanagari
0980       Bengali
0A00       Gurmukhi
0A80       Gujarati
0B00       Oriya
0B80       Tamil
0C00       Telugu
0C80       Kannada
0D00       Malayalam
0D80       Sinhala
0E00       Thai
0E80       Lao
0F00       Tibetan
1000       Myanmar
10A0       Georgian
1100       Hangul Jamo
1200       Ethiopic
13A0       Cherokee
1400       Unified Canadian Aboriginal Syllabic
1680       Ogham
16A0       Runic
1700       Tagalog
1720       Hanunoo
1740       Buhid
1760       Tagbanwa
1780       Khmer
1800       Mongolian
1E00       Latin Extended Additional
1F00       Greek Extended
2000       General Punctuation
2070       Superscripts and Subscripts
20A0       Currency Symbols
20D0       Combining Marks for Symbols
2100       Letterlike Symbols
2150       Number Forms
2190       Arrows
2200       Mathematical Operators
2300       Miscellaneous Technical
2400       Control Pictures
2440       Optical Character Recognition
2460       Enclosed Alphanumerics
2500       Box Drawing
2580       Block Elements
25A0       Geometric Shapes
2600       Miscellaneous Symbols
2700       Dingbats
27D0       Miscellaneous Mathematical Symbols-A
27F0       Supplemental Arrows-A
2800       Braille Patterns
2900       Supplemental Arrows-B
2980       Miscellaneous Mathematical Symbols-B
2A00       Supplemental Mathematical Operators
2E80       CJK Radicals Supplement
2F00       Kangxi Radicals
2FF0       Ideographic Description Characters
3000       CJK Symbols and Punctuation
3040       Hiragana
30A0       Katakana
3100       Bopomofo
3130       Hangul Compatibility Jamo
3190       Kanbun
31A0       Bopomofo Extended
3200       Enclosed CJK Letters and Months
3300       CJK Compatibility
3400       CJK Unified Ideographs Extension A (1.5MB)
4E00       CJK Unified Ideographs (5MB)
A000       Yi Syllables
A490       Yi Radicals
AC00       Hangul Syllables (7MB)
D800       High Surrogates
DC00       Low Surrogates
E000       Private Use Area
F900       CJK Compatibility Ideographs
FB00       Alphabetic Presentation Forms
FB50       Arabic Presentation Forms-A
FE00       Variation Selectors
FE20       Combining Half Marks
FE30       CJK Compatibility Forms
FE50       Small Form Variants
FE70       Arabic Presentation Forms-B
FF00       Halfwidth and Fullwidth Forms
FFF0       Specials
10300      Old Italic
10330      Gothic
10400      Deseret
1D000      Byzantine Musical Symbols
1D100      Musical Symbols
1D400      Mathematical Alphanumeric Symbols
20000      CJK Unified Ideographs Extension B (13MB)
2F800      CJK Compatibility Ideographs Supplement
E0000      Tags
F0000      Supplementary Private Use Area-A
100000     Supplementary Private Use Area-B


Microsoft Codepages (Windows 2000)

See also header file winnls.h.

CP-0 ANSI code page (CP_ACP). Here mapped to CP-1252
CP-1 OEM code page (CP_OEMCP). Here mapped to CP-850
CP-2 Mac code page (CP_MACCP). Here mapped to CP-10000
CP-3 Thread code page (CP_THREAD_ACP). Here mapped to CP-1252
CP-37 ???
CP-42 Symbols (CP_SYMBOLS)
CP-437 MS-DOS: US
CP-500 MS-DOS: IBM EBCDIC International (do you know Tyrannus Rex of the Char Sets?)
CP-850 MS-DOS: Latin 1
CP-852 MS-DOS: Latin 2
CP-855 MS-DOS: Cyrillic
CP-857 MS-DOS: Turkish
CP-860 MS-DOS: Portugal
CP-861 MS-DOS: Iceland
CP-862 MS-DOS: Israel
CP-863 MS-DOS: Canadian French
CP-864 MS-DOS: Arabic
CP-865 MS-DOS: Nordic
CP-866 MS-DOS: Cyrillic - Russian
CP-869 MS-DOS: Greek
CP-874 Thai 874.gif
CP-932 Japanese 932gif.zip - 0.9 MB
CP-936 PRC GBK (XGB) (Simplified Chinese) 936gif.zip - 2.5 MB
CP-949 Korean Extended Wansung 949gif.zip - 2.1 MB
CP-950 Chinese (Taiwan, Hong Kong) 950gif.zip - 1.7 MB
CP-1250 Latin 2 1250.gif
CP-1251 Cyrillic 1251.gif
CP-1252 Latin 1 1252.gif
CP-1253 Greek 1253.gif
CP-1254 Latin 5 1254.gif
CP-1255 Hebrew 1255.gif
CP-1256 Arabic 1256.gif
CP-1257 Baltic 1257.gif
CP-1258 Vietnam 1258.gif
CP-10000 ???
CP-10079 ???



Information taken from Markus G. Kuhn's Unicode Page and Unicode.org glyph table