UTF-8 stands for UCS Transformation Format, 8-bit. It is an upward-compatible and portable way to encode all languages on this planet.
Another remark:
The following control characters (Range: 0x00...0x1F, 0x7F) are allowed:
UTF-8 has the following properties:
Used in APE Tag Items
int MultiByteToWideChar (
    UINT   CodePage,        // code page
    DWORD  dwFlags,         // character-type options
    LPCSTR lpMultiByteStr,  // address of string to map
    int    cchMultiByte,    // number of bytes in string
    LPWSTR lpWideCharStr,   // address of wide-character buffer
    int    cchWideChar );   // size of buffer

Convert current locale (Multibyte) to Unicode (WideChar) and then encode to UTF-8 using the simple generic scheme below. The behaviour of the function depends on the locale settings of the LOCAL_MACHINE/CURRENT_USER.
#include <stdlib.h>
size_t mbstowcs ( wchar_t* dst, const char* src, size_t maxlen );

#include <wchar.h>
size_t mbsrtowcs  ( wchar_t* dst, const char** src, size_t maxlen, mbstate_t* ps );
size_t mbsnrtowcs ( wchar_t* dst, const char** src, size_t nms, size_t maxlen, mbstate_t* ps );

Interface is very similar to Windows API, but mr crptc t b mr dffclt t ndrstnd.
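As a rough illustration of the first step (current locale to wide characters), here is a minimal sketch built on mbstowcs. The wrapper name locale_to_wide and its parameters are made up for this example and are not part of any library. It assumes the caller has already selected the locale (for instance with setlocale(LC_CTYPE, "")), and it relies on the documented behaviour that mbstowcs writes no terminating null when the return value equals maxlen. Whether the resulting wchar_t values are Unicode code points depends on the platform, so treat that as an assumption to check before feeding them into a UTF-8 encoder.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated multibyte string in the
 * current locale to wide characters.
 * cap is the capacity of dst in wide characters, terminator included.
 * Returns the number of wide characters written (terminator excluded),
 * or (size_t)-1 if the input is invalid or dst is too small. */
size_t locale_to_wide(const char *src, wchar_t *dst, size_t cap)
{
    size_t n;

    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return (size_t)-1;

    n = mbstowcs(dst, src, cap);
    if (n == (size_t)-1)
        return (size_t)-1;      /* invalid multibyte sequence */
    if (n == cap)
        return (size_t)-1;      /* buffer full, no room for the terminator */

    return n;                   /* n < cap, so dst[n] is L'\0' */
}
```

In the default "C" locale only plain ASCII input is certain to convert; a program that wants the user's settings would call setlocale first.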
The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:
Unicode Glyph | Binary Representation of Glyph in Unicode | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
U-00000000 . . . U-0000007F | 00000000 00000000 00000000 0xxxxxxx | 0xxxxxxx | | | | | |
U-00000080 . . . U-000007FF | 00000000 00000000 00000xxx xxyyyyyy | 110xxxxx | 10yyyyyy | | | | |
U-00000800 . . . U-0000FFFF | 00000000 00000000 xxxxyyyy yyzzzzzz | 1110xxxx | 10yyyyyy | 10zzzzzz | | | |
U-00010000 . . . U-001FFFFF | 00000000 000xxxyy yyyyzzzz zzuuuuuu | 11110xxx | 10yyyyyy | 10zzzzzz | 10uuuuuu | | |
U-00200000 . . . U-03FFFFFF | 000000xx yyyyyyzz zzzzuuuu uuvvvvvv | 111110xx | 10yyyyyy | 10zzzzzz | 10uuuuuu | 10vvvvvv | |
U-04000000 . . . U-7FFFFFFF | 0xyyyyyy zzzzzzuu uuuuvvvv vvssssss | 1111110x | 10yyyyyy | 10zzzzzz | 10uuuuuu | 10vvvvvv | 10ssssss |
Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
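The two rules in this paragraph can be sketched in C as follows. The function names utf8_seq_len and utf8_is_shortest are invented for this illustration. The first counts the leading 1 bits of a first byte; the second checks a decoded code number against the lower bound of the table row for its sequence length, which is one way (not the only one) to reject overlong sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence length announced by the first byte of a sequence:
 * 1 for 0xxxxxxx, 2..6 for a multibyte lead byte, and 0 for a byte that
 * cannot start a sequence (continuation byte 10xxxxxx, or 0xFE/0xFF). */
size_t utf8_seq_len(unsigned char first)
{
    size_t ones = 0;

    /* Count the leading 1 bits, starting at the most significant bit. */
    while (ones < 8 && (first & (0x80 >> ones)))
        ones++;

    if (ones == 0)
        return 1;               /* single-byte character */
    if (ones == 1 || ones > 6)
        return 0;               /* not a valid first byte */
    return ones;
}

/* Nonzero if a code number decoded from a len-byte sequence could not
 * have been written with fewer bytes (the shortest-form rule). */
int utf8_is_shortest(uint32_t cp, size_t len)
{
    /* Smallest code number of each table row, indexed by length. */
    static const uint32_t min_cp[7] = {
        0, 0x0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
    };

    assert(len >= 1 && len <= 6);
    return cp >= min_cp[len];
}
```

For instance, the byte pair 0xC0 0xAF decodes to the code number 0x2F, which fits in one byte, so utf8_is_shortest(0x2F, 2) reports it as not shortest form.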
Examples: The Unicode character U+00A9 = 00010 101001 (copyright sign) is encoded in UTF-8 as
11000010 10101001 = 0xC2 0xA9
and character U+2260 = 0010 001001 100000 (not equal to) is encoded as:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
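The table and the two examples above can be turned into a small encoder. The following is a minimal sketch, not a reference implementation: the name utf8_encode and its parameters are made up here, and it follows the six-row table on this page as written. Note that current standards (RFC 3629) restrict UTF-8 to code points up to U+10FFFF, that is to sequences of at most four bytes, and exclude the surrogate range; this sketch does not enforce either restriction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code number following the byte-sequence table.
 * out must have room for at least 6 bytes.
 * Returns the number of bytes written (1..6), or 0 if cp is larger
 * than 0x7FFFFFFF and therefore not representable. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    size_t len, i;
    unsigned char lead;

    assert(out != NULL);

    /* Picking the first row that fits always gives the shortest form. */
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    else if (cp <= 0x7FF)      { len = 2; lead = 0xC0; }
    else if (cp <= 0xFFFF)     { len = 3; lead = 0xE0; }
    else if (cp <= 0x1FFFFF)   { len = 4; lead = 0xF0; }
    else if (cp <= 0x3FFFFFF)  { len = 5; lead = 0xF8; }
    else if (cp <= 0x7FFFFFFF) { len = 6; lead = 0xFC; }
    else return 0;

    /* Continuation bytes are filled from the end, 6 payload bits each. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }

    /* The remaining high bits go into the first byte, after the prefix. */
    out[0] = (unsigned char)(lead | cp);
    return len;
}
```

Calling utf8_encode(0x00A9, buf) writes 0xC2 0xA9 and returns 2, and utf8_encode(0x2260, buf) writes 0xE2 0x89 0xA0 and returns 3, matching the two worked examples.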
Code Charts (PDF version)
The charts in this list are arranged in code point order.
For an alphabetical index of characters and blocks, use the Unicode Character Names Index.
Microsoft Codepages (Windows 2000)
See also header file winnls.h.
CP-0 | ANSI code page (CP_ACP). Here mapped to CP-1252 | |
CP-1 | OEM code page (CP_OEMCP). Here mapped to CP-850 | |
CP-2 | Mac code page (CP_MACCP). Here mapped to CP-10000 | |
CP-3 | Thread code page (CP_THREAD_ACP). Here mapped to CP-1252 | |
CP-37 | IBM EBCDIC US-Canada | |
CP-42 | Symbols (CP_SYMBOLS) | |
CP-437 | MS-DOS: US | |
CP-500 | MS-DOS: IBM EBCDIC International (do you know Tyrannus Rex of the Char Sets?) | |
CP-850 | MS-DOS: Latin 1 | |
CP-852 | MS-DOS: Latin 2 | |
CP-855 | MS-DOS: Cyrillic | |
CP-857 | MS-DOS: Turkish | |
CP-860 | MS-DOS: Portugal | |
CP-861 | MS-DOS: Iceland | |
CP-862 | MS-DOS: Israel | |
CP-863 | MS-DOS: Canadian French | |
CP-864 | MS-DOS: Arabic | |
CP-865 | MS-DOS: Nordic | |
CP-866 | MS-DOS: Cyrillic - Russian | |
CP-869 | MS-DOS: Greek | |
CP-874 | Thai | 874.gif |
CP-932 | Japanese | 932gif.zip - 0.9 MB |
CP-936 | PRC GBK (XGB) (Simplified Chinese) | 936gif.zip - 2.5 MB |
CP-949 | Korean Extended Wansung | 949gif.zip - 2.1 MB |
CP-950 | Chinese (Taiwan, Hong Kong) | 950gif.zip - 1.7 MB |
CP-1250 | Latin 2 | 1250.gif |
CP-1251 | Cyrillic | 1251.gif |
CP-1252 | Latin 1 | 1252.gif |
CP-1253 | Greek | 1253.gif |
CP-1254 | Latin 5 | 1254.gif |
CP-1255 | Hebrew | 1255.gif |
CP-1256 | Arabic | 1256.gif |
CP-1257 | Baltic | 1257.gif |
CP-1258 | Vietnam | 1258.gif |
CP-10000 | Macintosh Roman | |
CP-10079 | Macintosh Icelandic | |
Information taken from Markus G. Kuhn's Unicode Page and Unicode.org glyph table