What are multibyte characters?

Multibyte characters are another way to make internationalized programs easier to write. Specifically, they help support languages such as Chinese and Japanese that could never fit into eight-bit characters. If your programs will never need to deal with any language but English, you don’t need to know about multibyte characters.
Inconsiderate as it might seem, in a world full of people who might want to use your software, not everybody reads English. The good news is that there are standards for fitting the various special characters of European languages into an eight-bit character set. (The bad news is that there are several such standards, and they don’t agree.)
Go to Asia, and the problem gets more complicated. Some languages, such as Japanese and Chinese, have more than 256 characters. Those will never fit into any eight-bit character set. (An eight-bit character can store a number between 0 and 255, so it can have only 256 different values.)
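The good news is that the standard library has the beginnings of a solution to this problem. <stddef.h> defines a type, wchar_t, that is guaranteed to be wide enough to store any character in any character set the implementation's locales support. How wide that is depends on the implementation: wchar_t is typically 16 bits on Windows and 32 bits on most Unix-like systems. (Unicode code points run up to U+10FFFF, which takes 21 bits, so a 16-bit wchar_t can't hold every Unicode character in a single unit.) Whatever the width, it's better to trust that the compiler vendor got wchar_t right than to hard-code a short or an int and get in trouble when the size of that type changes.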
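Because the width of wchar_t varies, it can help to ask the compiler rather than guess. The following is a minimal sketch, not a definitive recipe; the function names wchar_bits and fits_in_wchar are made up for this illustration, while CHAR_BIT and WCHAR_MAX are standard macros.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* wchar_t, size_t */
#include <wchar.h>    /* WCHAR_MAX */

/* Width of wchar_t, in bits, on this implementation. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Hypothetical helper: nonzero if the given character code is no larger
   than WCHAR_MAX, so it can be stored in a single wchar_t. */
int fits_in_wchar(unsigned long code)
{
    return code <= (unsigned long)WCHAR_MAX;
}
```

For example, fits_in_wchar(0x1F600UL) is true where wchar_t is 32 bits and false where it is 16 bits.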
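A multibyte character is the other representation: a single character stored as a sequence of one or more bytes in an ordinary char string, using the encoding of the current locale. The mblen, mbtowc, and wctomb functions, declared in <stdlib.h>, work with these. mblen reports how many bytes the next multibyte character occupies, mbtowc converts one multibyte character into a wchar_t, and wctomb converts a wchar_t back into a multibyte character. See your compiler manuals for more information on these functions.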
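To make the three functions concrete, here is a small sketch under stated assumptions. The helper names count_mb_chars and round_trip_first_char are invented for this example; only mblen, mbtowc, and wctomb come from the standard library. Both helpers use whatever locale is currently in effect.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* mblen, mbtowc, wctomb */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wchar_t */

/* Count the characters (not bytes) in a multibyte string, using the
   encoding of the current locale. Returns -1 on an invalid sequence. */
long count_mb_chars(const char *s)
{
    long count = 0;
    size_t left = strlen(s);

    mblen(NULL, 0);                  /* reset any conversion state */
    while (left > 0) {
        int n = mblen(s, left);      /* bytes in the next character */
        if (n <= 0)
            return -1;               /* invalid or incomplete sequence */
        s += n;
        left -= (size_t)n;
        count++;
    }
    return count;
}

/* Convert the first multibyte character of s to a wchar_t and back.
   Writes the resulting bytes to out, which must have room for at least
   MB_LEN_MAX bytes, and returns the byte count. Returns -1 on error or
   if s is empty. */
int round_trip_first_char(const char *s, char *out)
{
    wchar_t wc;
    int n;

    mbtowc(NULL, NULL, 0);           /* reset any conversion state */
    n = mbtowc(&wc, s, strlen(s) + 1);
    if (n <= 0)
        return -1;

    wctomb(NULL, 0);                 /* reset any conversion state */
    return wctomb(out, wc);
}
```

In the default "C" locale these treat plain ASCII text as one byte per character, so count_mb_chars("hello") returns 5. To handle the user's own encoding, a program would normally call setlocale(LC_ALL, "") from <locale.h> first; what happens to non-ASCII bytes without that call depends on the implementation.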
