Normative Addendum 1 embodies C's reaction to both the limitations
and promises of international character sets.
Digraphs and the
<iso646.h> header were meant to improve the appearance of C
programs written in national variants of ISO 646 without, e.g., {
or } characters.
On the other end of the spectrum, the facilities
connected to <wchar.h> and <wctype.h>
extend the old Standard's barely adequate basis into a complete and
consistent set of utilities for handling wide characters and multibyte strings.
This document summarizes Normative Addendum 1. It is intended to quickly inform readers who are already familiar with the Standard; it does not, and cannot, introduce the complex subject matter behind NA1, nor can it replace the original document as a reference manual. (Nevertheless, it tries to be as accurate as possible, and its author would like to hear about any errors or omissions.)
__STDC_VERSION__ shall expand to
199409L.
(The Normative Addendum was formally registered with ISO in September 1994.)
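As a minimal sketch of how a program might test for this value at compile time (the function name has_na1_support is hypothetical, chosen only for illustration):

```c
/* Returns 1 if the implementation claims conformance to at least
   Normative Addendum 1 (__STDC_VERSION__ >= 199409L), else 0.
   has_na1_support is an illustrative name, not part of NA1. */
int has_na1_support(void)
{
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199409L
    return 1;
#else
    return 0;
#endif
}
```

The defined() test matters because implementations that predate NA1 do not define the macro at all.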
<: :> <% %> %: %:%:
These tokens behave identically to the tokens and preprocessing tokens:
[ ] { } # ##
respectively (except that they are spelled differently,
and so stringize differently).
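The following sketch shows both halves of that statement, assuming a compiler that implements NA1. The names sum_with_digraphs, STRINGIZE, and digraph_spelling are hypothetical and exist only for this illustration.

```c
%:include <stddef.h>

/* Sums the first n elements of an array. The body is spelled with
   digraphs: <% %> stand for { }, and <: :> stand for [ ]. */
int sum_with_digraphs(const int *values, size_t n)
<%
    int total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += values<:i:>;
    return total;
%>

#define STRINGIZE(x) #x

/* Stringizing keeps the spelling that was written, so this returns
   the two-character string "<:" rather than "[". */
const char *digraph_spelling(void)
{
    return STRINGIZE(<:);
}
```

Apart from the stringized spelling, the translation unit behaves exactly as if the ordinary punctuators had been used.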
%: and %:%:.
#define and &&
#define and_eq &=
#define bitand &
#define bitor |
#define compl ~
#define not !
#define not_eq !=
#define or ||
#define or_eq |=
#define xor ^
#define xor_eq ^=
These macro names are reserved for all purposes in translation
units that include the header,
but are not reserved in those that do not (this is
the same as for any other Standard macros).
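A short sketch of the macros in use; is_leap_year and clear_bits are hypothetical helper names chosen for the example.

```c
#include <iso646.h>

/* Returns 1 if year is a leap year in the Gregorian calendar.
   and, or, and not_eq expand to &&, ||, and != respectively. */
int is_leap_year(int year)
{
    return (year % 4 == 0 and year % 100 not_eq 0) or year % 400 == 0;
}

/* Clears the bits of mask in flags.
   and_eq expands to &= and compl expands to ~. */
unsigned clear_bits(unsigned flags, unsigned mask)
{
    flags and_eq compl mask;
    return flags;
}
```

Because these are macros, a translation unit that does not include <iso646.h> remains free to use and, or, and the rest as ordinary identifiers.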
wchar_t.
Not all code values have to represent a character;
those that do not must not appear in wide
strings that are converted to multibyte characters.
Code value 0 is reserved for the ``end of string''
indicator.
char).
A character can have representations in more than
one state, and can have more than one representation
in any given state. The representation
in different states can differ.
Not all byte sequences are necessarily valid;
an invalid sequence causes an encoding error
when interpreted (normally shown by setting errno
to EILSEQ).
However, for encodings used by other library functions, there are further restrictions:
fwprintf);
all these identifiers are declared by <wctype.h>
or <wchar.h>.
These identifiers are reserved with external linkage in
all the translation units of a program if and only if
any translation unit includes either of those
headers (thus changes in one translation unit may cause another
translation unit to invoke undefined behavior).
EILSEQ is added to the list of error
conditions (currently this list consists of EDOM
and ERANGE).
typedef ... wint_t;
WEOF (described
below).
It can be the same type as wchar_t.
typedef ... wctrans_t;
typedef ... wctype_t;
wctype_t represents a
classification of characters (like ``is lower case'' or
``is accented''), while wctrans_t
represents a character conversion (like ``change to upper case'' or
``remove any accent'').
wint_t.
It need not be negative or equal to EOF,
but it serves the same purpose:
the value, which must not be a valid wide character, is used to
represent an end of file or as an error indication.
LC_CTYPE category
of the current locale.
int iswalnum (wint_t);
int iswalpha (wint_t);
int iswcntrl (wint_t);
int iswdigit (wint_t);
int iswgraph (wint_t);
int iswlower (wint_t);
int iswprint (wint_t);
int iswpunct (wint_t);
int iswspace (wint_t);
int iswupper (wint_t);
int iswxdigit (wint_t);
WEOF or representable as
a wchar_t.
The function will
return nonzero if and only if the argument is a wide character of the
appropriate type.
The types are the same as for the <ctype.h>
functions, except that iswprint and iswgraph
are guaranteed to return false not only for
space (as their char counterparts do),
but for any character
that iswspace() considers white space.
Thus, in a locale where isgraph('\t') is true,
iswgraph(L'\t') is nevertheless false.
For the remaining nine functions the expression
(!isXXXXX(wctob(wc)) || iswXXXXX(wc))
is true for every wide
character.
That is, for any wide character which has a corresponding
single-byte character (which is what
wctob returns),
if the latter has the given property, then so does the
former.
Note that this is not a symmetric relationship.
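The implication can be checked mechanically. The sketch below does so for the "alpha" property over a range of wide character values; alpha_implication_holds is a hypothetical name, and the range to scan is left to the caller.

```c
#include <ctype.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Returns 1 if, for every wide character value in [0, limit) that has
   a single-byte equivalent, isalpha() on that byte implies iswalpha()
   on the wide character. Returns 0 on the first counterexample. */
int alpha_implication_holds(wint_t limit)
{
    wint_t wc;
    for (wc = 0; wc < limit; wc++) {
        int c = wctob(wc);   /* EOF if there is no single-byte form */
        if (c != EOF && isalpha(c) && !iswalpha(wc))
            return 0;
    }
    return 1;
}
```

The check only runs in one direction: a wide character may be alphabetic even when its single-byte form is not, which is why the relationship is not symmetric.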
wctype_t wctype (const char *);
int iswctype (wint_t, wctype_t);
isXXXXX
or iswXXXXX functions to
test for other properties (e.g. ``is a katakana character''),
it was felt that this cluttered the namespace (though the names are
all reserved) without being flexible enough for
future needs.
Instead, the committee introduced a mechanism that can be extended
at run-time.
wctype()
names a category to test for; wctype()
returns a wctype_t magic cookie that can
be handed to iswctype to test for the
named category, or zero if it does not recognize the
category.
The eleven built-in categories "alnum",
"alpha", ... "xdigit"
must be recognized by all
implementations.
Thus, iswctype(ch, wctype("punct"))
is the same as
iswpunct(ch).
The wctype_t value is only valid for the
LC_CTYPE category used to create it.
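A minimal sketch of the lookup-then-test pattern follows. The wrapper name in_category is hypothetical; it treats an unrecognized category name the same as a failed test, which is one possible policy among several.

```c
#include <wchar.h>
#include <wctype.h>

/* Returns 1 if wc belongs to the named category, and 0 if it does not
   or if the current locale does not recognize the category name. */
int in_category(wint_t wc, const char *category)
{
    wctype_t type = wctype(category);
    if (type == (wctype_t)0)
        return 0;                 /* unknown category */
    return iswctype(wc, type) != 0;
}
```

A program that tests many characters against the same category would normally call wctype() once and reuse the result, as long as the LC_CTYPE category is not changed in between.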
wint_t towlower (wint_t);
wint_t towupper (wint_t);
toupper and tolower. toupper('é') == 'E', towupper(L'é') == L'É'
wctrans_t wctrans (const char *);
wint_t towctrans (wint_t, wctrans_t);
wctype() and
iswctype() provide extensible tests.
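The conversion pair can be used in the same lookup-then-apply style. In this sketch map_char is a hypothetical wrapper; the mapping names "toupper" and "tolower" are the ones every implementation is required to recognize, and returning the character unchanged for an unknown name is an assumption of the example.

```c
#include <wchar.h>
#include <wctype.h>

/* Applies the named mapping to wc. If the current locale does not
   recognize the mapping name, wc is returned unchanged. */
wint_t map_char(wint_t wc, const char *mapping)
{
    wctrans_t trans = wctrans(mapping);
    if (trans == (wctrans_t)0)
        return wc;                /* unknown mapping */
    return towctrans(wc, trans);
}
```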
struct tm;
typedef ... size_t;
typedef ... wchar_t;
typedef ... wint_t;
#define NULL ...
#define WEOF ...
struct tm ;
it is still necessary to include <time.h>
before defining a
variable of this type.
typedef ... mbstate_t;
WCHAR_MAX and
WCHAR_MIN
wchar_t can hold.
They are integral
constant expressions of type wchar_t,
but not necessarily valid
as wide characters.
For example, if wchar_t is a typedef for
unsigned short, then
WCHAR_MIN will be zero
and WCHAR_MAX will
be the same as USHRT_MAX.
mbstate_t, and an orientation;
it can be byte-oriented, wide-oriented,
or unoriented.
When a stream is opened (including stdin etc.,
and calls to freopen), it is
unoriented.
The functions ungetc, fgetc,
fputc, and those defined to work through them,
change an unoriented stream to byte-oriented, and shall
not be called on a wide-oriented stream.
The functions ungetwc, fgetwc,
fputwc, and those defined to work through them,
change an unoriented stream to wide-oriented,
and shall not be called on a byte-oriented stream.
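The orientation rules can be observed with fwide, which is declared in <wchar.h>; calling it with a mode of 0 queries the orientation without changing it. The sketch below assumes tmpfile() is usable in the execution environment, and the names orientation_of and orientation_after are hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/* Classifies a stream: -1 byte-oriented, 0 unoriented, 1 wide-oriented.
   A mode argument of 0 makes fwide() a pure query. */
int orientation_of(FILE *stream)
{
    int r = fwide(stream, 0);
    return (r > 0) - (r < 0);
}

/* Opens a temporary file, optionally performs one output operation,
   and reports the resulting orientation.
   op: 0 = do nothing, 1 = fputc, 2 = fputwc.
   Returns -2 if the temporary file cannot be created. */
int orientation_after(int op)
{
    FILE *f = tmpfile();
    int result;

    if (f == NULL)
        return -2;
    if (op == 1)
        fputc('x', f);
    else if (op == 2)
        fputwc(L'x', f);
    result = orientation_of(f);
    fclose(f);
    return result;
}
```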
Wide binary streams shall obey the positioning restrictions of both text and binary streams. Positioning a wide-oriented stream within the middle of an existing character representation and then writing makes all following contents undefined.
The mbstate_t object associated with a
stream is saved by fgetpos and restored
by fsetpos.
The object is initialized when the stream is opened as if it were
an object declared with static lifetime (i.e. all
zeroes and null pointers).
The *scanf and *printf
functions have the ability to handle strings of
the opposite type to the majority (that is,
wide strings in fprintf etc.
and multibyte strings in fwprintf etc.).
These strings are converted to the majority form before
(for *printf) or after (for *scanf)
any other processing.
This conversion is done as if using calls to
mbrtowc or
wcrtomb,
but with an mbstate_t
object set to the initial state before each
such conversion.
wint_t fgetwc (FILE *);
mbrtowc
(using the stream's
mbstate_t object) until a complete wide
character has been read, or an error
occurs.
The character or WEOF is returned; the latter can indicate
end of file (the eof indicator is set), a read error (the error
indicator is set), or a conversion error (errno is set to
EILSEQ).
All other wide character
input is done as if via fgetwc.
wint_t fputwc (wchar_t, FILE *);
wcrtomb
(using the
stream's mbstate_t object)
and writes the resulting bytes to the stream.
The character or WEOF is
returned; the latter can indicate a write error (the error
indicator is set) or a conversion error
(errno is set to EILSEQ).
All other wide character output is done as if via fputwc.
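As an illustration of the two primitives together, the sketch below writes a wide string with fputwc and reads it back with fgetwc. It assumes tmpfile() is available and that every character of the string is representable in the current locale; wide_round_trip is a hypothetical name.

```c
#include <stdio.h>
#include <wchar.h>

/* Writes text to a temporary file with fputwc, rewinds, and reads it
   back with fgetwc. Returns the number of wide characters read back
   that match the original, or -1 on any failure or mismatch. */
long wide_round_trip(const wchar_t *text)
{
    FILE *f = tmpfile();
    long matched = 0;
    const wchar_t *p;
    wint_t wc;

    if (f == NULL)
        return -1;
    for (p = text; *p != L'\0'; p++) {
        if (fputwc(*p, f) == WEOF) {   /* write or conversion error */
            fclose(f);
            return -1;
        }
    }
    rewind(f);
    for (p = text; (wc = fgetwc(f)) != WEOF; p++) {
        if (*p == L'\0' || wc != (wint_t)*p) {
            fclose(f);
            return -1;
        }
        matched++;
    }
    /* WEOF must mean end of file here, not a read or encoding error. */
    if (ferror(f) || !feof(f))
        matched = -1;
    fclose(f);
    return matched;
}
```

Since WEOF is returned for three different conditions, the caller has to consult feof, ferror, or errno to tell them apart, as the final check does.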
fprintf (and
printf and sprintf):
%lc,
which requires a wint_t argument,
and %ls,
which requires a wchar_t *
argument.
%lc is equivalent to %ls called with
a two element array (the argument in
the first element, and zero in the second).
%ls converts the wide characters to bytes;
the precision indicates the maximum number of bytes
written (conversion will also stop on a zero wide character);
a partial multibyte character will not be output,
though complete trailing shift sequences might be.
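A small sketch of %ls with a precision, assuming the wide string contains only characters representable in the current locale. The name format_wide is hypothetical, and the caller is responsible for supplying a large enough buffer.

```c
#include <stdio.h>
#include <wchar.h>

/* Formats ws into out as "[...]" using %ls. precision is the maximum
   number of bytes taken from the converted string; a negative value
   means no limit. Returns the value returned by sprintf. */
int format_wide(char *out, const wchar_t *ws, int precision)
{
    return sprintf(out, "[%.*ls]", precision, ws);
}
```

For example, format_wide(buf, L"abcdef", 2) stores "[ab]" in buf when each wide character converts to a single byte.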
fscanf (and
scanf and sscanf):
%lc, %ls, and %l[;
all take a pointer to wchar_t,
and convert the input to multibyte representation after
matching.
(The qualified and unqualified conversions match the same input.)
int fwprintf (FILE *, const wchar_t *, ...);
int wprintf (const wchar_t *, ...);
int swprintf (wchar_t *, size_t, const wchar_t *, ...);
int vfwprintf (FILE *, const wchar_t *, va_list);
int vwprintf (const wchar_t *, va_list);
int vswprintf (wchar_t *, size_t, const wchar_t *, va_list);
fprintf,
including the extensions
above.
With %c, the character is converted
using btowc;
with %s, the string
is converted to wide characters before output.
With all formats, width and precision are measured in wide
characters.
The second argument of
swprintf is the number of elements
of the destination array
(including the terminating zero which is always written).
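The sketch below shows swprintf with its size argument and a %s conversion of a multibyte string. format_setting is a hypothetical name; the negative return on overflow reflects the behavior of swprintf when the result does not fit.

```c
#include <stddef.h>
#include <wchar.h>

/* Builds L"<name>=<value>" in out, which has room for n wide characters
   including the terminating zero. name is a multibyte string, so %s
   converts it to wide characters on the way out. Returns the number of
   wide characters written, not counting the terminator, or a negative
   value if the result does not fit. */
int format_setting(wchar_t *out, size_t n, const char *name, int value)
{
    return swprintf(out, n, L"%s=%d", name, value);
}
```

Unlike sprintf, the size argument makes overflow detectable, so callers should check for a negative result before using the buffer.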
int fwscanf (FILE *, const wchar_t *, ...);
int wscanf (const wchar_t *, ...);
int swscanf (const wchar_t *, const wchar_t *, ...);
fscanf,
including the extensions above.
With %c, %s, and %[,
the accepted input field will be converted
to its multibyte equivalent after being matched.
With all formats, width and precision are
measured in wide characters.
wchar_t *fgetws (wchar_t *, int, FILE *);
int fputws (const wchar_t *, FILE *);
wint_t getwc (FILE *);
wint_t getwchar (void);
wint_t putwc (wchar_t, FILE *);
wint_t putwchar (wchar_t);
wint_t ungetwc (wint_t, FILE *);
getwc
and putwc's FILE * argument.)
int fwide (FILE *, int);
double wcstod (const wchar_t *, wchar_t **);
long int wcstol (const wchar_t *, wchar_t **, int);
unsigned long int wcstoul (const wchar_t *, wchar_t **, int);
wchar_t *wcscpy (wchar_t *, const wchar_t *);
wchar_t *wcsncpy (wchar_t *, const wchar_t *, size_t);
wchar_t *wcscat (wchar_t *, const wchar_t *);
wchar_t *wcsncat (wchar_t *, const wchar_t *, size_t);
int wcscmp (const wchar_t *, const wchar_t *);
int wcscoll (const wchar_t *, const wchar_t *);
int wcsncmp (const wchar_t *, const wchar_t *, size_t);
size_t wcsxfrm (wchar_t *, const wchar_t *, size_t);
wchar_t *wcschr (const wchar_t *, wchar_t);
size_t wcscspn (const wchar_t *, const wchar_t *);
wchar_t *wcspbrk (const wchar_t *, const wchar_t *);
wchar_t *wcsrchr (const wchar_t *, wchar_t);
size_t wcsspn (const wchar_t *, const wchar_t *);
wchar_t *wcsstr (const wchar_t *, const wchar_t *);
size_t wcslen (const wchar_t *);
wchar_t *wmemchr (const wchar_t *, wchar_t, size_t);
int wmemcmp (const wchar_t *, const wchar_t *, size_t);
wchar_t *wmemcpy (wchar_t *, const wchar_t *, size_t);
wchar_t *wmemmove (wchar_t *, const wchar_t *, size_t);
wchar_t *wmemset (wchar_t *, wchar_t, size_t);
size_t wcsftime (wchar_t *, size_t, const wchar_t *, const struct tm *);
wchar_t *wcstok (wchar_t*, const wchar_t*, wchar_t**);
strtok,
but uses the object pointed to
by the third argument to keep state, rather than keeping it
internally as strtok does.
This change makes it possible to interleave
calls to wcstok over different input strings.
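The interleaving that the third argument allows can be sketched with a nested scan. The record and field separators, and the name count_fields, are assumptions made for the example.

```c
#include <stddef.h>
#include <wchar.h>

/* Counts the fields in text, where records are separated by L';' and
   the fields within a record by L','. Two wcstok scans are in progress
   at once, each with its own state pointer. text is modified in place. */
size_t count_fields(wchar_t *text)
{
    wchar_t *outer_state, *inner_state;
    wchar_t *record, *field;
    size_t count = 0;

    for (record = wcstok(text, L";", &outer_state);
         record != NULL;
         record = wcstok(NULL, L";", &outer_state)) {
        for (field = wcstok(record, L",", &inner_state);
             field != NULL;
             field = wcstok(NULL, L",", &inner_state))
            count++;
    }
    return count;
}
```

The same nesting written with strtok would fail, because the inner scan would overwrite the single hidden state that the outer scan depends on.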
mbstate_t
object that they keep their conversion state in.
Such an object can be set to all zeroes (e.g. by
assigning to it the value of an mbstate_t
object with static lifetime which has not been explicitly
initialized)
and is then in its initial state.
When an object is in the initial state
(no matter how this occurred),
it is prepared for conversion in either direction
(from multibyte to wide characters or vice versa)
starting in the initial state.
Once an object has left its initial state
(which happens whenever it is used with one
of the following functions unless the description says otherwise),
it shall only be used in the same
LC_CTYPE category [*]
and same direction as the previous call,
and shall not be used after a conversion error.
If a null pointer is passed, each
function uses its own internal object
which is initialized to all zeroes at program startup.
mbstate_t object associated with a stream is bound
to an encoding by the first fgetwc or fputwc
call after the stream is opened, and can then be used with any locale.
wint_t btowc (int);
unsigned char)
to the corresponding wide character, if any, or else returns
WEOF.
int wctob (wint_t);
EOF.
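A sketch of the two single-character conversions used together; byte_round_trips is a hypothetical name, and the result depends on the current locale.

```c
#include <stdio.h>
#include <wchar.h>

/* Returns 1 if the byte c converts to a wide character with btowc and
   back to the same byte with wctob. Returns 0 if c has no wide
   equivalent or the round trip changes it. */
int byte_round_trips(unsigned char c)
{
    wint_t wc = btowc(c);
    if (wc == WEOF)
        return 0;
    return wctob(wc) == (int)c;
}
```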
int mbsinit (const mbstate_t *);
mbstate_t object is
in the initial state (the object is unaffected).
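One way to put an mbstate_t object into the initial state, and to confirm it with mbsinit, is sketched below. It uses assignment from an object with static lifetime, as described earlier; reset_state is a hypothetical name.

```c
#include <wchar.h>

/* Puts *state into the initial conversion state by copying an object
   with static lifetime that was never explicitly initialized, and is
   therefore all zeroes. */
void reset_state(mbstate_t *state)
{
    static mbstate_t initial;
    *state = initial;
}
```

After reset_state(&st), mbsinit(&st) returns nonzero, and st may be used for conversion in either direction.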
size_t mbrlen (const char *s, size_t n, mbstate_t *pcs);
mbrtowc(NULL,
s, n,
pcs), except
that it uses its own internal mbstate_t object,
not that of mbrtowc, when given a null pointer.
size_t mbrtowc (wchar_t *ws, const char *s, size_t n,
mbstate_t *pcs);
s (inspecting no
more than n bytes) to a wide character.
If ws is not a null pointer, the wide character
is stored in *ws.
If s is a null pointer, mbrtowc
ignores ws and n and acts as if the first
three arguments are a null pointer, an empty string, and 1 respectively.
(size_t)-2
mbstate_t, but no
complete wide character has been found.
(size_t)-1
0
mbstate_t object has been restored to the initial state.
mbstate_t object has been updated.
mbstate_t object; the inspected
bytes do not need to be
passed to the function a second time.
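The return values of mbrtowc lend themselves to a simple scanning loop. The sketch below counts the characters in a byte buffer; count_chars is a hypothetical name, and passing the two error values straight back to the caller is a choice made for brevity.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Counts the multibyte characters in the first len bytes of s, stopping
   early at a zero byte. Returns (size_t)-1 on an encoding error and
   (size_t)-2 if the input ends in the middle of a character. */
size_t count_chars(const char *s, size_t len)
{
    mbstate_t state;
    size_t count = 0;

    memset(&state, 0, sizeof state);      /* initial conversion state */
    while (len > 0) {
        size_t used = mbrtowc(NULL, s, len, &state);
        if (used == (size_t)-1 || used == (size_t)-2)
            return used;
        if (used == 0)                    /* zero wide character found */
            break;
        s += used;
        len -= used;
        count++;
    }
    return count;
}
```

A caller that reads input in chunks could instead keep the state object after a (size_t)-2 result and continue with the next chunk, since the bytes already inspected do not need to be passed again.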
size_t wcrtomb (char *, wchar_t, mbstate_t *);
MB_CUR_MAX bytes and
places them in the array pointed to by the
first argument; if the wide character is zero,
the resulting sequence will end in the initial
state,
followed by a zero byte, and the mbstate_t
object will be in the initial state.
wcrtomb returns the number of bytes written to the
character buffer, or (size_t)-1 to indicate an encoding
error (errno is set to EILSEQ).
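A minimal sketch of a single conversion with a fresh state object; encode_one is a hypothetical name, and the buffer size requirement is the caller's responsibility.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Converts one wide character into buf, which must have room for at
   least MB_CUR_MAX bytes (MB_LEN_MAX is always enough). Starts from
   the initial conversion state. Returns the number of bytes written,
   or (size_t)-1 on an encoding error. */
size_t encode_one(char *buf, wchar_t wc)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);
    return wcrtomb(buf, wc, &state);
}
```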
size_t mbsrtowcs (wchar_t *ws, const char **ps, size_t n,
mbstate_t *pcs);
*ps to wide characters.
The result is either (size_t)-1 if a
conversion error occurs (in which case errno is set to
EILSEQ), or else the number
of multibyte characters successfully converted.
ws is a null pointer,
processing stops at the end of the string
(the terminating zero byte is not counted in the returned value),
and *pcs will be set to the initial state.
ws is not a null pointer,
the resulting wide character sequence
is stored in the array it points to.
Conversion stops when:
n wide characters have been stored;
*pcs will be set to the conversion state
after processing the indicated number of bytes,
and *ps will point to the first unprocessed byte
*pcs will be set to the initial state,
*ps will be set to a null pointer, and a zero
wide character will have been stored.
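The two stopping conditions can be told apart by inspecting the updated source pointer, as in this sketch. The name to_wide and the complete flag are hypothetical conveniences added for the example.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Converts the multibyte string src into out, which has room for n
   wide characters. Returns the value returned by mbsrtowcs. *complete
   is set to 1 if conversion reached the terminating zero (the source
   pointer was set to a null pointer), and to 0 otherwise. */
size_t to_wide(wchar_t *out, size_t n, const char *src, int *complete)
{
    mbstate_t state;
    size_t result;

    memset(&state, 0, sizeof state);      /* initial conversion state */
    result = mbsrtowcs(out, &src, n, &state);
    *complete = (result != (size_t)-1 && src == NULL);
    return result;
}
```

When *complete is 0 and no error occurred, the output array is full and is not zero-terminated, so the caller must not treat it as a wide string.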
size_t wcsrtombs (char *s, const wchar_t **pws, size_t n,
mbstate_t *pcs);
pws
to a multibyte character sequence.
The result is either (size_t)-1 if a conversion
error occurs (in which case errno is set to
EILSEQ), or else the number of bytes in the
resulting multibyte string.
Processing of the wide string stops either when a zero wide
character - indicating the end of the wide string - is reached
(the resulting multibyte string will end with a zero byte
which is not included in the returned result), or (if s
is not a null pointer) when it is not possible to process another wide
character without placing more than n bytes into the
array pointed to by
s. In the first case, *pcs
will be left in the initial state.
If s is a null pointer, the value of n
is ignored. Otherwise *pws will
be set to either a null pointer (if conversion stopped on a
zero wide character) or a pointer to the first unprocessed
wide character. In the latter case, the returned
value will be at least (n-MB_CUR_MAX+1).
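Because n is ignored when s is a null pointer, a first call can be used to measure the output before allocating it. The following is a sketch of that two-pass approach; to_multibyte is a hypothetical name, and the caller owns the returned memory.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Returns a newly allocated multibyte copy of ws, or NULL on an
   encoding error or allocation failure. The caller must free it. */
char *to_multibyte(const wchar_t *ws)
{
    mbstate_t state;
    const wchar_t *p = ws;
    size_t need;
    char *out;

    memset(&state, 0, sizeof state);
    need = wcsrtombs(NULL, &p, 0, &state);    /* measure only */
    if (need == (size_t)-1)
        return NULL;

    out = malloc(need + 1);                   /* +1 for the zero byte */
    if (out == NULL)
        return NULL;

    p = ws;
    memset(&state, 0, sizeof state);
    wcsrtombs(out, &p, need + 1, &state);     /* convert for real */
    return out;
}
```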
<wctype.h> reserves function names beginning
with is or to followed by a lowercase
letter.
<wchar.h> reserves function names beginning
with wcs followed by a lowercase letter.
Lowercase letters are reserved as conversion
specifiers for fwprintf and fwscanf.