php-src/README.UNICODE-UPGRADES

This document attempts to describe portions of the API related to the new
Unicode functionality and the best practices for upgrading existing
functions to support Unicode.

Your first stop should be README.UNICODE: it covers the general Unicode
functionality and concepts without going into technical implementation
details.

Working in Unicode World
========================

Strings
-------

A lot of internal functionality is controlled by the unicode.semantics
switch. Its value is found in the Unicode globals variable, UG(unicode). It
is either on or off for the entire request.

The big thing is that there are two new string types: IS_UNICODE and
IS_BINARY. The former one has its own storage in the value union part of
zval (value.ustr) and the latter re-uses value.str.

Both types have new macros to set the zval value and to access it.

Z_USTRVAL(), Z_USTRLEN()
 - accesses the value and length (in code units) of the Unicode type string

Z_BINVAL(), Z_BINLEN()
 - accesses the value and length of the binary type string

Z_UNIVAL(), Z_UNILEN()
 - accesses either Unicode or native string value, depending on the current
 setting of UG(unicode) switch. The Z_UNIVAL() type resolves to char*, so
 you may need to cast it appropriately.

Z_USTRCPLEN()
 - gives the number of codepoints in the Unicode type string

ZVAL_BINARY(), ZVAL_BINARYL()
 - Sets zval to hold a binary string. Takes the same parameters as
   ZVAL_STRING(), ZVAL_STRINGL().

ZVAL_UNICODE(), ZVAL_UNICODEL()
 - Sets zval to hold a Unicode string. Takes the same parameters as
   ZVAL_STRING(), ZVAL_STRINGL().

ZVAL_ASCII_STRING(), ZVAL_ASCII_STRINGL()
 - When UG(unicode) is off, it's equivalent to ZVAL_STRING(),
   ZVAL_STRINGL(). When UG(unicode) is on, it sets zval to hold a Unicode
   representation of the passed-in ASCII string. It will always create a
   new string in the UG(unicode)=1 case, so the value of the duplicate
   flag is not taken into account.

ZVAL_RT_STRING()
 - When UG(unicode) is off, it's equivalent to ZVAL_STRING(),
   ZVAL_STRINGL(). When UG(unicode) is on, it takes the input string,
   converts it to Unicode using the runtime_encoding converter and sets
   zval to it. Since a new string is always created in this case, the
   value of the duplicate flag does not matter.

ZVAL_TEXT()
 - This macro sets the zval to hold either a Unicode or a normal string,
   depending on the value of UG(unicode). No conversion happens, so the
   argument has to be cast to (char*) when using this macro. One example of
   its usage would be to initialize zval to hold the name of a user
   function.
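
To make the relationship between the two storage slots more concrete, here
is a small self-contained sketch. Everything in it is a hypothetical
stand-in: sketch_zval is not the real zval, the SK_* tags are not the real
type constants, and sk_uchar merely approximates ICU's UChar. It only
mirrors the behaviour described above for ZVAL_ASCII_STRING() and
Z_UNIVAL().

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint16_t sk_uchar;                   /* stand-in for ICU's UChar */

enum { SK_STRING, SK_BINARY, SK_UNICODE };   /* hypothetical type tags */

typedef struct {
    int type;
    union {
        struct { char *val; int32_t len; } str;       /* native and binary */
        struct { sk_uchar *val; int32_t len; } ustr;  /* len in code units */
    } value;
} sketch_zval;

/* Roughly what ZVAL_ASCII_STRING() is described as doing: keep the bytes
 * when Unicode mode is off, widen them to UTF-16 code units when it is on.
 * Returns 0 on success, -1 if the allocation fails. */
static int sketch_set_ascii(sketch_zval *z, const char *s, int unicode_on)
{
    int32_t i, len = (int32_t)strlen(s);

    if (unicode_on) {
        sk_uchar *u = malloc(((size_t)len + 1) * sizeof(sk_uchar));
        if (!u) return -1;
        for (i = 0; i < len; i++) {
            u[i] = (sk_uchar)(unsigned char)s[i];   /* ASCII maps 1:1 */
        }
        u[len] = 0;
        z->type = SK_UNICODE;
        z->value.ustr.val = u;
        z->value.ustr.len = len;
    } else {
        char *c = malloc((size_t)len + 1);
        if (!c) return -1;
        memcpy(c, s, (size_t)len + 1);
        z->type = SK_STRING;
        z->value.str.val = c;
        z->value.str.len = len;
    }
    return 0;
}

/* Z_UNIVAL()-style access: a single char* view that the caller casts
 * back to the real element type. */
static char *sketch_unival(sketch_zval *z)
{
    return z->type == SK_UNICODE ? (char *)z->value.ustr.val
                                 : z->value.str.val;
}
```

The caller still has to know which mode it is in before dereferencing the
result of the accessor, which is why the cast is needed.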

There are, of course, related conversion macros.

convert_to_string_with_converter(zval *op, UConverter *conv)
 - converts a zval to native string using the specified converter, if necessary.

convert_to_binary()
 - converts a zval to binary string.

convert_to_unicode()
 - converts a zval to Unicode string.

convert_to_unicode_with_converter(zval *op, UConverter *conv)
 - converts a zval to Unicode string using the specified converter, if
   necessary.

convert_to_text(zval *op)
 - converts a zval to either Unicode or native string, depending on the
   value of UG(unicode) switch

The zend_ascii_to_unicode() function can be used to convert an ASCII char*
string to Unicode. This is especially useful for inline string literals, in
which case you can simply use the USTR_MAKE() macro, e.g.:

   UChar* ustr;

   ustr = USTR_MAKE("main");

If you need to initialize a few such variables, it may be more efficient to
use ICU macros, which avoid the conversion, depending on the platform. See
[1] for more information.

USTR_FREE() can be used to free a UChar* string safely, since it checks for
NULL argument. USTR_LEN() takes either a UChar* or a char* argument,
depending on the UG(unicode) value, and returns its length. Cast the
argument to char* before passing it.

The list of functions that add new array values and add object properties
has also been expanded to include the new types. Please see zend_API.h for
full listing (add_*_ascii_string_*, add_*_rt_string_*, add_*_unicode_*,
add_*_binary_*).

The UBYTES() macro can be used to obtain the number of bytes necessary to
store the given number of UChar's. The typical usage is:

    char *constant_name = colon + (UG(unicode)?UBYTES(2):2);
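
The arithmetic behind that line can be sketched in isolation. SK_UBYTES()
below is only an assumed shape for UBYTES() (code units times the size of
one unit), sk_uchar stands in for UChar, and sketch_skip() is a
hypothetical helper written for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t sk_uchar;      /* stand-in for ICU's UChar */

/* Assumed shape of UBYTES(): number of code units -> number of bytes. */
#define SK_UBYTES(n) ((size_t)(n) * sizeof(sk_uchar))

/* Step over nunits characters in a buffer that is addressed as char* but
 * holds UChar's when unicode_on is nonzero and plain bytes otherwise. */
static const char *sketch_skip(const char *p, size_t nunits, int unicode_on)
{
    return p + (unicode_on ? SK_UBYTES(nunits) : nunits);
}
```

Skipping the two characters of "::" therefore advances the char* pointer by
four bytes in Unicode mode and by two bytes otherwise.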


Code Points and Code Units
--------------------------

Unicode type strings are in the UTF-16 encoding, where 1 Unicode character
may be represented by 1 or 2 UChar's. Each UChar is referred to as a "code
unit", and a full Unicode character as a "code point". Consequently, the
number of code units and the number of code points for the same Unicode
string may be different. This has many implications, the most important of
which is that you cannot simply index the UChar* string to get the desired
codepoint.

The zval's value.ustr.len actually contains the number of code units. To
obtain the number of code points, one can use the u_countChar32() ICU API
function or the Z_USTRCPLEN() macro.
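
To see why the two counts differ, here is a minimal sketch of counting code
points in a UTF-16 buffer without ICU. It is not the ICU implementation;
sk_uchar, the SK_IS_* macros and sketch_count_codepoints() are hypothetical
names, and counting an unpaired surrogate as one code point is a choice
made for this sketch.

```c
#include <stdint.h>

typedef uint16_t sk_uchar;      /* stand-in for ICU's UChar */

#define SK_IS_LEAD(c)  (((c) & 0xFC00) == 0xD800)
#define SK_IS_TRAIL(c) (((c) & 0xFC00) == 0xDC00)

/* Counts code points in the first len code units of s. A lead surrogate
 * followed by a trail surrogate counts once; any other code unit,
 * including an unpaired surrogate, counts as one code point. */
static int32_t sketch_count_codepoints(const sk_uchar *s, int32_t len)
{
    int32_t i = 0, count = 0;

    while (i < len) {
        if (SK_IS_LEAD(s[i]) && i + 1 < len && SK_IS_TRAIL(s[i + 1])) {
            i += 2;             /* surrogate pair: one code point */
        } else {
            i += 1;             /* BMP character or unpaired surrogate */
        }
        count++;
    }
    return count;
}
```

For the four code units 0x0061 0xD800 0xDDA2 0x0062 this yields three code
points, because the middle two units form one surrogate pair.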

ICU provides a number of macros for working with UTF-16 strings on the
codepoint level [2]. They allow you to do things like obtain a codepoint at
random code unit offset, move forward and backward over the string, etc.
There are two versions of the iterator macros: the safe ones, which in
utf16.h carry no suffix (e.g. U16_NEXT()), and the *_UNSAFE ones. It is
strongly recommended to use the safe versions, since they handle unpaired
surrogates and check for string boundaries. Here is an example of how to
move through a UChar* string and work on codepoints.

    UChar *str = ...;
    int32_t str_len = ...;
    UChar32 codepoint;
    int32_t offset = 0;

    while (offset < str_len) {
        U16_NEXT(str, offset, str_len, codepoint);
        /* now we have the Unicode character in codepoint */
    }

There is no macro to get a codepoint at a certain code point offset, but
there is a Zend API function that does it.

    inline UChar32 zend_get_codepoint_at(UChar *str, int32_t length, int32_t n);

To retrieve the codepoint at code point offset 3, you would call:

    zend_get_codepoint_at(str, str_len, 3);
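
The lookup itself can be sketched without ICU or the Zend API. The function
below is a hypothetical stand-in, not the real zend_get_codepoint_at(): it
assumes the offset is zero-based and returns -1 when the offset is out of
range, both of which are choices made for this sketch only.

```c
#include <stdint.h>

typedef uint16_t sk_uchar;      /* stand-in for ICU's UChar */
typedef int32_t  sk_uchar32;    /* stand-in for ICU's UChar32 */

#define SK_IS_LEAD(c)  (((c) & 0xFC00) == 0xD800)
#define SK_IS_TRAIL(c) (((c) & 0xFC00) == 0xDC00)
#define SK_IS_PAIR(s, i, len) \
    (SK_IS_LEAD((s)[i]) && (i) + 1 < (len) && SK_IS_TRAIL((s)[(i) + 1]))

/* Returns the code point at zero-based code point offset n, or -1 if the
 * string holds fewer than n + 1 code points. */
static sk_uchar32 sketch_codepoint_at(const sk_uchar *s, int32_t len,
                                      int32_t n)
{
    int32_t i = 0;

    if (n < 0) return -1;

    /* Skip n whole code points; each one is 1 or 2 code units long. */
    while (n > 0 && i < len) {
        i += SK_IS_PAIR(s, i, len) ? 2 : 1;
        n--;
    }
    if (i >= len) return -1;

    if (SK_IS_PAIR(s, i, len)) {
        /* Combine the surrogate pair into one supplementary code point. */
        return 0x10000 + (((sk_uchar32)s[i] - 0xD800) << 10)
                       + ((sk_uchar32)s[i + 1] - 0xDC00);
    }
    return s[i];
}
```

Note that the cost grows with the offset, since every preceding code point
has to be stepped over.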

If you have a UChar32 codepoint and need to put it into a UChar* string,
there is another helper function, zend_codepoint_to_uchar(). It takes
a single UChar32 and converts it to a UChar sequence (1 or 2 UChar's).

    UChar buf[8];
    UChar32 codepoint = 0x101a2;
    int8_t num_uchars;
    num_uchars = zend_codepoint_to_uchar(codepoint, buf);

The return value is the number of resulting UChar's, or 0, which indicates
an invalid codepoint.
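
For reference, the UTF-16 encoding step can be sketched as follows. This is
a hypothetical stand-in for zend_codepoint_to_uchar(), not its actual
source; in particular, treating only values outside 0..0x10FFFF as invalid
is an assumption of this sketch.

```c
#include <stdint.h>

typedef uint16_t sk_uchar;      /* stand-in for ICU's UChar */
typedef int32_t  sk_uchar32;    /* stand-in for ICU's UChar32 */

/* Writes code point c to buf as 1 or 2 UTF-16 code units and returns how
 * many units were written, or 0 if c is outside 0..0x10FFFF. */
static int8_t sketch_codepoint_to_uchar(sk_uchar32 c, sk_uchar *buf)
{
    if (c < 0 || c > 0x10FFFF) {
        return 0;
    }
    if (c < 0x10000) {
        buf[0] = (sk_uchar)c;               /* BMP: a single code unit */
        return 1;
    }
    c -= 0x10000;                           /* 20 bits remain */
    buf[0] = (sk_uchar)(0xD800 + (c >> 10));    /* lead: high 10 bits */
    buf[1] = (sk_uchar)(0xDC00 + (c & 0x3FF));  /* trail: low 10 bits */
    return 2;
}
```

With this scheme U+101A2 becomes the two code units 0xD800 0xDDA2.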


Memory Allocation
-----------------

For ease of use and to reduce possible bugs, there are memory allocation
functions specific to Unicode strings. Please use them at all times when
allocating UChar's.

    eumalloc(size)
    eurealloc(ptr, size)
    eustrndup(s, length)
    eustrdup(s)

    peumalloc(size, persistent)
    peurealloc(ptr, size, persistent)

The size parameter refers to the number of UChar's, not bytes.
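
The point of counting in UChar's is that the multiplication by the unit
size happens in exactly one place. The sketch below shows the idea on top
of plain malloc(); sketch_umalloc() and sketch_ustrndup() are hypothetical
stand-ins and do not use the Zend memory manager as the real functions do.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint16_t sk_uchar;      /* stand-in for ICU's UChar */

/* size is a number of UChar's; the byte count is computed here, once. */
static sk_uchar *sketch_umalloc(size_t size)
{
    return malloc(size * sizeof(sk_uchar));
}

/* Copies length code units from s and appends a terminating 0 unit.
 * Returns NULL if the allocation fails. */
static sk_uchar *sketch_ustrndup(const sk_uchar *s, size_t length)
{
    sk_uchar *copy = sketch_umalloc(length + 1);

    if (copy) {
        memcpy(copy, s, length * sizeof(sk_uchar));
        copy[length] = 0;
    }
    return copy;
}
```

Callers never multiply by sizeof(UChar) themselves, which removes a common
source of buffers that are half the required size.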


Hashes
------

The hash API has been upgraded to work with Unicode and binary strings. All
hash functions that worked with string keys now have an equivalent
zend_u_hash_* API. The zend_u_hash_* functions take the type of the key
string as the second argument.

When UG(unicode) switch is on, the IS_STRING keys are upconverted to
IS_UNICODE and then used in the hash lookup.

There are two new constants that define key types:

    #define HASH_KEY_IS_BINARY 4
    #define HASH_KEY_IS_UNICODE 5

Note that zend_hash_get_current_key_ex() does not have a zend_u_hash_*
version. It returns the key as a char* pointer, which you can cast
appropriately based on the key type.
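
A minimal sketch of such a cast is shown below. The SK_KEY_* constants and
sketch_key_units() are hypothetical names used only for illustration (the
value 5 simply mirrors HASH_KEY_IS_UNICODE above, and the value chosen for
the string tag is arbitrary); uint16_t stands in for UChar.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical key type tags for this sketch. */
enum { SK_KEY_IS_STRING = 1, SK_KEY_IS_UNICODE = 5 };

/* Returns the key length in units of its own type: code units for a
 * Unicode key, bytes otherwise. The key arrives as char* either way. */
static int32_t sketch_key_units(const char *key, int key_type)
{
    if (key_type == SK_KEY_IS_UNICODE) {
        /* The buffer really holds 16-bit units, so cast before reading. */
        const uint16_t *u = (const uint16_t *)key;
        int32_t n = 0;

        while (u[n] != 0) {
            n++;
        }
        return n;
    }
    return (int32_t)strlen(key);
}
```

Calling strlen() on a Unicode key without the cast would stop at the first
zero byte, which for most text is inside the very first code unit.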


Identifiers and Class Entries
-----------------------------

In Unicode mode all the identifiers are Unicode strings. This means that
while various structures such as zend_class_entry, zend_function, etc.
store the identifier name as a char* pointer, it will actually point to a
UChar* string. Be careful when accessing the names of classes, functions,
and such -- always check UG(unicode) before using them.

In addition, zend_class_entry has a u_twin field that points to its Unicode
counterpart in UG(unicode) mode. Use U_CLASS_ENTRY() macro to access the
correct class entry, e.g.:

    ce = U_CLASS_ENTRY(default_exception_ce);


Formatted Output
----------------

Since UTF-16 strings frequently contain zero (NUL) bytes, you cannot simply
use the %s format to print them out. To that end, output functions such as
php_printf(), spprintf(), etc. now have three different formats for use
with Unicode strings:

  %r
    This format treats the corresponding argument as a Unicode string. The
    string is automatically converted to the output encoding. If you wish to
    apply a different converter to the string, use %*r and pass the
    converter before the string argument.

    UChar *class_name = USTR_MAKE("ReflectionClass");
    zend_printf("%r", class_name);

  %R
    This format requires at least two arguments: the first one specifies the
    type of the string to follow (IS_STRING or IS_UNICODE), and the second
    one - the string itself. If the string is of Unicode type, it is
    automatically converted to the output encoding. If you wish to apply
    a different converter to the string, use %*R and pass the converter
    before the string argument.

    zend_throw_exception_ex(U_CLASS_ENTRY(reflection_exception_ptr), 0 TSRMLS_CC,
            "Interface %R does not exist",
            Z_TYPE_P(class_name), Z_UNIVAL_P(class_name));

  %v
    This format takes only one parameter, the string, but the expected
    string type depends on the UG(unicode) value. If the string is of
    Unicode type, it is automatically converted to the output encoding. If
    you wish to apply a different converter to the string, use %*v and pass
    the converter before the string argument.

    zend_error(E_WARNING, "%v::__toString() did not return anything",
            Z_OBJCE_P(object)->name);


Upgrading Functions
===================

Let's take a look at a couple of functions that have been upgraded to
support new string types.

substr()
--------

This function returns part of a string based on offset and length
parameters.

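    void *str;
    int32_t str_len, cp_len;
    zend_uchar str_type;
    /* start offset and length; l is only written to if the optional
       third argument is passed */
    long f, l;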

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "tl|l", &str, &str_len, &str_type, &f, &l) == FAILURE) {
        return;
    }

The first thing we notice is that the incoming string specifier is 't',
which means that we can accept all 3 string types. The 'str' variable is
declared as void*, because it can point to either UChar* or char*.
The actual type of the incoming string is stored in 'str_type' variable.

    if (str_type == IS_UNICODE) {
        cp_len = u_countChar32(str, str_len);
    } else {
        cp_len = str_len;
    }

If the string is a Unicode one, we cannot rely on the str_len value to tell
us the number of characters in it. Instead, we call u_countChar32() to
obtain it.

The next several lines normalize start and length parameters to fit within the
string. Nothing new here. Then we locate the appropriate segment.

    if (str_type == IS_UNICODE) {
        int32_t start = 0, end = 0;
        U16_FWD_N((UChar*)str, end, str_len, f);
        start = end;
        U16_FWD_N((UChar*)str, end, str_len, l);
        RETURN_UNICODEL((UChar*)str + start, end-start, 1);

Since codepoint (character) #n is not necessarily at offset #n in Unicode
strings, we start at the beginning and iterate forward until we have gone
through the required number of codepoints to reach the start of the segment.
Then we save the location in 'start' and continue iterating through the
number of codepoints specified by the length parameter. Once that's done, we
can return the segment as a Unicode string.

    } else {
        RETURN_STRINGL((char*)str + f, l, 1);
    }

For native and binary types, we can return the segment directly.
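
The segment lookup for the Unicode case can be reproduced in a
self-contained form. In the sketch below sketch_fwd_n() is a hypothetical
stand-in that behaves like the description of U16_FWD_N() above (advance by
up to n code points without running past the end), and
sketch_substr_range() is an illustrative helper, not part of substr().

```c
#include <stdint.h>

typedef uint16_t sk_uchar;      /* stand-in for ICU's UChar */

/* Advances *i by up to n code points, never moving past len. */
static void sketch_fwd_n(const sk_uchar *s, int32_t *i, int32_t len,
                         int32_t n)
{
    while (n > 0 && *i < len) {
        if ((s[*i] & 0xFC00) == 0xD800 && *i + 1 < len
                && (s[*i + 1] & 0xFC00) == 0xDC00) {
            *i += 2;            /* surrogate pair */
        } else {
            *i += 1;
        }
        n--;
    }
}

/* Finds the code unit range [*start, *end) that covers l code points
 * beginning at code point offset f. */
static void sketch_substr_range(const sk_uchar *s, int32_t len,
                                int32_t f, int32_t l,
                                int32_t *start, int32_t *end)
{
    int32_t pos = 0;

    sketch_fwd_n(s, &pos, len, f);      /* skip f code points */
    *start = pos;
    sketch_fwd_n(s, &pos, len, l);      /* then cover l code points */
    *end = pos;
}
```

For the code units 'a' 0xD800 0xDDA2 'b' 'c', asking for 2 code points
starting at code point 1 gives the code unit range 1 to 4, which is three
code units long even though it holds only two characters.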


strrev()
--------

Let's look at strrev(), which requires a somewhat more complicated upgrade.
While one of the guidelines for upgrades is that combining sequences are not
really taken into account during processing -- substr() can break them up,
for example -- in this case we actually should be concerned, because
reversing a combining sequence may result in a completely different string.
To illustrate:

      a    (U+0061 LATIN SMALL LETTER A)
      o    (U+006f LATIN SMALL LETTER O)
    + '    (U+0301 COMBINING ACUTE ACCENT)
    + _    (U+0320 COMBINING MINUS SIGN BELOW)
      l    (U+006C LATIN SMALL LETTER L)

Reversing this would result in:

      l    (U+006C LATIN SMALL LETTER L)
    + _    (U+0320 COMBINING MINUS SIGN BELOW)
    + '    (U+0301 COMBINING ACUTE ACCENT)
      o    (U+006f LATIN SMALL LETTER O)
      a    (U+0061 LATIN SMALL LETTER A)

All of a sudden the combining marks are being applied to 'l' instead of 'o'.
To avoid this, we need to treat combining sequences as a unit, by checking
the combining character class of each character with u_getCombiningClass().

strrev() obtains its single argument, a string, and unless the string is of
Unicode type, processes it exactly as before, simply swapping bytes around.
For the Unicode case, the magic looks like this:

    int32_t i, x1, x2;
    UChar32 ch;
    UChar *u_s, *u_n, *u_p;

    /* room for every code unit plus the terminator */
    u_n = eumalloc(Z_USTRLEN_PP(str)+1);
    u_p = u_n;
    u_s = Z_USTRVAL_PP(str);

    i = Z_USTRLEN_PP(str);
    while (i > 0) {
        U16_PREV(u_s, 0, i, ch);
        if (u_getCombiningClass(ch) == 0) {
            /* base character: append it as is */
            u_p += zend_codepoint_to_uchar(ch, u_p);
        } else {
            /* combining mark: back up to its base character, but never
               past the start of the string */
            x2 = i;
            while (i > 0 && u_getCombiningClass(ch) != 0) {
                U16_PREV(u_s, 0, i, ch);
            }
            x1 = i;
            /* append the whole sequence in its original order */
            while (x1 <= x2) {
                U16_NEXT(u_s, x1, Z_USTRLEN_PP(str), ch);
                u_p += zend_codepoint_to_uchar(ch, u_p);
            }
        }
    }
    *u_p = 0;

The basic idea is to walk the string backwards from the end, using
U16_PREV() macro. If the combining class of the current character is 0,
meaning it's a base character and not a combining mark, we simply append it
to the new string. Otherwise, we save the location of the index and do a run
over the characters until we get to the next one with combining class 0. At
that point we append the sequence as is, without reversing, to the new
string. Voila.

Note that the code uses zend_codepoint_to_uchar() to convert full Unicode
characters (UChar32 type) to 1 or 2 UTF-16 code units (UChar type).
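
The same idea can be tried out without ICU using the simplified sketch
below. It is not the strrev() implementation: sketch_is_combining() is a
hypothetical stand-in for u_getCombiningClass() that only recognizes the
Combining Diacritical Marks block (U+0300 to U+036F), and the sequences are
copied as code unit ranges instead of being re-encoded one code point at a
time.

```c
#include <stdint.h>

typedef uint16_t sk_uchar;      /* stand-in for ICU's UChar */

/* Simplified test: treats only U+0300..U+036F as combining marks. */
static int sketch_is_combining(sk_uchar c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Given i > 0, returns the start index of the code point ending at i. */
static int32_t sketch_cp_start(const sk_uchar *s, int32_t i)
{
    i--;
    if (i > 0 && (s[i] & 0xFC00) == 0xDC00
              && (s[i - 1] & 0xFC00) == 0xD800) {
        i--;                    /* step over the lead surrogate too */
    }
    return i;
}

/* Reverses src (len code units) into dst, which must hold len + 1 units.
 * A base character and the combining marks after it move as one block. */
static void sketch_strrev(const sk_uchar *src, int32_t len, sk_uchar *dst)
{
    int32_t end = len, out = 0, start, k;

    while (end > 0) {
        start = sketch_cp_start(src, end);
        /* While we are on a combining mark, keep backing up to its base. */
        while (start > 0 && sketch_is_combining(src[start])) {
            start = sketch_cp_start(src, start);
        }
        /* Append the sequence [start, end) in its original order. */
        for (k = start; k < end; k++) {
            dst[out++] = src[k];
        }
        end = start;
    }
    dst[out] = 0;
}
```

Applied to the example above (a, o, U+0301, U+0320, l), the sketch produces
l, o, U+0301, U+0320, a, so the marks stay attached to 'o'.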


References
==========

[1] http://icu.sourceforge.net/apiref/icu4c/ustring_8h.html#a1

[2] http://icu.sourceforge.net/apiref/icu4c/utf16_8h.html

vim: set et ai tw=76 fo=tron21: