Introduction
============

As successful as PHP has proven to be in the past several years, it is still the only remaining member of the P-trinity of scripting languages - Perl and Python being the other two - that remains blithely ignorant of the multilingual and multinational environment around it. The software development community has been moving towards the Unicode Standard for some time now, and PHP can no longer afford to be outside of this movement. Surely, some steps have been taken recently to allow for easier processing of multibyte data with the mbstring extension, but it is not enabled in PHP by default and is not as intuitive or transparent as it could be.

The basic goal of this document is to describe how PHP 6 will support the Unicode Standard natively. Since the full implementation of the Unicode Standard is very involved, the idea is to use the already existing, well-tested, full-featured, and freely available ICU (International Components for Unicode) library. This will allow us to concentrate on the details of PHP integration and speed up the implementation.

General Remarks
===============

Backwards Compatibility
-----------------------

Throughout the design and implementation of Unicode support, backwards compatibility must be of paramount concern. PHP is used on an enormous number of sites and the upgrade to Unicode-enabled PHP has to be transparent. This means that the existing data types and functions must work as they have always done. However, the speed of certain operations may be affected, due to increased complexity of the code overall.

Unicode Encoding
----------------

The initial version will not support the Byte Order Mark (BOM). Characters are expected to be composed (Normalization Form C). Later versions will support the BOM as well as decomposed and other characters.

Implementation Approach
=======================

The implementation is done in phases.
This allows for more basic and low-level implementation issues to be ironed out and tested before proceeding to more advanced topics.

Legend:
  - TODO
  + finished
  * in progress

Phase I
-------
  + Basic Unicode string support, including instantiation, concatenation, indexing
  + Simple output of Unicode strings via 'print' and 'echo' statements with appropriate output encoding conversion
  + Conversion of Unicode strings to/from various encodings via encode() and decode() functions
  + Determining length of Unicode strings via strlen() function, some simple string functions ported (substr).

Phase II
--------
  * HTTP input request decoding
  + Fixing remaining string-aware operators (assignment to {}, etc)
  + Comparison (collation) of Unicode strings with built-in operators
  * Support for Unicode and binary strings in PHP streams
  + Support for Unicode identifiers
  * Configurable handling of conversion failures
  + \C{} escape sequence in strings

Phase III
---------
  * Exposing ICU API
  - Porting all remaining functions to support Unicode and/or binary strings

Encoding Names
==============

All the encoding settings discussed in this document accept any valid encoding name supported by ICU. See ICU online documentation for the full list of encodings.

Internal Encoding
=================

UTF-16 is the internal encoding used for Unicode strings. UTF-16 consumes two bytes for any Unicode character in the Basic Multilingual Plane, which is where most of the current world's languages are represented. While being less memory efficient for basic ASCII text it simplifies the processing and makes interfacing with ICU easier, since ICU uses UTF-16 for its internal processing as well.

Fallback Encoding
=================

This setting specifies the "fallback" encoding for all the other ones. So if a specific encoding setting is not set, PHP defaults it to the fallback encoding. If the fallback_encoding is not specified either, it is set to UTF-8.
  fallback_encoding = "iso-8859-1"

Runtime Encoding
================

Currently PHP neither specifies nor cares what the encoding of its strings is. However, the Unicode implementation needs to know what this encoding is for several reasons, including type coercion and encoding conversion for strings generated at runtime via function calls and casting. This setting specifies this runtime encoding.

  runtime_encoding = "iso-8859-1"

Output Encoding
===============

Automatic output encoding conversion is supported on the standard output stream. Therefore, commands such as 'print' and 'echo' automatically convert their arguments to the specified encoding. No automatic output encoding is performed for anything else. Therefore, when writing to files or external resources, the developer has to manually encode the data using functions provided by the unicode extension or rely on stream encoding filters. The unicode extension provides the necessary stream filters to make developers' lives easier.

The existing default_charset setting so far has been used only for specifying the charset portion of the Content-Type MIME header. For several reasons, this setting is deprecated. Now it is only used when the Unicode semantics switch is disabled and does not affect the actual transcoding of the output stream. The output encoding setting takes precedence in all other cases.

  output_encoding = "utf-8"

HTTP Input Encoding
===================

To make accessing HTTP input variables easier, PHP automatically decodes HTTP GET and POST requests based on the specified encoding. If the HTTP request contains the encoding specification in the headers, then it will be used instead of this setting. If the HTTP input encoding setting is not specified, PHP falls back onto the output encoding setting, because modern browsers are supposed to return the data in the same encoding as they received it in.
If the actual encoding is passed in the request itself or is found elsewhere, then the application can ask PHP to re-decode the raw input explicitly.

  http_input_encoding = "utf-8"

Script Encoding
===============

PHP scripts may be written in any encoding supported by ICU. The encoding of the scripts can be specified site-wide via an INI directive script_encoding, or with a 'declare' pragma at the beginning of the script. The reason for pragma is that an application written in Shift-JIS, for example, should be executable on a system where the INI directive cannot be changed by the application itself. The pragma setting is valid only for the script it occurs in, and does not propagate to the included files.

  pragma:

  INI setting:
    script_encoding = utf-8

Conversion Semantics
====================

Not all characters can be converted between Unicode and legacy encodings. Normally, when downconverting from Unicode, the default behavior of ICU converters is to substitute the missing sequence with the appropriate substitution sequence for that codepage, such as 0x1A (Control-Z) in ISO-8859-1. When upconverting to Unicode, if an encoding has a character which cannot be converted into Unicode, that sequence is replaced by the Unicode substitution character (U+FFFD).

The conversion failure behavior can be customized:

  - perform substitution as described above with a custom substitution character
  - skip any invalid characters
  - stop the conversion, raise an error, and return partial conversion results
  - replace the missing character with a diagnostic character and continue, e.g. [U+hhhh]

There are two INI settings that control this.

  unicode.from_error_mode = U_INVALID_SUBSTITUTE
                            U_INVALID_SKIP
                            U_INVALID_STOP
                            U_INVALID_ESCAPE

  unicode.from_error_subst_char = a2

The second setting is supposed to contain the Unicode code point value for the substitution character. This value has to be representable in the target encoding.
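The four failure behaviors listed above can be illustrated with a minimal sketch in plain C. It is not the PHP or ICU implementation: the mode names, the to_latin1() function, and its signature are hypothetical, and the target encoding is fixed to ISO-8859-1 only because its mapping (U+0000..U+00FF to the same byte value) is trivial to write down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mode names mirroring the four behaviors in the list above. */
typedef enum { MODE_SUBSTITUTE, MODE_SKIP, MODE_STOP, MODE_ESCAPE } err_mode;

/*
 * Hypothetical helper: converts n code points to ISO-8859-1 bytes.
 * Writes a NUL-terminated result into out (capacity cap, at least 1) and
 * returns the number of bytes written, not counting the NUL. *failed is set
 * to 1 when at least one code point had no ISO-8859-1 representation.
 * Output that does not fit into cap is silently truncated.
 */
static size_t to_latin1(const uint32_t *cps, size_t n,
                        unsigned char *out, size_t cap,
                        err_mode mode, unsigned char subst, int *failed)
{
    size_t w = 0;

    assert(out != NULL && cap > 0 && failed != NULL);
    *failed = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t c = cps[i];

        if (c <= 0xFF) {                      /* directly representable */
            if (w + 1 >= cap) break;
            out[w++] = (unsigned char)c;
            continue;
        }

        *failed = 1;
        if (mode == MODE_SKIP) continue;      /* drop the character */
        if (mode == MODE_STOP) break;         /* keep the partial result */

        if (mode == MODE_SUBSTITUTE) {
            if (w + 1 >= cap) break;
            out[w++] = subst;                 /* e.g. 0x1A for ISO-8859-1 */
        } else {                              /* MODE_ESCAPE: [U+hhhh] */
            char esc[16];
            int k = snprintf(esc, sizeof esc, "[U+%04X]", (unsigned)c);
            if (w + (size_t)k >= cap) break;
            memcpy(out + w, esc, (size_t)k);
            w += (size_t)k;
        }
    }

    out[w] = '\0';
    return w;
}
```

Given the code points 'a', U+4E2D, 'b', the sketch yields "a", 0x1A, "b" in substitute mode, "ab" in skip mode, "a" in stop mode, and "a[U+4E2D]b" in escape mode. In every mode the caller receives whatever was converted so far plus a failure flag.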
Note that PHP always tries to convert as much of the data as possible and returns the converted results even if an error happens.

Unicode Switch
==============

Obviously, PHP cannot simply impose new Unicode support on everyone. There are many applications that do not care about Unicode and do not need it. Consequently, there is a switch that enables certain fundamental language changes related to Unicode. This switch is available as a site-wide or per-dir INI setting only.

Note that having the switch turned off does not imply that PHP is unaware of Unicode at all and that no Unicode string can exist. It only affects certain aspects of the language, and Unicode strings can always be created programmatically.

  unicode_semantics = On

[TODO: list areas that are affected by this switch]

Unicode String Type
===================

The Unicode string type (IS_UNICODE) is supposed to contain text data encoded in UTF-16 format. It is the main string type in PHP when the Unicode semantics switch is turned on. Unicode strings can exist when the switch is off, but they have to be produced programmatically, via calls to functions that return the Unicode type.

The operational unit when working with Unicode strings is a code point, not a code unit or byte. One code point in UTF-16 may be comprised of 1 or 2 code units, each of which is a 16-bit word. Working on the code point level is necessary because doing otherwise would mean offloading the processing of surrogate pairs onto PHP users, and that is less than desirable.

The repercussion is that one cannot expect code point N to be at offset N in the Unicode string. Instead, one has to iterate from the beginning of the string using the U16_FWD() macro until the desired codepoint is reached.

The codepoint access is one of the primary areas targeted for optimization.
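The linear walk described above can be sketched in plain C without ICU. This is only an illustration of the idea behind the U16_FWD() family of macros, not their implementation; the helpers is_pair(), cp_offset(), and cp_at() are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True when s[i] and s[i + 1] form a lead/trail surrogate pair. */
static int is_pair(const uint16_t *s, size_t len, size_t i)
{
    return i + 1 < len &&
           s[i] >= 0xD800 && s[i] <= 0xDBFF &&
           s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
}

/*
 * Returns the code-unit offset of code point number n (0-based), or -1 if
 * the string holds fewer than n + 1 code points. The cost is linear in n,
 * because every preceding code point has to be stepped over.
 */
static long cp_offset(const uint16_t *s, size_t len, size_t n)
{
    size_t i = 0;

    while (i < len) {
        if (n == 0) {
            return (long)i;
        }
        i += is_pair(s, len, i) ? 2 : 1;   /* skip one whole code point */
        n--;
    }
    return -1;
}

/* Decodes the code point that starts at code-unit offset i. */
static uint32_t cp_at(const uint16_t *s, size_t len, size_t i)
{
    assert(i < len);
    if (is_pair(s, len, i)) {
        return 0x10000u + (((uint32_t)s[i] - 0xD800u) << 10)
                        + ((uint32_t)s[i + 1] - 0xDC00u);
    }
    return s[i];   /* BMP character or unpaired surrogate */
}
```

For the four code units 0x0061, 0xD801, 0xDC10, 0x0062 (the text "a", U+10410, "b"), code point 2 sits at code-unit offset 3 rather than 2, which is exactly the mismatch between code point index and offset that the section describes.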
Native Encoding String Type
===========================

The native encoding string type (IS_STRING) serves two purposes: backwards compatibility when the Unicode semantics switch is off, and representing strings in non-Unicode encodings (native encodings) when it is on. It is processed on the byte level.

Binary String Type
==================

The binary string type (IS_BINARY) can be used for storing images, PDFs, or other binary data intended to be processed on a byte level and that cannot be interpreted as text.

The binary data type does not participate in implicit conversions, and cannot be explicitly upconverted to other string types, although the inverse is possible.

Printing binary data to the standard output passes it through as-is, independent of the output encoding.

When the Unicode semantics switch is off, binary string literals and binary strings returned by functions actually resolve to the IS_STRING type, for backwards compatibility reasons.

Zval Structure Changes
======================

PHP is a type-agnostic language. Its data values are encapsulated in a zval (Zend value) structure that can change as necessary to accommodate various types.

  struct _zval_struct {
      /* Variable information */
      union {
          long lval;                /* long value */
          double dval;              /* double value */
          struct {
              char *val;
              int len;
          } str;                    /* string value */
          HashTable *ht;            /* hash table value */
          zend_object_value obj;    /* object value */
      } value;
      zend_uint refcount;
      zend_uchar type;              /* active type */
      zend_uchar is_ref;
  };

The type field determines what is stored in the union, IS_STRING being the only data type pertinent to this discussion. In the current version, the strings are binary-safe, but, for all intents and purposes, are assumed to be comprised of 8-bit characters. It is possible to treat the string value as an opaque type containing arbitrary binary data, and in fact that is how the mbstring extension uses it, in order to store multibyte strings.
However, many extensions and the Zend engine itself manipulate the string value directly without regard to its internals. Needless to say, this can lead to problems.

For the IS_UNICODE type, we need to add another structure to the union:

  union {
      ....
      struct {
          UChar *val;               /* Unicode string value */
          int32_t len;              /* number of UChar's */
      } ustr;
      ....
  } value;

This cleanly separates the two types of strings and helps preserve backwards compatibility. For the IS_BINARY type, we can re-use the str structure.

Language Modifications
======================

If the Unicode switch is turned on, PHP string literals - single-quoted, double-quoted, and heredocs - become Unicode strings (IS_UNICODE type). They support all the same escape sequences and variable interpolations as previously, with the addition of some new escape sequences.

The contents of the strings are interpreted as follows:

  - all non-escaped characters are interpreted as a corresponding Unicode codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) => U+0061, Shift-JIS (0x92 0x86) => U+4E2D

  - existing PHP escape sequences are also interpreted as Unicode codepoints, including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020

  - two new escape sequences, \uXXXX and \UXXXXXX, are interpreted as a 4- or 6-digit hexadecimal Unicode codepoint value, e.g. \u0221 => U+0221, \U010410 => U+10410

  - a new escape sequence allows specifying a character by its full Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20

The single-quoted string is more restrictive than the other two types: so far the only escape sequence allowed inside of it was \', which specifies a literal single quote. However, single-quoted strings now support the new Unicode character escape sequences as well.

PHP allows variable interpolation inside the double-quoted and heredoc strings. However, the parser separates the string into literal and variable chunks during compilation, e.g. "abc $var def" -> "abc" . $var . "def".
This means that the literal chunks can be handled in the normal way as far as Unicode support is concerned.

Since all string literals become Unicode by default, one loses the ability to specify byte-oriented or binary strings. In order to create binary string literals, a new syntax is necessary: prefixing a string literal with the letter 'b' creates a binary string.

  $var = b'abc\001';
  $var = b"abc\001";
  $var = b<<

  IS_UNICODE uses runtime-encoding
  IS_UNICODE -> IS_BINARY converts to runtime encoding first, then to binary

Implementation Details That Need Expanding
==========================================

  - Streams support for Unicode - What stream filters will we be providing?
  - Conversion errors behavior - Need to define the default.
  - INI files encoding - Do we support BOMs?
  - There are likely to be other issues which are missing from this document

Build System
============

Unicode support in PHP is always enabled. The only configuration option during development should be the location of the ICU headers and libraries.

  --with-icu-dir= parameter specifies the location of ICU header and library files.

After the initial development we have to repackage the ICU library for our needs and bundle it with PHP.

Document History
================

  0.5: Updated per latest discussions. Removed tentative language in several places, since we have decided on everything described here already. Clarified details according to Phase II progress.

  0.4: Updated to include all the latest discussions. Updated development phases.

  0.3: Updated to include all the latest discussions.

  0.2: Updated Phase I design proposal per discussion on unicode@php.net. Modified Internal Encoding section to contain only UTF-16 info. Expanded Script Encoding section. Added Binary Data Type section. Amended Language Modifications section to describe string literals behavior. Amended Build System section.
  0.1: Phase I design proposal

References
==========

  Unicode
    http://www.unicode.org

  Unicode Glossary
    http://www.unicode.org/glossary/

  UTF-8
    http://www.utf-8.com/

  UTF-16
    http://www.ietf.org/rfc/rfc2781.txt

  ICU Homepage
    http://www.ibm.com/software/globalization/icu/

  ICU User Guide and API Reference
    http://icu.sourceforge.net/

  Unicode Annex #31
    http://www.unicode.org/reports/tr31/

  PHP Parameter Parsing API
    http://www.php.net/manual/en/zend.arguments.retrieval.php

Authors
=======

  Andrei Zmievski

vim: set et :