php-src/README.UNICODE

645 lines
23 KiB
Plaintext

Audience
========
This README describes how PHP 6 provides native support for the Unicode
Standard. Readers of this document should be proficient with PHP and have a
basic understanding of Unicode concepts. For more technical details about
PHP 6 design principles and for guidelines about writing Unicode-ready PHP
extensions, refer to README.UNICODE-UPGRADES.
Introduction
============
As successful as PHP has proven to be over the years, its support for
multilingual and multinational environments has languished. PHP can no
longer afford to remain outside the overall movement towards the Unicode
standard. Although recent updates involving the mbstring extension have
enabled easier multibyte data processing, this does not constitute native
Unicode support.
Since the full implementation of the Unicode Standard is very involved, our
approach is to speed up implementation by using the well-tested,
full-featured, and freely available ICU (International Components for
Unicode) library.
General Remarks
===============
International Components for Unicode
------------------------------------
ICU (International Components for Unicode is a mature, widely used set of
C/C++ and Java libraries for Unicode support, software internationalization
and globalization. It provides:
- Encoding conversions
- Collations
- Unicode text processing
- and much more
When building PHP 6, Unicode support is always enabled. The only
configuration option during development should be the location of the ICU
headers and libraries.
--with-icu-dir=<dir>
where <dir> specifies the location of ICU header and library files. If you do
not specify this option, PHP attempts to find ICU under /usr and /usr/local.
NOTE: ICU is not bundled with PHP 6 yet. To download the distribution, visit
http://icu.sourceforge.net. PHP requires ICU version 3.4 or higher.
Backwards Compatibility
-----------------------
Our paramount concern for providing Unicode support is backwards compatibility.
Because PHP is used on so many sites, existing data types and functions must
work as they always have. However, although PHP's interfaces must remain
backwards-compatible, the speed of certain operations might be affected due to
internal implementation changes.
Encoding Names
--------------
All the encoding settings discussed in this document can accept any valid
encoding name supported by ICU. For a full list of encodings, refer to the ICU
online documentation.
NOTE: References to "Unicode" in this document generally mean the UTF-16
character encoding, unless explicitly stated otherwise.
Unicode Semantics Switch
========================
Because many applications do not require Unicode, PHP 6 provides a server-wide
INI setting to enable Unicode support:
unicode.semantics = On/Off
This switch is off by default. If your applications do not require native
Unicode support, you may leave this switch off, and continue to use Unicode
strings only when you need to.
However, if your application is ready to fully support Unicode, you should
turn this switch on. This activates various Unicode support mechanisms,
including:
* All string literals become Unicode
* All variables received from HTTP requests become Unicode
* PHP identifiers may use Unicode characters
More fundamentally, your PHP environment is now a Unicode environment. Strings
inside PHP are Unicode, and the system is responsible for converting non-Unicode
strings on PHP's periphery (for example, in HTTP input and output, streams, and
filesystem operations). With unicode.semantics on, you must specify binary
strings explicitly. PHP makes no assumptions about the content of a binary
string, so your application must handle all binary string appropriately.
Conversely, if unicode.semantics is off, PHP behaves as it did in the past.
String literals do not become Unicode, and files are binary strings for
backwards compatibility. You can always create Unicode strings programmatically,
and all functions and operators support Unicode strings transparently.
Fallback Encoding
=================
The fallback encoding provides a default value for all other unicode.*_encoding
INI settings. If you do not set a particular unicode.*_encoding setting, PHP
uses the fallback encoding. If you do not specify a fallback encoding, PHP uses
UTF-8.
unicode.fallback_encoding = "iso-8859-1"
Runtime Encoding
================
The runtime encoding specifies the encoding PHP uses for converting binary
strings within the PHP engine itself.
unicode.runtime_encoding = "iso-8859-1"
This setting has no effect on I/O-related operations such as writing to
standard out, reading from the filesystem, or decoding HTTP input variables.
PHP enables you to explicitly convert strings using casting:
* (binary) -- casts to binary string type
* (unicode) -- casts to Unicode string type
* (string) -- casts to Unicode string type if unicode.semantics is on,
to binary otherwise
For example, if unicode.runtime_encoding is iso-8859-1, and $uni is a unicode
string, then
$str = (binary)$uni
creates a binary string $str in the ISO-8859-1 encoding.
Implicit conversions include concatenation, comparison, and parameter passing.
For better precision, PHP attempts to convert strings to Unicode before
performing these sorts of operations. For example, if we concatenate our binary
string $str with a unicode literal, PHP converts $str to Unicode first, using
the encoding specified by unicode.runtime_encoding.
Output Encoding
===============
PHP automatically converts output for commands that write to the standard
output stream, such as 'print' and 'echo'.
unicode.output_encoding = "utf-8"
However, PHP does not convert binary strings. When writing to files or external
resources, you must rely on stream encoding features or manually encode the data
using functions provided by the unicode extension.
The existing default_charset INI setting is DEPRECATED in favor of
unicode.output_setting. Previously, default_charset only specified the charset
portion of the Content-Type MIME header. Now default_charset only takes effect
when unicode.semantics is off, and it does not affect the actual transcoding of
the output stream. Setting unicode.output_encoding causes PHP to add the
'charset' portion to the Content-Type header, overriding any value set for
default_charset.
HTTP Input Encoding
===================
The HTTP input encoding specifies the encoding of variables received via
HTTP, such as the contents of the $_GET and _$POST arrays.
This functionality is currently under development. For a discussion of the
approach that the PHP 6 team is taking, refer to:
http://marc.theaimsgroup.com/?t=116613047300005&r=1&w=2
Filesystem Encoding
===================
The filesystem encoding specifies the encoding of file and directory names
on the filesystem.
unicode.filename_encoding = "utf-8"
Filesystem-related functions such as opendir() perform this conversion when
accepting and returning file names. You should set the filename encoding to
the encoding used by your filesystem.
Script Encoding
===============
You may write PHP scripts in any encoding supported by ICU. To specify the
script encoding site-wide, use the INI setting:
unicode.script_encoding = utf-8
If you cannot change the encoding system wide, you can use a pragma to
override the INI setting in a local script:
<?php declare(encoding = 'Shift-JIS'); ?>
The pragma setting must be the first statement in the script. It only affects
the script in which it occurs, and does not propagate to any included files.
INI Files
=========
If unicode.semantics is on, INI files are presumed to contain UTF-8 encoded
keys and values. If unicode.semantics is off, the data is taken as-is,
similar to PHP 5. No validation occurs during parsing. Instead invalid UTF-8
sequences are caught during access by ini_*() functions.
Stream I/O
==========
PHP has a streams-based I/O system for generalized filesystem access,
networking, data compression, and other operations. Since the data on the
other end of the stream can be in any encoding, you need to think about
data conversion.
Okay, this needs to be clarified. By "default", streams are actually
opened in binary mode. You have to specify 't' flag or use FILE_TEXT in
order to open it in text mode, where conversions apply. And for the text
mode streams, the default stream encoding is UTF-8 indeed.
By default, PHP opens streams in binary mode. To open a file in text mode,
you must use the 't' flag (or the FILE_TEXT parameter -- see below). The
default encoding for streams in text mode is UTF-8. This means that if
'file.txt' is a UTF-8 text file, this code snippet:
$fp = fopen('file.txt', 'rt');
$str = fread($fp, 100)
returns 100 Unicode characters, while:
$fp = fopen('file.txt', 'wt');
$fwrite($fp, $uni)
writes to a UTF-8 text file.
If you mainly work with files in an encoding other than UTF-8, you can
change the default context encoding setting:
stream_default_encoding('Shift-JIS');
$data = file_get_contents('file.txt', FILE_TEXT);
// work on $data
file_put_contents('file.txt', $data, FILE_TEXT);
The file_get_contents() and file_put_contents() functions now accept an
additional parameter, FILE_TEXT. If you provide FILE_TEXT for
file_get_contents(), PHP returns a Unicode string. Without FILE_TEXT, PHP
returns a binary string (which would be appropriate for true binary data, such
as an image file). When writing a Unicode string with file_put_contents(), you
must supply the FILE_TEXT parameter, or PHP generates a warning.
If you need to work with multiple encodings, you can create custom contexts
using stream_context_create() and then pass in the custom context as an
additional parameter. For example:
$ctx = stream_context_create(NULL, array('encoding' => 'big5'));
$data = file_get_contents('file.txt', FILE_TEXT, $ctx);
// work on $data
file_put_contents('file.txt', $data, FILE_TEXT, $ctx);
Conversion Semantics and Error Handling
=======================================
PHP can convert strings explicitly (casting) and implicitly (concatenation,
comparison, and parameter passing. For example, when concatenating a Unicode
string and a binary string, PHP converts the binary string to Unicode for better
precision.
However, not all characters can be converted between Unicode and legacy
encodings. The first possibility is that a string contains corrupt data or
an illegal byte sequence. In this case, the converter simply stops with
a message that resembles:
Warning: Could not convert binary string to Unicode string
(converter UTF-8 failed on bytes (0xE9) at offset 2)
Conversely, if a similar error occurs when attempting to convert Unicode to
a legacy string, the converter generates a message that resembles:
Warning: Could not convert Unicode string to binary string
(converter ISO-8859-1 failed on character {U+DC00} at offset 2)
To customize this behavior, refer to "Creating a Custom Error Handler" below.
The second possibility is that a Unicode character simply cannot be represented
in the legacy encoding. By default, when downconverting from Unicode, the
converter substitutes any missing sequences with the appropriate substitution
sequence for that codepage, such as 0x1A (Control-Z) in ISO-8859-1. When
upconverting to Unicode, the converter replaces any byte sequence that has no
Unicode equivalent with the Unicode substitution character (U+FFFD).
You can customize the conversion error behavior to:
- stop the conversion and return an empty string
- skip any invalid characters
- substibute invalid characters with a custom substitution character
- escape the invalid character in various formats
To control the global conversion error settings, use the functions:
unicode_set_error_mode(int direction, int mode)
unicode_set_subst_char(unicode char)
where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these
constants:
U_CONV_ERROR_STOP
U_CONV_ERROR_SKIP
U_CONV_ERROR_SUBST
U_CONV_ERROR_ESCAPE_UNICODE
U_CONV_ERROR_ESCAPE_ICU
U_CONV_ERROR_ESCAPE_JAVA
U_CONV_ERROR_ESCAPE_XML_DEC
U_CONV_ERROR_ESCAPE_XML_HEX
As an example, with a runtime encoding of ISO-8859-1, the conversion:
$str = (binary)"< \u30AB >";
results in:
MODE RESULT
--------------------------------------
stop ""
skip "< >"
substitute "< ? >"
escape (Unicode) "< {U+30AB} >"
escape (ICU) "< %U30AB >"
escape (Java) "< \u30AB >"
escape (XML decimal) "< &#12459; >"
escape (XML hex) "< &#x30AB; >"
With a runtime encoding of UTF-8, the conversion of the (illegal) sequence:
$str = (unicode)b"< \xe9\xfe >";
results in:
MODE RESULT
--------------------------------------
stop ""
skip ""
substitute ""
escape (Unicode) "< %XE9%XFE >"
escape (ICU) "< %XE9%XFE >"
escape (Java) "< \xE9\xFE >"
escape (XML decimal) "< &#233;&#254; >"
escape (XML hex) "< &#xE9;&#xFE; >"
The substitution character can be set only for FROM_UNICODE direction and has to
exist in the target character set. The default substitution character is (?).
NOTE: Casting is just a shortcut for using unicode.runtime_encoding. To convert
using an alternative encoding, use the unicode_encode() and unicode_decode()
functions. For example,
$str = unicode_encode($uni, 'koi8-r', U_CONV_ERROR_SUBST);
results in a binary KOI8-R encoded string.
Creating a Custom Error Handler
-------------------------------
If an error occurs during the conversion, PHP outputs a warning describing the
problem. Instead of this default behavior, PHP can invoke a user-provided error
handler, similar to how the current user-defined error handler works. To set
the custom conversion error handler, call:
mixed unicode_set_error_handler(callback error_handler)
The function returns the previously defined custom error handler. If no error
handler was defined, or if an error occurs when returning the handler, this
function returns NULL.
When the custom handler is set, the standard error handler is bypassed. It is
the responsibility of the custom handler to output or log any messages, raise
exceptions, or die(), if necessary. However, if the custom error handler returns
FALSE, the standard handler will be invoked afterwards.
The user function specified as the error_handler must accept five parameters:
mixed error_handler($direction, $encoding, $char_or_byte, $offset,
$message)
where:
$direction - the direction of conversion, FROM_UNICODE/TO_UNICODE
$encoding - the name of the encoding to/from which the conversion
was attempted
$char_or_byte - either Unicode character or byte sequence (depending
on direction) which caused the error
$offset - the offset of the failed character/byte sequence in
the source string
$message - the error message describing the problem
NOTE: If the error mode set by unicode_set_error_mode() is substitute,
skip, or escape, the handler won't be called, since these are non-error
causing operations. To always invoke your handler, set the error mode to
U_CONV_ERROR_STOP.
Unicode String Type
===================
The Unicode string type (IS_UNICODE) is supposed to contain text data encoded in
UTF-16. This is the main string type in PHP when Unicode semantics switch is
turned on. Unicode strings can exist when the switch is off, but they have to be
produced programmatically via calls to functions that return Unicode types.
Binary String Type
==================
Binary string type (IS_STRING) serves two purposes: backwards compatibility and
representing non-Unicode strings and binary data. When Unicode semantics switch
is off, it is used for all strings in PHP, same in previous versions. When the
switch is on, this type will be used to store text in other encodings as well as
true binary data such as images, PDFs, etc.
Printing binary data to the standard output passes it through as-is, independent
of the output encoding.
For examples of specifying binary string literals, refer to the section
"Language Modfications".
Language Modifications
======================
If a Unicode switch is turned on, PHP string literals -- single-quoted,
double-quoted, and heredocs -- become Unicode strings (IS_UNICODE type). String
literals support all the same escape sequences and variable interpolations as
before, plus several new escape sequences.
PHP interprets the contents of strings as follows:
- all non-escaped characters are interpreted as a corresponding Unicode
codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) =>
U+0061, Shift-JIS (0x92 0x86) => U+4E2D
- existing PHP escape sequences are also interpreted as Unicode codepoints,
including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020
- two new escape sequences, \uXXXX and \UXXXXXX, are interpreted as a 4 or
6-hex Unicode codepoint value, e.g. \u0221 => U+0221, \U010410 =>
U+10410. (Having two sequences avoids the ambiguity of \u020608 --
is that supposed to be U+0206 followed by "08", or U+020608 ?)
- a new escape sequence allows specifying a character by its full
Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20
PHP allows variable interpolation inside the double-quoted and heredoc strings.
However, the parser separates the string into literal and variable chunks during
compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that PHP
can handle literal chunks in the normal way as far as Unicode support is
concerned.
Since all string literals become Unicode by default, PHP 6 introduces new syntax
for creating byte-oriented or binary strings. Prefixing a string literal with
the letter 'b' creates a binary string:
$var = b'abc\001';
$var = b"abc\001";
$var = b<<<EOD
abc\001
EOD;
The content of a binary string is the literal byte sequence inside the
delimiters, which depends on the script encoding (unicode.script_encoding).
Binary string literals support the same escape sequences as PHP 5 strings. If
the Unicode switch is turned off, then the binary string literals generate the
normal string (IS_STRING) type internally without any effect on the application.
The string operators now accomodate the new IS_UNICODE and IS_BINARY types:
- The concatenation operator (.) and concatenation assignment operator (.=)
automatically coerce the IS_STRING type to the more precise IS_UNICODE if
the operands are of different string types.
- The string indexing operator [] now accommodates IS_UNICODE type strings
and extracts the specified character. To support supplementary characters,
the index specifies a code point, not a byte or a code unit.
- Bitwise operators and increment/decrement operators do not work on
Unicode strings. They do work on binary strings.
- Two new casting operators are introduced, (unicode) and (binary). The
(string) operator casts to Unicode type if the Unicode semantics switch is
on, and to binary type otherwise.
- The comparison operators compare Unicode strings in binary code point
order. They also coerce strings to Unicode if the strings are of different
types.
- The arithmetic operators use the same semantics as today for converting
strings to numbers. A Unicode string is considered numeric if it
represents a long or a double number in the en_US_POSIX locale.
Unicode Support in Existing Functions
=====================================
All functions in the PHP default distribution are undergoing analysis to
determine which functions need to be upgraded for native Unicode support.
You can track progress here:
http://www.php.net/~scoates/unicode/render_func_data.php
Key extensions that are fully converted include:
* curl
* dom
* json
* mysql
* mysqli
* oci8
* pcre
* reflection
* simplexml
* soap
* sqlite
* xml
* xmlreader/xmlwriter
* xsl
* zlib
NOTE: Unsafe functions might still work, since PHP performs Unicode conversions
at runtime. However, unsafe functions might not work correctly with multibyte
binary strings, or Unicode characters that are not representable in the
specified unicode.runtime_encoding.
Identifiers
===========
Since scripts may be written in various encodings, we do not restrict
identifiers to be ASCII-only. PHP allows any valid identifier based
on the Unicode Standard Annex #31.
Numbers
=======
Unlike identifiers, numbers must consist only of ASCII digits,.and are
restricted to the en_US_POSIX or C locale. In other words, numbers have no
thousands separator, and the fractional separator is (.) "full stop". Numeric
strings adhere to the same rules, so "10,3" is not interpreted as a number even
if the current locale's fractional separator is a comma.
TextIterators
=============
Instead of using the offset operator [] to access characters in a linear
fashion, use a TextIterator instead. TextIterator is very fast and enables you
to iterate over code points, combining sequences, characters, words, lines, and
sentences, both forward and backward. For example:
$text = "nai\u308ve";
foreach (new TextIterator($text) as $u) {
var_inspect($u)
}
lists six code points, including the umlaut (U+0308) as a separate code point.
Instantiating the TextIterator to iterate over characters,
$text = "nai\u308ve";
foreach (new TextIterator($text, TextIterator::CHARACTER) as $u) {
var_inspect($u)
}
lists five characters, including an "i" with an umlaut as a single character.
Locales
=======
Unicode support in PHP relies exclusively on ICU locales, NOT the POSIX locales
installed on the system. You may access the default ICU locale using:
locale_set_default()
locale_get_default()
ICU locale IDs have a somewhat different format from POSIX locale IDs. The ICU
syntax is:
<language>[_<script>]_<country>[_<variant>][@<keywords>]
For example, sr_Latn_YU_REVISED@currency=USD is Serbian (Latin, Yugoslavia,
Revised Orthography, Currency=US Dollar).
Do not use the deprecated setlocale() function. This function interacts with the
POSIX locale. If Unicode semantics are on, using setlocale() generates
a deprecation warning.
Document TODO
==========================================
- Final review.
- Fix the HTTP Input Encoding section, that's obsolete now.
References
==========
Unicode
http://www.unicode.org
Unicode Glossary
http://www.unicode.org/glossary/
UTF-8
http://www.utf-8.com/
UTF-16
http://www.ietf.org/rfc/rfc2781.txt
ICU Homepage
http://www.ibm.com/software/globalization/icu/
ICU User Guide and API Reference
http://icu.sourceforge.net/
Unicode Annex #31
http://www.unicode.org/reports/tr31/
PHP Parameter Parsing API
http://www.php.net/manual/en/zend.arguments.retrieval.php
Authors
=======
Andrei Zmievski <andrei@gravitonic.com>
Evan Goer <goer@yahoo-inc.com>
vim: set et tw=80 :