php-src/NEWS
Alex Dowad 0e7160b836 Implement mb_detect_encoding using fast text conversion filters
Regarding the optional 3rd `strict` argument to mb_detect_encoding,
the documentation states:

  Controls the behaviour when string is not valid in any of the listed encodings.
  If strict is set to false, the closest matching encoding will be returned;
  if strict is set to true, false will be returned.

(Ref: https://www.php.net/manual/en/function.mb-detect-encoding.php)

Because of bugs in the implementation, mb_detect_encoding did not always
behave according to this description when `strict` was false.
For example:

  <?php
  var_export(mb_detect_encoding("\xc0\x00", "UTF-8", false));
  // Before this commit, prints: false
  // After this commit, prints: 'UTF-8'

Because `strict` is false in the above example, mb_detect_encoding
should return the 'closest matching encoding', which is UTF-8, since
that is the only candidate encoding. (Incidentally, this example shows
that using mb_detect_encoding with a single candidate encoding in
non-strict mode is useless.)

The new implementation fixes this bug. It also fixes another problem
with the old implementation as regards non-strict detection mode:

The old implementation would stop processing of the input string using
a particular candidate encoding as soon as it saw an error in that
encoding, even in non-strict mode. This means that it could not really
detect the 'closest matching encoding'; rather, what it would return
in non-strict mode was 'the encoding in which the first decoding error
is furthest from the beginning of the input string'.

In non-strict mode, the new implementation continues trying to process
the input string to its end even after seeing an error. This makes it
possible to determine in which candidate encoding the string has the
smallest number of errors, i.e. the 'closest matching encoding'.
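
The difference between the two selection rules can be sketched in plain
PHP. This is only an illustration under stated assumptions, not the C code
in mbstring: the two validators are simplified stand-ins (the UTF-8 one
ignores overlong forms and surrogates, and 'TOY-8BIT' is a made-up
single-byte encoding that rejects bytes 0x80-0x9F), and all function names
are hypothetical.

```php
<?php
/** Offsets at which a simplified UTF-8 decoder hits an invalid sequence. */
function utf8ErrorOffsets(string $s): array
{
    $errors = [];
    $n = strlen($s);
    for ($i = 0; $i < $n; ) {
        $b = ord($s[$i]);
        if ($b < 0x80) { $i++; continue; }          // ASCII byte
        // Number of continuation bytes expected after this lead byte.
        if ($b >= 0xC2 && $b <= 0xDF)     { $need = 1; }
        elseif ($b >= 0xE0 && $b <= 0xEF) { $need = 2; }
        elseif ($b >= 0xF0 && $b <= 0xF4) { $need = 3; }
        else { $errors[] = $i++; continue; }         // invalid lead byte
        $ok = $i + $need < $n;                       // enough bytes left?
        for ($k = 1; $ok && $k <= $need; $k++) {
            $ok = (ord($s[$i + $k]) & 0xC0) === 0x80;
        }
        if ($ok) { $i += $need + 1; } else { $errors[] = $i++; }
    }
    return $errors;
}

/** Made-up single-byte encoding: bytes 0x80-0x9F count as errors. */
function toy8bitErrorOffsets(string $s): array
{
    $errors = [];
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        if ($b >= 0x80 && $b <= 0x9F) { $errors[] = $i; }
    }
    return $errors;
}

/** Old rule as described above: the first error furthest from the start wins. */
function pickByFirstError(string $s, array $candidates): string
{
    $best = ''; $bestPos = -1;
    foreach ($candidates as $name => $validator) {
        $pos = $validator($s)[0] ?? strlen($s);      // no error: end of string
        if ($pos > $bestPos) { $best = $name; $bestPos = $pos; }
    }
    return $best;
}

/** New rule as described above: the smallest total number of errors wins. */
function pickByErrorCount(string $s, array $candidates): string
{
    $best = ''; $bestCount = PHP_INT_MAX;
    foreach ($candidates as $name => $validator) {
        $count = count($validator($s));
        if ($count < $bestCount) { $best = $name; $bestCount = $count; }
    }
    return $best;
}

$candidates = ['UTF-8' => 'utf8ErrorOffsets', 'TOY-8BIT' => 'toy8bitErrorOffsets'];
$input = "\xC3\x89\xE9x\xE9x\xE9x";
// UTF-8:    errors at offsets 2, 4, 6 (first error at 2, three errors)
// TOY-8BIT: error at offset 1        (first error at 1, one error)
echo pickByFirstError($input, $candidates), "\n";   // UTF-8
echo pickByErrorCount($input, $candidates), "\n";   // TOY-8BIT
```

For this input the first rule prefers the candidate that survives one byte
longer, while the second prefers the candidate with fewer errors overall.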

Rejecting candidate encodings as soon as it saw an error gave the old
implementation a marked performance advantage in non-strict mode;
however, the new implementation still beats it in most cases. Here are
a few sample microbenchmark results:

  UTF-8, ~100 codepoints, strict mode
  Old: 0.080s (100,000 calls)
  New: 0.026s ("       "    )

  UTF-8, ~100 codepoints, non-strict mode
  Old: 0.079s (100,000 calls)
  New: 0.033s ("       "    )

  UTF-8, ~10000 codepoints, strict mode
  Old: 6.708s (60,000 calls)
  New: 1.383s ("      "    )

  UTF-8, ~10000 codepoints, non-strict mode
  Old: 6.705s (60,000 calls)
  New: 3.044s ("      "    )

Notice that the old implementation had almost identical performance
between strict and non-strict mode, while the new suffers a significant
performance penalty for non-strict detection. This is the cost of
implementing the behavior specified in the documentation.
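
The commit message does not say how these timings were collected. A
harness along the following lines is one plausible way to take comparable
measurements; the helper name, the test string, and the candidate list are
all assumptions made for illustration.

```php
<?php
/**
 * Time $calls invocations of $fn and return the elapsed seconds.
 * Hypothetical helper; not part of php-src.
 */
function timeCalls(callable $fn, int $calls): float
{
    $start = hrtime(true);                 // monotonic clock, in nanoseconds
    for ($i = 0; $i < $calls; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / 1e9;
}

// Assumed input: 9 code points repeated 11 times, i.e. ~100 code points.
$text = str_repeat('日本語のテキスト。', 11);
// Assumed candidate list; the benchmarked list is not given above.
$encodings = ['UTF-8', 'SJIS', 'EUC-JP', 'UTF-16LE'];

if (function_exists('mb_detect_encoding')) {
    foreach ([true, false] as $strict) {
        $secs = timeCalls(
            fn() => mb_detect_encoding($text, $encodings, $strict),
            100000
        );
        printf("%s mode: %.3fs (100,000 calls)\n",
               $strict ? 'strict' : 'non-strict', $secs);
    }
}
```

Running the same script against builds from before and after this commit
would give an old/new comparison of the kind shown above.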

A couple more sample results:

  SJIS, ~10000 codepoints, strict mode
  Old: 4.563s
  New: 1.084s

  SJIS, ~10000 codepoints, non-strict mode
  Old: 4.569s
  New: 2.863s

This is the only case I found where the new implementation loses:

  UTF-16LE, ~10000 codepoints, non-strict mode
  Old: 1.514s
  New: 2.813s

The reason is that the test strings happened to be invalid right from
the first few bytes for all the candidate encodings except for UTF-16LE;
so the old implementation would immediately reject all those encodings
and only process the entire string in UTF-16LE.

I believe mb_detect_encoding could be made much faster if we identified
good criteria for when to reject candidate encodings before reaching
the end of the input string.
2023-01-03 09:10:10 +02:00


PHP NEWS
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
?? ??? ????, PHP 8.3.0alpha1
- CLI:
. Added pdeathsig to builtin server to terminate workers when the master
process is killed. (ilutov)
- Core:
. Fixed bug GH-9388 (Improve unset property and __get type incompatibility
error message). (ilutov)
. SA_ONSTACK is now set for signal handlers to be friendlier to other
in-process code such as Go's cgo. (Kévin Dunglas)
. SA_ONSTACK is now set when signals are disabled. (Kévin Dunglas)
. Fix GH-9649: Signal handlers now do a no-op instead of crashing when
executed on threads not managed by TSRM. (Kévin Dunglas)
. Fixed potential NULL pointer dereference in Windows shm*() functions. (cmb)
. Added shadow stack support for fibers. (Chen Hu)
. Fix bug GH-9965 (Fix accidental caching of default arguments with side
effects). (ilutov)
- Fileinfo:
. Upgrade bundled libmagic to 5.43. (Anatol)
- FPM:
. The status.listen shared pool now uses the same php_values (including
expose_php) and php_admin_value as the pool it is shared with. (dwxh)
- GD:
. Fixed bug #81739: OOB read due to insufficient input validation in
imageloadfont(). (CVE-2022-31630) (cmb)
- Hash:
. Fixed bug #81738: buffer overflow in hash_update() on long parameter.
(CVE-2022-37454) (nicky at mouha dot be)
- Intl:
. Added pattern format error infos for numfmt_set_pattern. (David Carlier)
- JSON:
. Added json_validate(). (Juan Morales)
- MBString:
. mb_detect_encoding is better able to identify the correct encoding for Turkish text. (Alex Dowad)
. mb_detect_encoding's "non-strict" mode now behaves as described in the
documentation. Previously, it would return false if the very first byte
of the input string was invalid in all candidate encodings. (Alex Dowad)
- Opcache:
. Added start, restart and force restart time to opcache's
phpinfo section. (Mikhail Galanin)
. Fix GH-9139: Allow FFI in opcache.preload when opcache.preload_user=root.
(Arnaud, Kapitan Oczywisty)
. Made opcache.preload_user always optional in the cli and phpdbg SAPIs.
(Arnaud)
. Allows W/X bits on page creation on FreeBSD despite system settings.
(David Carlier)
- PCNTL:
. SA_ONSTACK is now set for pcntl_signal. (Kévin Dunglas)
. Added SIGINFO constant. (David Carlier)
- Posix:
. Added posix_sysconf. (David Carlier)
- Random:
. Added Randomizer::getBytesFromString(). (Joshua Rüsweg)
. Added Randomizer::nextFloat(), ::getFloat(), and IntervalBoundary. (timwolla)
- Reflection:
. Fix GH-9470 (ReflectionMethod constructor should not find private parent
method). (ilutov)
- Sockets:
. Added SO_ATTACH_REUSEPORT_CBPF socket option, to give tighter control
over socket binding for a cpu core. (David Carlier)
. Added SKF_AD_QUEUE for cbpf filters. (David Carlier)
. Added socket_atmark, for when send/recv need to use MSG_OOB. (David Carlier)
. Added TCP_QUICKACK constant, to give tighter control over
ACK delays. (David Carlier)
- Standard:
. E_NOTICEs emitted by unserialize() have been promoted to E_WARNING. (timwolla)
- Streams:
. Fixed bug #51056: blocking fread() will block even if data is available.
(Jakub Zelenka)
<<< NOTE: Insert NEWS from last stable release here prior to actual release! >>>