php-src/ext/mbstring/mbstring.h
Alex Dowad 3ab10da758 Take order of candidate encodings into account when guessing text encoding
The documentation for mb_detect_encoding says that this function
"Detects the most likely character encoding for string `string` from an
ordered list of candidates".

Prior to 28b346bc06, mb_detect_encoding did not really attempt to
determine the "most likely" text encoding for the input string. It
would just return the first candidate encoding for which the string was
valid. In 28b346bc06, I amended this function so that it uses heuristics
to try to guess which candidate encoding is "most likely".

However, the caller did not have any way to indicate which candidate
text encoding(s) they consider to be more likely, in case the
heuristics applied are inconclusive. In the language of Bayesian
probability, there was no way for the caller to indicate their 'prior'
assignment of probabilities.

Further, the documentation for mb_detect_encoding also says that the
second parameter `encodings` is "a list of character encodings to try,
in order". The documentation clearly implies that the order of
the `encodings` argument should be significant.

Therefore, amend mb_detect_encoding so that while it still uses
heuristics to guess the most likely text encoding for the input string,
it favors those which are earlier in the list of candidate encodings.
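
To make the idea concrete, here is a minimal sketch in C of how a
position-based bias could be combined with heuristic scores. It is not the
code in mbstring.c: the `candidate_score` struct, `pick_candidate`, and the
additive `ORDER_PENALTY` constant are hypothetical names invented for this
illustration, and the real heuristics may weight list position differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-candidate result; not a php-src type. */
struct candidate_score {
    const char *name;       /* encoding name, for display only */
    int valid;              /* nonzero if the input was valid in this encoding */
    unsigned long demerits; /* heuristic penalty: higher means less likely */
};

/* Assumed penalty added per list position; an arbitrary illustrative value. */
#define ORDER_PENALTY 5UL

/* Return the index of the preferred candidate, or (size_t)-1 if none is valid.
 * When order_significant is nonzero, later candidates pay a growing penalty,
 * so an earlier candidate wins unless a later one scores clearly better. */
size_t pick_candidate(const struct candidate_score *c, size_t n, int order_significant)
{
    assert(c != NULL || n == 0);

    size_t best = (size_t)-1;
    unsigned long best_score = 0;

    for (size_t i = 0; i < n; i++) {
        if (!c[i].valid) {
            continue; /* invalid candidates are never chosen */
        }
        unsigned long score = c[i].demerits;
        if (order_significant) {
            score += ORDER_PENALTY * i;
        }
        /* Strict comparison: on a tie, the earlier candidate is kept. */
        if (best == (size_t)-1 || score < best_score) {
            best = i;
            best_score = score;
        }
    }
    return best;
}
```

With two valid candidates whose demerits are 10 and 8, this sketch picks
the second when order is ignored, and the first when order is significant
(since 8 + 5 exceeds 10).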

One complication is that many callers of mb_detect_encoding use it
in this way:

    mb_detect_encoding($string, mb_list_encodings());

In a majority of cases, this is bad code; mb_detect_encoding will be
much slower and its results less reliable than if a smaller list of
candidates were used. However, since such code already exists and
people are using it in production, we should not unnecessarily break it.
The order of candidate encodings obviously does not express any prior
belief of which candidates are more likely in this case, and treating
it as if it did will degrade the accuracy of the result.

Since mb_list_encodings now returns a single, immutable array on each
call, we can avoid that problem by turning off the new behavior when
we receive the array of encodings returned by mb_list_encodings.
This implementation means that if the user does this:

    $a = mb_list_encodings();
    mb_detect_encoding($string, $a);

...then the order of candidate encodings will not be considered.
However, if the user explicitly initializes their own array of all
supported legacy text encodings, then the order *will* be considered.
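
The identity check described above can be sketched in C as a plain pointer
comparison. This is only an illustration under assumptions: the
`encoding_list` struct stands in for the engine's HashTable, and
`list_all_encodings` and `order_is_significant` are hypothetical helpers,
not functions from mbstring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the engine's HashTable; only its address matters here. */
struct encoding_list {
    int unused;
};

/* Hypothetical: the one shared, immutable list of all supported encodings. */
static struct encoding_list all_encodings_list;

/* Hypothetical counterpart of mb_list_encodings(): always hands out the
 * same shared list rather than building a new one. */
const struct encoding_list *list_all_encodings(void)
{
    return &all_encodings_list;
}

/* Order is treated as significant unless the caller passed the shared list
 * itself. A caller-built list with identical contents lives at a different
 * address, so its order is still honored. */
bool order_is_significant(const struct encoding_list *candidates)
{
    assert(candidates != NULL);
    return candidates != &all_encodings_list;
}
```

Comparing addresses rather than contents is what makes storing the result
in a variable first (as in the `$a` example) behave the same as passing
the call result directly.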

The other functions which also follow this new behavior are:

• mb_convert_variables
• mb_convert_encoding (when multiple candidate input encodings are
  listed)

Other places where "detection" (or really "guessing") of text encoding
may be performed include:

• mb_send_mail
• Zend engine, when determining the encoding of a PHP script
• mbstring processing of HTTP request contents, when http_input INI
  parameter is set to a list

In these cases, the new logic based on order of candidate encodings
is *not* enabled. It *might* be logical to consider the order of
candidate encodings in some or all of these cases, but I'm not sure if
that is true, so it seems wiser to avoid more behavior changes than are
necessary. Further, ever since the new encoding detection heuristics
were implemented in 28b346bc06, we have not received any complaints of
user code being broken in these areas. So I am reluctant to "fix what
isn't broken".

Well, some might say that applying the new detection heuristics
to mb_send_mail, etc. in 28b346bc06 was "fixing what wasn't broken",
but (cough cough) I don't have any comment on that...
2023-05-16 07:01:07 -07:00

/*
   +----------------------------------------------------------------------+
   | Copyright (c) The PHP Group                                          |
   +----------------------------------------------------------------------+
   | This source file is subject to version 3.01 of the PHP license,      |
   | that is bundled with this package in the file LICENSE, and is        |
   | available through the world-wide-web at the following url:           |
   | https://www.php.net/license/3_01.txt                                 |
   | If you did not receive a copy of the PHP license and are unable to   |
   | obtain it through the world-wide-web, please send a note to          |
   | license@php.net so we can mail you a copy immediately.               |
   +----------------------------------------------------------------------+
   | Author: Tsukada Takuya <tsukada@fminn.nagano.nagano.jp>              |
   |         Hironori Sato <satoh@jpnnet.com>                             |
   |         Shigeru Kanemoto <sgk@happysize.co.jp>                       |
   +----------------------------------------------------------------------+
*/
#ifndef _MBSTRING_H
#define _MBSTRING_H
#include "php_version.h"
#define PHP_MBSTRING_VERSION PHP_VERSION
#ifdef PHP_WIN32
# undef MBSTRING_API
# ifdef MBSTRING_EXPORTS
# define MBSTRING_API __declspec(dllexport)
# elif defined(COMPILE_DL_MBSTRING)
# define MBSTRING_API __declspec(dllimport)
# else
# define MBSTRING_API /* nothing special */
# endif
#elif defined(__GNUC__) && __GNUC__ >= 4
# undef MBSTRING_API
# define MBSTRING_API __attribute__ ((visibility("default")))
#else
# undef MBSTRING_API
# define MBSTRING_API /* nothing special */
#endif
#include "libmbfl/mbfl/mbfilter.h"
#include "SAPI.h"
#define PHP_MBSTRING_API 20021024
extern zend_module_entry mbstring_module_entry;
#define phpext_mbstring_ptr &mbstring_module_entry
PHP_MINIT_FUNCTION(mbstring);
PHP_MSHUTDOWN_FUNCTION(mbstring);
PHP_RINIT_FUNCTION(mbstring);
PHP_RSHUTDOWN_FUNCTION(mbstring);
PHP_MINFO_FUNCTION(mbstring);
MBSTRING_API char *php_mb_safe_strrchr(const char *s, unsigned int c, size_t nbytes, const mbfl_encoding *enc);
MBSTRING_API zend_string* php_mb_convert_encoding_ex(
	const char *input, size_t length,
	const mbfl_encoding *to_encoding, const mbfl_encoding *from_encoding);
MBSTRING_API zend_string* php_mb_convert_encoding(
	const char *input, size_t length, const mbfl_encoding *to_encoding,
	const mbfl_encoding **from_encodings, size_t num_from_encodings);
MBSTRING_API size_t php_mb_mbchar_bytes(const char *s, const mbfl_encoding *enc);
MBSTRING_API size_t php_mb_stripos(bool mode, zend_string *haystack, zend_string *needle, zend_long offset, const mbfl_encoding *enc);
MBSTRING_API bool php_mb_check_encoding(const char *input, size_t length, const mbfl_encoding *encoding);
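/* Guess the most likely encoding for `strings` from the candidates in `elist`.
 * If `order_significant` is true, candidates earlier in `elist` are favored. */
MBSTRING_API const mbfl_encoding* mb_guess_encoding_for_strings(const unsigned char **strings, size_t *str_lengths, size_t n, const mbfl_encoding **elist, unsigned int elist_size, bool strict, bool order_significant);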
ZEND_BEGIN_MODULE_GLOBALS(mbstring)
	char *internal_encoding_name;
	const mbfl_encoding *internal_encoding;
	const mbfl_encoding *current_internal_encoding;
	const mbfl_encoding *http_output_encoding;
	const mbfl_encoding *current_http_output_encoding;
	const mbfl_encoding *http_input_identify;
	const mbfl_encoding *http_input_identify_get;
	const mbfl_encoding *http_input_identify_post;
	const mbfl_encoding *http_input_identify_cookie;
	const mbfl_encoding *http_input_identify_string;
	const mbfl_encoding **http_input_list;
	size_t http_input_list_size;
	const mbfl_encoding **detect_order_list;
	size_t detect_order_list_size;
	const mbfl_encoding **current_detect_order_list;
	size_t current_detect_order_list_size;
	enum mbfl_no_encoding *default_detect_order_list;
	size_t default_detect_order_list_size;
	HashTable *all_encodings_list;
	int filter_illegal_mode;
	uint32_t filter_illegal_substchar;
	int current_filter_illegal_mode;
	uint32_t current_filter_illegal_substchar;
	enum mbfl_no_language language;
	bool encoding_translation;
	bool strict_detection;
	size_t illegalchars;
	bool outconv_enabled;
	unsigned int outconv_state;
	void *http_output_conv_mimetypes;
#ifdef HAVE_MBREGEX
	struct _zend_mb_regex_globals *mb_regex_globals;
	zend_long regex_stack_limit;
#endif
	zend_string *last_used_encoding_name;
	const mbfl_encoding *last_used_encoding;
	/* Whether an explicit internal_encoding / http_output / http_input encoding was set. */
	bool internal_encoding_set;
	bool http_output_set;
	bool http_input_set;
#ifdef HAVE_MBREGEX
	zend_long regex_retry_limit;
#endif
ZEND_END_MODULE_GLOBALS(mbstring)
#define MBSTRG(v) ZEND_MODULE_GLOBALS_ACCESSOR(mbstring, v)
#if defined(ZTS) && defined(COMPILE_DL_MBSTRING)
ZEND_TSRMLS_CACHE_EXTERN()
#endif
#endif /* _MBSTRING_H */