php-src/ext/mbstring
Alex Dowad a1a69c3734 Support Microsoft's "Best Fit" mappings for Windows-1252 text encoding
In b5ff87ca71, I made a number of adjustments to our conversion code
for CP1252. One of the adjustments was to make the mappings match those
published by the Unicode Consortium in the file CP1252.TXT. These do
not include mappings for the CP1252 bytes 0x81, 0x8D, 0x8F, 0x90, and
0x9D.

Rostyslav Gulka reported that this caused a problem. His application
stores binary JPEG data in an MS-SQL database. When the binary data is
SELECTed out of the database, it is treated as CP1252 text and
automatically converted to UTF-8. To recover the original binary data,
the application then converts it from UTF-8 back to CP1252.

Obviously, that does not work if certain CP1252 bytes do not map to
any Unicode codepoint at all.
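
The failure can be reproduced outside of mbstring. The following is a minimal sketch in Python, not the mbstring code: it relies on Python's built-in `cp1252` codec, which likewise leaves these five bytes unmapped, and the helper name `strict_roundtrip_ok` is hypothetical.

```python
# The five CP1252 bytes with no mapping in CP1252.TXT, as listed above.
UNMAPPED = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def strict_roundtrip_ok(data: bytes) -> bool:
    """Return True if CP1252 -> UTF-8 -> CP1252 reproduces the input bytes.

    Hypothetical helper for illustration. Python's "cp1252" codec is used
    as a stand-in for a converter that follows CP1252.TXT strictly.
    """
    try:
        # Step 1: the database driver treats the binary data as CP1252 text
        # and hands it to the application as UTF-8.
        utf8 = data.decode("cp1252").encode("utf-8")
        # Step 2: the application converts UTF-8 back to CP1252.
        return utf8.decode("utf-8").encode("cp1252") == data
    except UnicodeError:
        # A byte with no Unicode mapping cannot be converted at all.
        return False
```

Binary data survives the round trip only if it happens to avoid all five bytes, which cannot be relied on for data such as JPEG images.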

While this is a very unusual application of text encoding conversion,
and we might choose not to support it if there were no other basis for
including those mappings, Microsoft does in fact include them in the
Win32 API as "best fit" mappings. These are extra mappings from Unicode
to other text encodings, which the Win32 API function
WideCharToMultiByte uses by default unless the WC_NO_BEST_FIT_CHARS
flag is passed.
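
As a rough illustration of how such extra mappings make the conversion lossless, here is a sketch in Python. It is not the mbstring implementation. It covers only the five bytes named earlier and assumes each one maps to the Unicode codepoint with the same numeric value (0x81 to U+0081, and so on), which is an assumption about the contents of Microsoft's best fit table. The names `BEST_FIT`, `cp1252_to_unicode` and `unicode_to_cp1252` are hypothetical.

```python
# Assumed best fit mappings: each otherwise-unmapped CP1252 byte maps to the
# C1 control codepoint with the same numeric value.
BEST_FIT = {b: chr(b) for b in (0x81, 0x8D, 0x8F, 0x90, 0x9D)}
REVERSE = {ch: b for b, ch in BEST_FIT.items()}


def cp1252_to_unicode(data: bytes) -> str:
    """Decode CP1252, falling back to the best fit table for unmapped bytes."""
    return "".join(
        BEST_FIT[b] if b in BEST_FIT else bytes([b]).decode("cp1252")
        for b in data
    )


def unicode_to_cp1252(text: str) -> bytes:
    """Encode to CP1252, accepting the best fit codepoints as well."""
    return b"".join(
        bytes([REVERSE[ch]]) if ch in REVERSE else ch.encode("cp1252")
        for ch in text
    )
```

With the five extra entries, every one of the 256 byte values has a distinct codepoint, so converting to Unicode and back returns the original bytes.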

A list of these "best fit" mappings for CP1252 can be found here:

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
2022-12-09 15:18:37 +02:00
..
libmbfl Support Microsoft's "Best Fit" mappings for Windows-1252 text encoding 2022-12-09 15:18:37 +02:00
tests Support Microsoft's "Best Fit" mappings for Windows-1252 text encoding 2022-12-09 15:18:37 +02:00
ucgendat Combine control into one character group 2021-08-24 20:39:16 +02:00
common_codepoints.txt mb_detect_encoding recognizes all letters in Hungarian alphabet 2022-05-25 08:22:07 +02:00
config.m4 Remove duplicate implementation of CP932 from mbstring 2021-06-17 13:12:40 +02:00
config.w32 Remove duplicate implementation of CP932 from mbstring 2021-06-17 13:12:40 +02:00
CREDITS
gen_rare_cp_bitvec.php Improve detection accuracy of mb_detect_encoding 2021-10-19 18:05:51 +02:00
mb_gpc.c Update http->https in license (#6945) 2021-05-06 12:16:35 +02:00
mb_gpc.h Update http->https in license (#6945) 2021-05-06 12:16:35 +02:00
mbstring_arginfo.h Add support for generating MAY_BE_ARRAY_OF_REF func info flag (#7416) 2021-08-30 13:50:34 +02:00
mbstring.c Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings 2022-07-20 16:58:55 +02:00
mbstring.h Update http->https in license (#6945) 2021-05-06 12:16:35 +02:00
mbstring.stub.php Add support for generating MAY_BE_ARRAY_OF_REF func info flag (#7416) 2021-08-30 13:50:34 +02:00
php_mbregex.c Update http->https in license (#6945) 2021-05-06 12:16:35 +02:00
php_mbregex.h Update http->https in license (#6945) 2021-05-06 12:16:35 +02:00
php_onig_compat.h
php_unicode.c Return bool from php_unicode_is_prop() 2021-08-24 19:21:21 +02:00
php_unicode.h Add comments to grouped character properties 2021-08-24 22:09:26 +02:00
rare_cp_bitvec.h mb_detect_encoding recognizes all letters in Hungarian alphabet 2022-05-25 08:22:07 +02:00
unicode_data.h Update Unicode tables to 14.0.0 2021-09-20 09:58:20 +02:00