php-src

mirror of https://github.com/php/php-src.git synced 2024-09-22 02:17:32 +00:00

Author	SHA1	Message	Date
Ayesh Karunaratne	421ac9ac28	ext/mbstring: update to Unicode 15 Updates UCD to Unicode 15.1 (released 2023 Sept). The upcoming Unicode 16 version will be released roughly on 2024 Sept. Previously: `0fdffc18`, #7502 UCD 15.1 `DerivedNormalizationProps` contains multiple properties in the same line, which breaks the parser. This also updates the `ucgendat.php` script to allow two or three fields in each line, and to look for the `Cased` and `Case_Ignorable` properties in either of the fields to mimic the previous behavior.	2024-06-29 17:24:52 +02:00
Colin O'Dell	fe36b81d5e	Update Unicode tables to 14.0.0 Closes GH-7502.	2021-09-20 09:58:20 +02:00
Nikita Popov	425c2e3ba1	Combine control into one character group Same as with punct, we're currently not interested in distinguishing between Cc and Cf, so only store their union.	2021-08-24 20:39:16 +02:00
Nikita Popov	f458b16041	Combine punctuation into one character group We're not currently interested in distinguishing between individual punctuation types, so just merge everything into one general category to make the property lookup more efficient.	2021-08-24 19:21:21 +02:00
Nikita Popov	3be94217f4	Don't use sentinel value for unicode property lookup 0xffff was used to mark character properties without any members. This made the code unnecessarily complicated, because we need to check for 0xffff values when looking up the property ranges. We can simply encode this as an empty set of ranges.	2021-08-24 15:53:43 +02:00
Alex Dowad	d8c785b894	Update 'East Asian Width' table to comply with Unicode 13.0 Instead of manually maintaining the data in eaw_table.h, it is now automatically generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from the Unicode Consortium. Something must be said about the deleted test case. Back in 2004, someone noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was added to expose the problem. Well, time keeps moving on, and with the changing years, new Unicodes are born and old Unicodes die. Some characters which were counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0, which renders the test case obsolete. At the same time, make a couple of spelling/grammar fixes in ucgendat.php.	2021-01-19 20:38:44 +02:00
Nikita Popov	0fdffc1807	Update Unicode tables to 13.0.0	2020-03-12 11:29:51 +01:00
Nikita Popov	f2be6e732a	Update data tables for Unicode 11	2018-06-11 20:25:37 +02:00
Gabriel Caruso	2238403892	Trailing whitespaces on ext/* Signed-off-by: Gabriel Caruso <carusogabriel34@gmail.com>	2018-01-04 02:38:32 -02:00
Nikita Popov	f4a1d9c821	Fixed bug #65544 and #71298	2017-07-28 14:57:08 +02:00
Nikita Popov	582a65b06f	Implement full case mapping Implement full case mapping according to SpecialCasing.txt and also full case folding according to CaseFolding.txt (F). There are a number of caveats: * Only language-agnostic and unconditional full case mapping is implemented. The only language-agnostic conditional case mapping rule relates to Greek sigma in final position (Final_Sigma). Correctly handling this requires both arbitrary lookahead and lookbehind, which would require some larger changes to how the case mapping is implemented. This is a possible future extension. * The only language-specific handling that is implemented is for Turkish dotted/undotted Is, if the ISO-8859-9 encoding is used. This matches the previous behavior and makes sure that no codepoints not supported by the encoding are produced. A future extension would be to also handle the Turkish mappings specified by SpecialCasing.txt based on the mbfl internal language. * Full case folding is implemented, but case-insensitive mb_* operations continue to use simple case folding. The reason is that full case folding of the haystack string may change the position at which a match occurred. This would have to be mapped back into the position in the original string. * mb_convert_case() exposes both the full and the simple case mapping / folding, where full is the default. The constants are: * MB_CASE_LOWER (used by mb_strtolower) * MB_CASE_UPPER (used by mb_strtoupper) * MB_CASE_TITLE * MB_CASE_FOLD * MB_CASE_LOWER_SIMPLE * MB_CASE_UPPER_SIMPLE * MB_CASE_TITLE_SIMPLE * MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)	2017-07-28 12:32:50 +02:00
Nikita Popov	9ac7c1e71d	Use case-folding for case insensitive comparisons Instead of using lowercasing.	2017-07-28 12:32:50 +02:00
Nikita Popov	80a0601fe5	Use MPH for case maps Instead of performing a binary search, use a hashtable to store the case maps. In particular a minimal perfect hash construction is used, which does not require collision resolution (but does use an auxiliary table for the hash perturbation).	2017-07-28 12:32:50 +02:00
Nikita Popov	eacd70f762	Don't store titlecase if same as uppercase The totitle code already has a fallback for that case.	2017-07-28 12:32:50 +02:00
Nikita Popov	cedfc2f426	Drop implementation-specific character properties No point in keeping around non-standard character properties if we're not using them and most are not even being populated.	2017-07-28 12:32:50 +02:00
Nikita Popov	8ace7045e9	Handle character ranges in ucgendat generically In particular, the previous implementation did not account for Tangut Ideographs and CJK Ideograph extensions C through F.	2017-07-25 18:48:12 +02:00
Nikita Popov	4bd61ec7ad	Fix handling of some special ranges in ucgendat * Han Ideographs go up to U+9FEA. * CJK Compatibility Ideographs are no longer specified as a special range in remotely recent versions of Unicode. * Surrogate properties should be assigned to U+D800-U+DFFF, not to U+10000-U+1FFFF.	2017-07-25 18:48:12 +02:00
Nikita Popov	3c6b2512cb	Change layout of case mapping table Previously the case mapping table was segregated by the type of the character (upper, lower, title) and always stored the other two variants (key, other1, other2). Now the table is segregated by the target type (key, other). As only very few characters have more than one target this only slightly increases the size of the table. The advantage of this layout is that we only need to perform a single table lookup in the case table. Previously, depending on the case that was hit, either one lookup in the property table, or two lookups in the property table and one lookup in the case table were required. This changes the layout from libunicode in the OpenLDAP project -- however, the last commit there was over 10 years ago, so I don't see value in keeping this in sync.	2017-07-23 18:33:15 +02:00
Nikita Popov	706f0cf8a0	Update Unicode data for Unicode 10	2017-07-23 16:05:39 +02:00
Nikita Popov	24cfbfd56f	Update ucgendat for more bidi properties Handle them the same way as others -- by classifying as Other Neutral.	2017-07-23 16:03:11 +02:00
Nikita Popov	077e61fad3	Fixed bug #69267 completely ucgendat.c was assuming that a title-case character is a character that has both lower and upper-case variants. However, there are title-case characters that only have a lower-case variant. Use the Lt general character property to determine where in the case map the character should be placed instead.	2017-07-23 15:30:17 +02:00
Nikita Popov	0e4af9192f	Partial fix for bug #69267 This pulls in 60a25c72ba389f53b0621ca250bc99f3b295d43f from the OpenLDAP project.	2017-07-23 14:47:21 +02:00
Xinchen Hui	e841016df7	Upgrade unicode_data.h to UnicodeData.txt 8.0.0 (part of bug #70475 ext/mbstring/unicode_data.h needs update)	2015-09-15 07:56:10 -07:00
Stanislav Malyshev	b7a7b1a624	trailing whitespace removal	2015-01-10 15:07:38 -08:00
Gustavo André dos Santos Lopes	42dae97fd4	- Fixed bug #52981 (Unicode casing table was out-of-date). Updated with UnicodeData-6.0.0d7.txt and included the source of the generator program with the distribution. The replaced tables, generated circa 2002, seem to reflect Unicode 3.2. I was unable to generate the same property offsets with Unicode 3.2 data, but all the tests I made indicate php_unicode_is_prop() is returning the correct values. The replaced file merely says it used a "modified version" of ucgendat, which is not very helpful. The results I got were not significantly different, only slightly higher offsets at two properties, which were carried over to the subsequent properties. I was, however, able to replicate precisely the casing table. The extent of the "modifications" besides omitting most of the tables, a slightly different layout and the casing table offsets having been multiplied by 3 is unclear. The test suite showed no regressions; however, it's very poor in testing the modified portion of the extension.	2010-10-05 01:54:17 +00:00
Wez Furlong	1a87c6b5bf	(PHP mb_convert_case) Add function that will convert the case of a string, respecting its encoding (or the internal encoding).	2002-09-26 00:53:47 +00:00

26 Commits
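
Commit 3be94217f4 in the log above replaces the 0xffff sentinel for member-less properties with an empty set of ranges. The following is a minimal sketch of that idea in C, not the code from php-src: the table contents and the names `prop_offsets`, `prop_ranges`, `prop_lookup`, and the `PROP_*` constants are all hypothetical and chosen only for illustration. Each property owns a slice of a shared range table, delimited by two neighbouring offsets; a property with no members simply has two equal offsets, so the lookup needs no special case.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical properties; PROP_EMPTY deliberately has no members. */
enum { PROP_UPPER, PROP_EMPTY, PROP_DIGIT, PROP_COUNT };

/* prop_offsets[p] .. prop_offsets[p + 1] (exclusive) delimit the slice of
 * prop_ranges[] that belongs to property p. Equal neighbours encode an
 * empty set, so no sentinel value is needed. Illustrative data only. */
static const unsigned short prop_offsets[PROP_COUNT + 1] = { 0, 2, 2, 3 };

/* Sorted, non-overlapping, inclusive [first, last] code point ranges. */
static const uint32_t prop_ranges[][2] = {
    { 0x0041, 0x005A }, /* PROP_UPPER: A-Z */
    { 0x00C0, 0x00D6 }, /* PROP_UPPER: U+00C0-U+00D6 */
    { 0x0030, 0x0039 }, /* PROP_DIGIT: 0-9 */
};

/* Binary search over the property's slice. For an empty slice lo == hi,
 * the loop body never runs, and the function returns false. */
static bool prop_lookup(uint32_t code, int prop)
{
    size_t lo = prop_offsets[prop];
    size_t hi = prop_offsets[prop + 1];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (code < prop_ranges[mid][0]) {
            hi = mid;
        } else if (code > prop_ranges[mid][1]) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}
```

With this layout, a call such as `prop_lookup('A', PROP_EMPTY)` falls straight through the loop, which is the simplification the commit message describes: the caller never has to test for a marker value before searching.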