php-src

mirror of https://github.com/php/php-src.git synced 2024-09-22 02:17:32 +00:00

Author	SHA1	Message	Date
Ayesh Karunaratne	23f99f08c9	ext/mbstring: update UCD parser to accept characters with multiple properties	2024-06-29 17:24:52 +02:00
Alex Dowad	0b32a15eb0	Optimize mb_str{,im}width for performance Rather than doing a linear search of a table of fullwidth codepoint ranges for every input character, 1) Short-cut the search if the codepoint is below the first such range 2) Otherwise, do a binary (rather than linear) search	2021-09-29 18:19:01 +02:00
Nikita Popov	425c2e3ba1	Combine control into one character group Same as with punct, we're currently not interested in distinguishing between Cc and Cf, so only store their union.	2021-08-24 20:39:16 +02:00
Nikita Popov	f458b16041	Combine punctuation into one character group We're not currently interested in distinguishing between individual punctuation types, so just merge everything into one general category to make the property lookup more efficient.	2021-08-24 19:21:21 +02:00
Nikita Popov	3be94217f4	Don't use sentinel value for unicode property lookup 0xffff was used to mark character properties without any members. This made the code unnecessarily complicated, because we need to check for 0xffff values when looking up the property ranges. We can simply encode this as an empty set of ranges.	2021-08-24 15:53:43 +02:00
Alex Dowad	d8c785b894	Update 'East Asian Width' table to comply with Unicode 13.0 Instead of manually maintaining the data in eaw_table.h, it is now automatically generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from the Unicode Consortium. Something must be said about the deleted test case. Back in 2004, someone noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was added to expose the problem. Well, time keeps moving on, and with the changing years, new Unicodes are born and old Unicodes die. Some characters which were counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0, which renders the test case obsolete. At the same time, make a couple of spelling/grammar fixes in ucgendat.php.	2021-01-19 20:38:44 +02:00
Peter Kokot	975cb57930	[ci skip] Move OpenLDAP license to redistributable info file	2019-05-06 23:02:46 +02:00
Peter Kokot	36c7946522	Move ucgendata README to generator file header	2019-04-20 22:35:25 +02:00
Peter Kokot	1ad08256f3	Sync leading and final newlines in source code files This patch adds missing newlines, trims multiple redundant final newlines into a single one, and trims redundant leading newlines. According to POSIX, a line is a sequence of zero or more non-' <newline>' characters plus a terminating '<newline>' character. [1] Files should normally have at least one final newline character. C89 [2] and later standards [3] mention a final newline: "A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character." Although it is not mandatory for all files to have a final newline fixed, a more consistent and homogeneous approach brings less of commit differences issues and a better development experience in certain text editors and IDEs. [1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206 [2] https://port70.net/~nsz/c/c89/c89-draft.html#2.1.1.2 [3] https://port70.net/~nsz/c/c99/n1256.html#5.1.1.2	2018-10-14 12:56:38 +02:00
Peter Kokot	37c329d715	Trim trailing whitespace in source code files	2018-10-13 14:17:28 +02:00
Peter Kokot	02294f0c84	Make PHP development tools files and scripts executable This patch makes several scripts and PHP development tools files executable and adds more proper shebangs to the PHP scripts. The `#!/usr/bin/env php` shebang provides running the script via `./script.php` and uses env to find PHP script location on the system. At the same time it still provides running the script with a user defined PHP location using `php script.php`.	2018-08-29 20:58:17 +02:00
Nikita Popov	f4a1d9c821	Fixed bug #65544 and #71298	2017-07-28 14:57:08 +02:00
Nikita Popov	582a65b06f	Implement full case mapping Implement full case mapping according to SpecialCasing.txt and also full case folding according to CaseFolding.txt (F). There are a number of caveats: * Only language-agnostic and unconditional full case mapping is implemented. The only language-agnostic conditional case mapping rule relates to Greek sigma in final position (Final_Sigma). Correctly handling this requires both arbitrary lookahead and lookbehind, which would require some larger changes to how the case mapping is implemented. This is a possible future extension. * The only language-specific handling that is implemented is for Turkish dotted/undotted Is, if the ISO-8859-9 encoding is used. This matches the previous behavior and makes sure that no codepoints not supported by the encoding are produced. A future extension would be to also handle the Turkish mappings specified by SpecialCasing.txt based on the mbfl internal language. * Full case folding is implemented, but case-insensitive mb_* operations continue to use simple case folding. The reason is that full case folding of the haystack string may change the position at which a match occurred. This would have to be mapped back into the position in the original string. * mb_convert_case() exposes both the full and the simple case mapping / folding, where full is the default. The constants are: * MB_CASE_LOWER (used by mb_strtolower) * MB_CASE_UPPER (used by mb_strtoupper) * MB_CASE_TITLE * MB_CASE_FOLD * MB_CASE_LOWER_SIMPLE * MB_CASE_UPPER_SIMPLE * MB_CASE_TITLE_SIMPLE * MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)	2017-07-28 12:32:50 +02:00
Nikita Popov	9ac7c1e71d	Use case-folding for case insensitive comparisons Instead of using lowercasing.	2017-07-28 12:32:50 +02:00
Nikita Popov	80a0601fe5	Use MPH for case maps Instead of performing a binary search, use a hashtable to store the case maps. In particular a minimal perfect hash construction is used, which does not require collision resolution (but does use an auxiliary table for the hash perturbation).	2017-07-28 12:32:50 +02:00
Nikita Popov	eacd70f762	Don't store titlecase if same as uppercase The totitle code already has a fallback for that case.	2017-07-28 12:32:50 +02:00
Nikita Popov	cedfc2f426	Drop implementation-specific character properties No point in keeping around non-standard character properties if we're not using them and most are not even being populated.	2017-07-28 12:32:50 +02:00
Nikita Popov	8ace7045e9	Handle character ranges in ucgendat generically In particular, the previous implementation did not account for Tangut Ideographs and CJK Ideograph extensions C through F.	2017-07-25 18:48:12 +02:00
Nikita Popov	0c0e35fedc	Port ucgendat to PHP Implemented such that the output is identical, including some quirks that should be fixed subsequently.	2017-07-25 18:48:12 +02:00
Nikita Popov	4bd61ec7ad	Fix handling of some special ranges in ucgendat * Han Ideographs go up to U+9FEA. * CJK Compatibility Ideographs are no longer specified as a special range in remotely recent versions of Unicode. * Surrogate properties should be assigned to U+D800-U+DFFF, not to U+10000-U+1FFFF.	2017-07-25 18:48:12 +02:00
Nikita Popov	3c6b2512cb	Change layout of case mapping table Previously the case mapping table was segregated by the type of the character (upper, lower, title) and always stored the other two variants (key, other1, other2). Now the table is segregated by the target type (key, other). As only very few characters have more than one target this only slightly increases the size of the table. The advantage of this layout is that we only need to perform a single table lookup in the case table. Previously, depending on the case that was hit, either one lookup in the property table, or two lookups in the property table and one lookup in the case table were required. This changes the layout from libunicode in the OpenLDAP project -- however, the last commit there was over 10 years ago, so I don't see value in keeping this in sync.	2017-07-23 18:33:15 +02:00
Nikita Popov	24cfbfd56f	Update ucgendat for more bidi properties Handle them the same way as others -- by classifying as Other Neutral.	2017-07-23 16:03:11 +02:00
Nikita Popov	077e61fad3	Fixed bug #69267 completely ucgendat.c was assuming that a title-case character is a character that has both lower and upper-case variants. However, there are title-case characters that only have a lower-case variant. Use the Lt general character property to determine where in the case map the character should be placed instead.	2017-07-23 15:30:17 +02:00
Nikita Popov	0e4af9192f	Partial fix for bug #69267 This pulls in 60a25c72ba389f53b0621ca250bc99f3b295d43f from the OpenLDAP project.	2017-07-23 14:47:21 +02:00
olshevskiy87	8bdec7a248	fix typos Signed-off-by: olshevskiy87 <olshevskiy87@bk.ru>	2015-05-13 22:28:35 +04:00
Stanislav Malyshev	b7a7b1a624	trailing whitespace removal	2015-01-10 15:07:38 -08:00
Gustavo André dos Santos Lopes	99807e9a72	- Moved ucgendat.c to a separate directory and included the OpenLDAP license there, as required by the license itself.	2010-10-05 02:34:35 +00:00

27 Commits