50 Commits

Author SHA1 Message Date
Alex Dowad
9ac49c0dd3 New implementation of mb_convert_kana
mb_convert_kana now uses the new text encoding conversion
filters. Microbenchmarking shows speed gains of 50%-150%
across various text encodings and input string lengths.

The behavior is the same as the old mb_convert_kana
except for one fix: if the 'zero codepoint' U+0000 appeared
in the input, the old implementation would sometimes drop
it, not passing it through to the output. This is now
fixed.
2022-07-20 07:44:19 +02:00
Alex Dowad
321dbd0413 Implement fast text conversion interface for ISO-2022-JP-2004
There were bugs in the legacy implementation. Lots of them.

It did not properly track whether it had switched to JISX 0213 plane 1
or plane 2. If it processed a character in plane 1 and then immediately
one in plane 2, it failed to emit the escape code to switch to plane 2.

Further, when converting codepoints from 0x80-0xFF to ISO-2022-JP-2004,
the legacy implementation would totally disregard which mode it was
operating in. Such codepoints would pass through directly to the output
without any escape sequences being emitted.

If that was not enough, all the legacy implementations of JISX 0213:2004
encodings had another common bug; their 'flush function' did not call
the next flush function in the chain of conversion filters. So if any
of these encodings were converted to an encoding where the flush
function was needed to finish the output string, then the output
would be truncated.
2022-05-28 21:53:36 +02:00
Alex Dowad
e5fdd5cef2 Implement fast text conversion interface for EUC-JP-2004
All the legacy implementations of JISX 0213:2004 encodings had a
common bug; their 'flush function' did not call the next flush function
in the chain of conversion filters. So if any of these encodings were
converted to an encoding where the flush function was needed to finish
the output string, then the output would be truncated.
2022-05-28 21:53:36 +02:00
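The flush bug described in the two entries above can be illustrated with a minimal sketch. The `filter` struct and all function names here are hypothetical simplifications, not the actual libmbfl definitions; the final stage deliberately holds back one byte so that a missing flush call visibly truncates the output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified conversion filter; not the real libmbfl struct. */
typedef struct filter {
    void (*write)(struct filter *self, int c);
    void (*flush)(struct filter *self);
    struct filter *next;  /* next filter in the chain, or NULL */
    int pending;          /* byte held back until more input or a flush; -1 if none */
    char out[32];         /* output buffer, only used by the last filter */
    size_t out_len;
} filter;

/* Last stage: emits the previously held byte, then holds back the new one. */
static void sink_write(filter *f, int c) {
    if (f->pending >= 0 && f->out_len < sizeof f->out)
        f->out[f->out_len++] = (char)f->pending;
    f->pending = c;
}

/* Last stage flush: emits whatever is still held back. */
static void sink_flush(filter *f) {
    if (f->pending >= 0 && f->out_len < sizeof f->out)
        f->out[f->out_len++] = (char)f->pending;
    f->pending = -1;
}

/* First stage: passes every unit straight to the next filter. */
static void pass_write(filter *f, int c) {
    f->next->write(f->next, c);
}

/* Buggy flush: does nothing, so later filters never get flushed. */
void pass_flush_buggy(filter *f) {
    (void)f;
}

/* Fixed flush: propagates the flush down the chain. */
void pass_flush_fixed(filter *f) {
    if (f->next && f->next->flush)
        f->next->flush(f->next);
}

/* Runs "abc" through a two-stage chain; returns the number of bytes produced. */
size_t run_chain(void (*head_flush)(filter *), char *dst, size_t dst_size) {
    filter sink = { sink_write, sink_flush, NULL, -1, {0}, 0 };
    filter head = { pass_write, head_flush, &sink, -1, {0}, 0 };
    const char *in = "abc";

    for (; *in; in++)
        head.write(&head, (unsigned char)*in);
    head.flush(&head);

    assert(sink.out_len < dst_size);
    memcpy(dst, sink.out, sink.out_len);
    dst[sink.out_len] = '\0';
    return sink.out_len;
}
```

With `pass_flush_buggy` as the head filter's flush function, the last byte stays stuck in the final stage; with `pass_flush_fixed`, the whole input comes out.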
Alex Dowad
e2459857af Remove duplicate implementation of CP932 from mbstring
Sigh. Double sigh. After fruitlessly searching the Internet for information on
this mysterious text encoding called "SJIS-open", I wrote a script to try
converting every Unicode codepoint from 0-0xFFFF and compare the results from
different variants of Shift-JIS, to see which one "SJIS-open" would be most
similar to.

The result? It's just CP932. There is no difference at all. So why do we have
two implementations of CP932 in mbstring?

In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or
"SJIS-ms"), add these as aliases to CP932 so existing code will continue to
work.
2021-06-17 13:12:40 +02:00
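The exhaustive comparison described in the entry above could look roughly like the following sketch. The commit's own script is not shown in the log, so this is only an assumed shape: `encoder_fn`, `toy_variant_a` and `toy_variant_b` are hypothetical stand-ins, and the toy encoders are not real Shift-JIS or CP932 tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encoder signature: writes up to 4 bytes for one codepoint,
 * returns the number of bytes written, or 0 if the codepoint is unmappable. */
typedef size_t (*encoder_fn)(uint32_t cp, unsigned char out[4]);

/* Toy stand-in for one encoding variant: ASCII only. */
size_t toy_variant_a(uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    return 0;
}

/* Toy stand-in for a second variant: like A, but also maps U+00A5 to 0x5C. */
size_t toy_variant_b(uint32_t cp, unsigned char out[4]) {
    if (cp == 0xA5) {
        out[0] = 0x5C;
        return 1;
    }
    return toy_variant_a(cp, out);
}

/* Encodes every codepoint from 0 to 0xFFFF with both encoders and
 * counts how many produce different output. */
size_t count_differences(encoder_fn a, encoder_fn b) {
    assert(a != NULL && b != NULL);
    size_t diffs = 0;

    for (uint32_t cp = 0; cp <= 0xFFFF; cp++) {
        unsigned char out_a[4], out_b[4];
        size_t len_a = a(cp, out_a);
        size_t len_b = b(cp, out_b);
        if (len_a != len_b || memcmp(out_a, out_b, len_a) != 0)
            diffs++;
    }
    return diffs;
}
```

A result of zero differences between two real encoders would correspond to the commit's finding that the two implementations are the same encoding.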
Alex Dowad
e169ad3b61 Consolidate all single-byte encodings in one source file
We can squeeze out a lot of duplicated code in this way.
2020-11-11 11:18:59 +02:00
Alex Dowad
3e7acf901d Remove mbstring identify filters
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same time, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
2020-11-09 13:45:17 +02:00
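The detection scheme described in the entry above can be sketched as a small scoring routine. This is an illustration under assumptions, not the mbstring code: the `candidate` struct and function names are hypothetical, treating tab, LF and CR as harmless is a guess, and only the BMP private use area is checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one candidate encoding after a trial conversion. */
typedef struct {
    const char *name;     /* label only */
    int valid;            /* nonzero if the input decoded without errors */
    const uint32_t *cps;  /* codepoints the input decoded to */
    size_t n;             /* number of codepoints */
} candidate;

/* Nonzero for codepoints which make a candidate look less likely. */
int is_suspicious(uint32_t cp) {
    /* Control characters such as ESC, form feed and bell. Treating tab,
     * LF and CR as harmless is an assumption of this sketch. */
    if (cp < 0x20 && cp != '\t' && cp != '\n' && cp != '\r')
        return 1;
    if (cp == 0x7F)
        return 1;
    /* Private use area (BMP range only, for brevity). */
    if (cp >= 0xE000 && cp <= 0xF8FF)
        return 1;
    return 0;
}

/* Counts suspicious codepoints in a decoded string. */
size_t count_demerits(const uint32_t *cps, size_t n) {
    size_t demerits = 0;
    for (size_t i = 0; i < n; i++)
        if (is_suspicious(cps[i]))
            demerits++;
    return demerits;
}

/* Returns the index of the valid candidate with the fewest demerits,
 * or -1 if no candidate is valid. A less likely candidate is still
 * chosen when nothing better is available. */
int pick_candidate(const candidate *cands, size_t count) {
    assert(cands != NULL || count == 0);
    int best = -1;
    size_t best_demerits = 0;

    for (size_t i = 0; i < count; i++) {
        if (!cands[i].valid)
            continue;
        size_t d = count_demerits(cands[i].cps, cands[i].n);
        if (best < 0 || d < best_demerits) {
            best = (int)i;
            best_demerits = d;
        }
    }
    return best;
}
```

Unlike the old scheme, a candidate with demerits is only ranked lower, not rejected outright.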
Alex Dowad
cc03c54c36 Remove useless byte{2,4}{be,le} encodings from mbstring
There is no meaningful difference between these and UCS-{2,4}. They are
just a little bit more lax about passing errors silently. They also have
no known use.

Alias to UCS-{2,4} in case someone, somewhere is using them.
2020-11-09 13:45:16 +02:00
Alex Dowad
62317d592f Remove redundant includes from mbstring (and make sure correct config.h is used)
Very interesting... it turns out that when Valgrind support was enabled,
`#include "config.h"` from within mbstring was actually including the file "config.h"
from Valgrind, and not the one from mbstring!!

This is because -I/usr/include/valgrind was added to the compiler invocation _before_
-Iext/mbstring/libmbfl.

Make sure we actually include the file which was intended.
2020-08-31 23:17:58 +02:00
Alex Dowad
d4ef7ef11d Inline unneeded indirection for mbstring memory management
All memory allocation and deallocation for mbstring bounces through a table of
function pointers before going to emalloc/efree/etc. But this is unnecessary.
The allocators are never swapped out. Better to just call them directly.
2020-08-31 23:16:09 +02:00
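The indirection removed by the entry above can be pictured with the following sketch. The `alloc_table` struct and the `dup_*` helpers are hypothetical, and `malloc`/`free` stand in for `emalloc`/`efree`, which are only available inside PHP.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Before: a table of function pointers that every allocation goes through.
 * Hypothetical layout; the real mbstring table is not reproduced here. */
typedef struct {
    void *(*malloc_fn)(size_t size);
    void (*free_fn)(void *ptr);
} alloc_table;

/* The table is filled in once and never swapped out. */
static const alloc_table default_allocators = { malloc, free };

/* Copies a string, allocating through the function pointer table. */
char *dup_indirect(const char *s) {
    assert(s != NULL);
    size_t n = strlen(s) + 1;
    char *p = default_allocators.malloc_fn(n);
    if (p)
        memcpy(p, s, n);
    return p;
}

/* After: the same copy, calling the allocator directly. */
char *dup_direct(const char *s) {
    assert(s != NULL);
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p)
        memcpy(p, s, n);
    return p;
}
```

Both helpers behave identically; the direct version simply drops a layer that is only worthwhile if the allocators can actually be replaced.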
Christoph M. Becker
737c1b492c Put oniguruma include path to proper CFLAGS 2019-07-19 20:04:47 +02:00
Christoph M. Becker
504cd03fc3 Move Oniguruma related config stuff to where it belongs
Oniguruma is exclusively used by ext/mbstring, and only if mbregex is
enabled.  Therefore it is unnecessary and confusing to have Oniguruma
related config stuff scattered elsewhere.

While we're at it, we also remove the referral to the bundled libonig
which is removed as of PHP 7.4.0, and the duplicated call to
`PHP_INSTALL_HEADERS()`.
2019-07-19 19:30:41 +02:00
Peter Kokot
359a78b16c Remove unused defines
Used in php-src in the past, but since removed and no longer used:
- HAVE_CURL_EASY_STRERROR
- HAVE_CURL_MULTI_STRERROR
- HAVE_NEW_MIME2TEXT
- HAVE_MBSTR_CN
- HAVE_MBSTR_JA
- HAVE_MBSTR_KR
- HAVE_MBSTR_RU
- HAVE_MBSTR_TW

Part of Oniguruma, which doesn't use these anymore:
- NOT_RUBY
- HAVE_STDARG_PROTOTYPES

Unused:
- HAVE_MPIR

Closes GH-4427
2019-07-18 02:21:39 +02:00
Anatol Belski
e10349152b Sync with ZEND_ENABLE_STATIC_TSRMLS_CACHE enablement in ext/mbstring 2019-03-12 21:33:43 +01:00
Anatol Belski
2d7658959e Unbundle oniguruma in config.w32 2019-02-11 14:53:19 +01:00
Nikita Popov
d1c1481081 Unbundle oniguruma
And also switch detection over to pkg-config.
2019-02-11 14:53:19 +01:00
Peter Kokot
7dd62811ce Remove HAVE_STDLIB_H
The C89 and later standard defines the `<stdlib.h>` header as part of
the standard headers [1] and on current systems it is always present
and the `HAVE_STDLIB_H` symbol can be removed.

Also Autoconf suggests doing this and relying on C89 or above [2] and [3].

[1] https://port70.net/~nsz/c/c89/c89-draft.html#4.1.2
[2] http://git.savannah.gnu.org/cgit/autoconf.git/tree/lib/autoconf/headers.m4
[3] https://www.gnu.org/software/autoconf/manual/autoconf-2.69/autoconf.html
2018-09-16 20:53:53 +02:00
Peter Kokot
8d3f8ca12a Remove unused Git attributes ident
The $Id$ keywords were used in Subversion where they can be substituted
with filename, last revision number change, last changed date, and last
user who changed it.

In Git this functionality is different and can be done with the Git
attribute ident. These need to be defined manually for each file in the
.gitattributes file and are afterwards replaced with the 40-character
hexadecimal blob object name, which is based only on the particular file
contents.

This patch simplifies handling of $Id$ keywords by removing them since
they are not used anymore.
2018-07-25 00:53:25 +02:00
Christoph M. Becker
d48b233991 Update to Oniguruma 6.7.1
We also apply the still relevant parts of `oniguruma.patch` and update
the patch accordingly.
2018-03-10 01:07:00 +01:00
Peter Kokot
5c5bd30339 Remove --with-libmbfl configure option
The bundled libmbfl library is no longer API or ABI compatible with
the (currently unmaintained) upstream library. As such, building
against an external libmbfl is no longer possible.
2017-10-28 16:11:30 +02:00
Anatol Belski
2a76d2282a upgrade to Oniguruma 6.1.2 2016-11-25 22:00:53 +01:00
Anatol Belski
864cd82ace Merge remote-tracking branch 'origin/master' into native-tls
* origin/master:
  updated NEWS
  refactored the mbstring config.w32
  Update NEWS
  Fixed compilation warnings
  Fixed bug #68504 --with-libmbfl configure option not present on Windows
  Changed "finally" handling. Removed EX(fast_ret) and EX(delayed_exception). Allocate and use additional IS_TMP_VAR slot on VM stack instead.
  the darwin specific test fails for me with the same output which is the expected for the original test I couldn't find anybody who managed to see this test passing, but I found a bunch of other reports on qa.php.net/reports and on google which do see this test failing on mac. if this change causes you to have this test failing on Mac, please drop me a mail so we can improve the current test so it passes for everybody.
  #68446 is fixed
  Reimplemented silence operator (@) handling on exceptions. Now each silence region is stored in op_array->brk_cont_array. On exception, the ZEND_HANDLE_EXCEPTION handler traverses this array and restores the original EG(error_reporting) if the exception occurred inside a "silence" region.
  remove the NEWS entries for the reverted stuff
  typo fix
  go back with phpdbg to the state of 5.6.3, reverting the controversial commits(remote debugging/xml protocol)
  5.5.21 now
  New label length test
  Fix ext/filter/tests/033.phpt
  Fix filter_list test
  FILTER_VALIDATE_DOMAIN and RFC conformance for FILTER_VALIDATE_URL

Conflicts:
	ext/mbstring/config.w32
2014-11-27 15:59:43 +01:00
Anatol Belski
42af411620 refactored the mbstring config.w32 2014-11-27 13:37:00 +01:00
Anatol Belski
3ec8730e89 Fixed bug #68504 --with-libmbfl configure option not present on Windows 2014-11-27 09:14:47 +01:00
Anatol Belski
0490a32249 more exts converted for static tsrm ls pointer
mbstring, pcre, reflection
2014-10-15 19:19:23 +02:00
Rui Hirokawa
4122ef275c added iso2022jp-mobile and emoji unsupported in unicode 6.0. 2011-08-24 15:28:44 +00:00
Pierre Joye
60dd9e0bd9 - fix typo & build 2011-08-22 07:39:09 +00:00
Rui Hirokawa
c746cf5dc9 updated libmbfl to 1.3.2 (JISX-0213:2004 support). 2011-08-20 07:24:04 +00:00
Rui Hirokawa
484e6b8fb3 added gb18030 encoding to mbstring/libmbfl. 2011-08-14 14:09:11 +00:00
Rui Hirokawa
1ec46d3fe3 fixed win32 build. 2011-08-13 12:53:40 +00:00
Rui Hirokawa
52948b534c added new files of libmbfl 1.3.0. 2011-08-02 02:50:11 +00:00
Pierre Joye
a7ffa09e18 - add PHP_INSTALL_HEADERS to all parts (core&exts) exposing headers, generate the install-headers cmd 2010-12-11 22:18:10 +00:00
Moriyoshi Koizumi
872f07aa5e - Fix win32 build. (notified by Rob. Thanks) 2010-03-15 14:19:51 +00:00
Moriyoshi Koizumi
d9dda48f8a - Update the bundled libmbfl to the latest on upstream. 2010-03-12 04:55:37 +00:00
Kalle Sommer Nielsen
4b17fee3b9 Fixed static build of mbstring on Windows (makes static build of exif possible too) 2009-06-11 23:37:51 +00:00
Jani Taskinen
a0f3cf5cc4 MFB: Thanks to the "maintainers" who are too lazy to commit FIRST to HEAD! 2009-04-20 17:06:03 +00:00
Moriyoshi Koizumi
935fa7a97e - Fix win32 build 2008-07-24 16:59:53 +00:00
Antony Dovgal
d7ab2da30b there is no such file 2007-07-16 19:07:22 +00:00
Frank M. Kromann
16ccbf0c0c MFB: Fix win32 build 2007-02-04 00:23:32 +00:00
Frank M. Kromann
af741730f4 Fix win32 build 2006-11-04 17:25:37 +00:00
Rui Hirokawa
bcf3a3311d added turkish language support for libmbfl. 2005-12-23 13:53:30 +00:00
Moriyoshi Koizumi
542901d705 - Add Armenian encoding / NLS (patch by Hayk Chamyan) 2005-03-22 22:22:11 +00:00
Moriyoshi Koizumi
5b5e012bc2 - Update libmbfl (fixes bug #30549 and #31911).
- Update oniguruma to 3.7.0
2005-02-20 22:18:09 +00:00
Wez Furlong
a8757b11e6 Enable mbregex in win32 build 2004-04-08 11:01:51 +00:00
Moriyoshi Koizumi
a91e44c830 - Add missing include path. 2004-03-03 10:27:19 +00:00
Moriyoshi Koizumi
9e9d7d1743 - proper DLL linkage specifier.
# oniguruma.h:34-
#
# #ifndef ONIG_EXTERN
# #if defined(_WIN32) && !defined(__CYGWIN__)
# #if defined(EXPORT) || defined(RUBY_EXPORT)
# #define ONIG_EXTERN   extern __declspec(dllexport)
# #else
# #define ONIG_EXTERN   extern __declspec(dllimport)
# #endif
# #endif
# #endif
2004-03-02 22:38:21 +00:00
Moriyoshi Koizumi
bc4d64477a - Fix typo. 2004-03-02 20:18:14 +00:00
Moriyoshi Koizumi
1dfd0bd901 - Really fix the build.
# Should be fixed now :|
2004-03-02 15:59:30 +00:00
Edin Kadribasic
f067c0479f Temporary fix for win32 build 2004-03-02 11:50:10 +00:00
Moriyoshi Koizumi
03bdd13560 - Fix win32 build.
# Thanks Nuno Lopes & Derick for letting me know.
2004-03-01 20:25:33 +00:00
Wez Furlong
05b9b20ed8 Add new (optional!) win32 build infrastructure.
Will follow up to internals@ shortly.
2003-12-02 23:17:04 +00:00