Commit Graph

181 Commits

Author SHA1 Message Date
Yasuo Ohgaki
a84e5dc37d Remove unneeded string copy.
Allow setting '' (empty string) for internal/input/output_encoding for better compatibility, i.e. runtime INI value changes.
Comply more closely with the RFC. Improve/add encoding handling tests, i.e. detect the encoding rather than setting it automagically.
2014-03-27 17:20:57 +09:00
Yasuo Ohgaki
e1fe76f28a Add default_charset handling 2014-03-20 10:50:32 +09:00
Yasuo Ohgaki
cbd108abf1 Implement RFC https://wiki.php.net/rfc/default_encoding 2014-02-13 11:54:52 +09:00
Xinchen Hui
c081ce628f Bump year 2014-01-03 11:08:10 +08:00
Christopher Jones
9ad97cd489 Reduce (some) compile noise of 'unused variable' and 'may be used uninitialized' warnings. 2013-08-14 20:36:50 -07:00
Gustavo Lopes
77ee200097 Fix bug #64011 (get_html_translation_table())
get_html_translation_table() with encoding ISO-8859-1 and HTMLENTITIES
was broken. Only entities for characters U+0000 to U+0040 were being
included in the result.
2013-01-18 12:10:27 +01:00
Xinchen Hui
0a7395e009 Happy New Year 2013-01-01 16:28:54 +08:00
Gustavo André dos Santos Lopes
cfdd6c5788 MFH: 7dcada1 for 5.4
- Fixed possible unsigned int wrap around in html.c. Note that 5.3 has the same
  (potential) problem; even though the code is substantially different, the
  variable name and the fashion it was incremented was kept.
2012-03-19 16:36:21 +00:00
Gustavo André dos Santos Lopes
ed98579924 - Fixed bug #61374: html_entity_decode tries to decode code points that don't
exist in ISO-8859-1.
2012-03-13 18:08:30 +00:00
Gustavo André dos Santos Lopes
d4cf399cc4 - Merge r323056 (see bug #60965). 2012-02-05 09:59:33 +00:00
Felipe Pena
4e19825281 - Year++ 2012-01-01 13:15:04 +00:00
Gustavo André dos Santos Lopes
79bb42548d - Less GCC warnings; code less readable, yay!
- Fixed html_tables.h generation in 64-bit archs.
- Closes bug #55394 - Patch to suppress initialization warnings in html.c
#signed/unsigned mismatches for another day
#regenerated tables on another commit
2011-08-31 05:45:02 +00:00
Xinchen Hui
5540b64a3d Eliminated compiler's warnings 2011-08-10 11:59:11 +00:00
Gustavo André dos Santos Lopes
a61534eab8 - Elided unused argument in internal linkage function. 2011-08-09 00:40:45 +00:00
Gustavo André dos Santos Lopes
547a96090f - Fixed bug #54332 (trunk only, null pointer deref due to information loss on long to int conversion)
- Fixed some int* pointers being passed as size_t*.
2011-03-20 15:15:08 +00:00
Gustavo André dos Santos Lopes
4a946a91e5 - Fixed CHARSET_UNICODE_COMPAT (ISO-8859-1 is compatible in the relevant sense).
- Fixed usage of zend_multibyte_get_internal_encoding (its return cannot be
  cast to char*).
- Change tests to reflect that charset detection now relies on
  internal_encoding, not on current_internal_encoding.
  NOTE: This fixes the changes in rev 306077, but it remains that that change
  introduced a BC break. I assumed it was intentional
2011-01-25 10:57:07 +00:00
Felipe Pena
0203cc3d44 - Year++ 2011-01-01 02:17:06 +00:00
Dmitry Stogov
755c2cd0d8 Removed compile time dependency from ext/mbstring 2010-12-08 11:27:34 +00:00
Pierrick Charron
71dfe80e05 Remove unused variables 2010-11-17 17:55:18 +00:00
Gustavo André dos Santos Lopes
e69b1ff2c4 - Fixed bug #49687 (utf8_decode vulnerabilities and deficiencies in the number
of reported malformed sequences). (Gustavo)
#Made a public interface for get_next_char/utf-8 in trunk to use in utf8_decode.
#In PHP 5.3, trunk's get_next_char was copied to xml.c because 5.3's
#get_next_char is different and is not prepared to recover appropriately from
#errors.
2010-10-27 18:13:25 +00:00
Ilia Alshanetsky
18fa045e75 Code cleanup & CS 2010-10-25 16:46:55 +00:00
Gustavo André dos Santos Lopes
20e2c5fc33 - Fixed uninitialized and 1 character short local variable. 2010-10-24 21:19:04 +00:00
Gustavo André dos Santos Lopes
91727cb844 - Completed rewrite of html.c. Except for determine_charset, almost nothing
remains.
- Fixed bug on determine_charset that was preventing correct detection in
  combination with internal mbstring encoding "none", "pass" or "auto".
- Added profiles for entity encode/decode for HTML 4.01, XHTML 1.0, XML 1.0
  and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and
  ENT_HTML5.
- htmlentities()/htmlspecialchars(), when told not to double encode, verify
  the correctness of the existing entities more thoroughly.
  It is checked whether the numerical entity represents a valid unicode code
  point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED,
  it is also checked whether that numerical entity is valid in selected
  document. In HTML 4.01, all the numerical entities that represent a Unicode
  code point (<= U+10FFFF) are valid, but that's not the case with other
  document types. If the entity is not valid, & is encoded to &amp;.
  For named entities, the check is also more thorough. While before the only
  check would be to determine if the entity was constituted by alphanumeric
  characters, now it is checked whether that entity is necessarily defined for
  the target document type. Otherwise, & is encoded to &amp;.
- For html_entity_decode(), only valid numerical and named entities (as defined
  above for htmlentities()/htmlspecialchars() + !double_encode) are decoded.
  But there is in this case one additional check. Entities that represent
  non-SGML or otherwise invalid characters are not decoded. Note that, in
  HTML5, U+000D is a valid literal character, but the entity &#x0D; is not
  valid and is therefore not decoded.
- The hash tables lazily created for decoding in html_entity_decode() that were
  added recently were substituted by static hash tables. Instead of 1 hash
  table per encoding, there's only one hash table per document type defined in
  terms of unicode code points. This means that for charsets other than UTF-8
  and ISO-8859-1, a conversion to unicode code points is necessary before
  decoding.
- On the encoding side, the ad hoc ranges of entities of the translation
  tables, which mapped (in general) non-unicode code points to HTML entities
  were replaced by three-stage tables for HTML 4 and HTML 5. This mapping
  tables are defined only in terms of unicode code points, so a conversion
  is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the
  multi-stage table is much faster than the previous method, by a factor
  of 5; the conversion to unicode is a small penalty because it's just a
  simple table lookup.
  XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage
  table.
- Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars()
  replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#xFFFD;
  (other encodings).
- Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot
  appear literally are replaced by U+FFFD (UTF-8) or &#xFFFD; (otherwise).
  An alternative implementation would be to encode those characters into
  numerical entities, but that would only work in HTML 4.01 due to limitations
  on the values of numerical entities in other document types. See also the
  effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 15:01:02 +00:00
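
The three-stage table mentioned in the commit above can be illustrated with a minimal sketch. The 12/6/6 bit split, the names `build_tables` and `lookup`, and the sample mapping are assumptions made for illustration; they are not taken from html.c. Empty blocks are shared, which is what keeps a sparse mapping over the whole Unicode range small while each lookup stays at three index operations.

```python
# Hypothetical three-stage lookup from Unicode code point to entity name.
# Assumed split: stage 1 = cp >> 12, stage 2 = (cp >> 6) & 0x3F, stage 3 = cp & 0x3F.

EMPTY_STAGE3 = [None] * 64            # shared block: no entity for these 64 code points
EMPTY_STAGE2 = [EMPTY_STAGE3] * 64    # shared block: 64 empty stage-3 blocks


def build_tables(mapping, max_cp=0x10FFFF):
    """Build the nested tables; only blocks that hold an entry are copied."""
    stage1 = [EMPTY_STAGE2] * ((max_cp >> 12) + 1)
    for cp, name in mapping.items():
        i, j, k = cp >> 12, (cp >> 6) & 0x3F, cp & 0x3F
        if stage1[i] is EMPTY_STAGE2:
            stage1[i] = list(EMPTY_STAGE2)      # private copy of the stage-2 block
        if stage1[i][j] is EMPTY_STAGE3:
            stage1[i][j] = list(EMPTY_STAGE3)   # private copy of the stage-3 block
        stage1[i][j][k] = name
    return stage1


def lookup(stage1, cp):
    """Return the entity name for cp, or None if there is none."""
    if not 0 <= cp <= 0x10FFFF:
        return None
    return stage1[cp >> 12][(cp >> 6) & 0x3F][cp & 0x3F]


# Illustrative mapping only; the real tables are generated from the entity lists.
tables = build_tables({0x3C: "lt", 0xE9: "eacute", 0x20AC: "euro"})
```

With this layout, `lookup(tables, 0xE9)` walks three list indexes and never scans a range of entities, which is the property the commit message credits for the speedup.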
Gustavo André dos Santos Lopes
bfcb754eae - Fixed get_next_char(), used by htmlentities/htmlspecialchars, accepting
certain ill-formed UTF-8 sequences.
2010-10-14 19:14:06 +00:00
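
The commit above does not say which ill-formed sequences were accepted. As a rough sketch of what "ill-formed" covers, the snippet below leans on Python's strict UTF-8 decoder, which rejects overlong forms, encoded surrogates, values above U+10FFFF and truncated sequences. The helper name `is_well_formed_utf8` is hypothetical and this is not how get_next_char() is written.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "valid e-acute": "\u00e9".encode("utf-8"),
    "overlong slash": b"\xc0\xaf",
    "encoded surrogate": b"\xed\xa0\x80",
    "above U+10FFFF": b"\xf4\x90\x80\x80",
    "truncated sequence": b"\xe2\x82",
}
results = {label: is_well_formed_utf8(raw) for label, raw in samples.items()}
```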
Gustavo André dos Santos Lopes
4de6c3a948 - Added a 3rd parameter to get_html_translation_table. It now takes a charset
hint, like htmlentities et al.
- Fixed bug #49407 (get_html_translation_table doesn't handle UTF-8).
- Fixed bug #25927 (get_html_translation_table calls the ' &#39; instead of
  &#039;).
- Fixed tests for get_html_translation_table and unified the Windows and
  non-Windows versions of the tests.
2010-10-12 02:51:11 +00:00
Gustavo André dos Santos Lopes
f4a896c209 - PHP uses a big endian representation when it converts the
code unit sequences to integers so as to store the entity
  maps. Code in traverse_for_entities assumed little
  endian. Fixed.
  (in practice, due to the absence of unicode and entity
  mappings for multi-byte encodings -- except UTF-8 --, this
  doesn't matter, so the relevant code was commented out for
  performance reasons).
2010-10-11 22:26:10 +00:00
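
The byte-order mismatch described in the commit above can be shown with a small sketch. The function name `pack_code_units` is an assumption for illustration; the point is only that the same code unit sequence packs to different integers depending on the assumed byte order, so a lookup keyed on one order misses entries stored with the other.

```python
def pack_code_units(units: bytes, byteorder: str = "big") -> int:
    """Pack a code unit sequence into one integer using the given byte order."""
    return int.from_bytes(units, byteorder)


e_acute = "\u00e9".encode("utf-8")              # two code units: C3 A9
as_big = pack_code_units(e_acute, "big")        # first unit in the high byte
as_little = pack_code_units(e_acute, "little")  # first unit in the low byte
```

For single-byte sequences both orders agree, which is consistent with the commit's remark that the difference did not matter in practice for single-byte charsets.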
Gustavo André dos Santos Lopes
7aa43a8d83 - Revamp of the decoding portion of html.c.
- Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the
  string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a
  ~250 characters string (for 2nd and subsequent calls).
- Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded,
  but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly
  shared.
- Code of html_entity_decode and htmlspecialchars_decode is now shared.
- [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;,
  &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded.
- [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When
  the code points for some character were not the same in unicode and the target encoding, the
  behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all.
  Added unicode translation tables for all single-byte encodings. Entities are not translated for
  multi-byte encodings, except for ASCII characters whose code points are shared. We could add
  the huge translation tables (several thousand elements) for those encodings in the future.
- Fixed numerical entities being accepted when the text after # was merely accepted by strtol.
- Much more commented and well-structured code...
- Tests for get_html_translation_table() are broken. I started fixing the tests, but then I realized
  it was completely hopeless because get_html_translation_table() is broken by not handling
  multi-byte characters correctly.
2010-10-10 19:04:59 +00:00
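
The win-1252 problem described in the commit above comes from treating the number in a numerical entity as a byte value instead of a Unicode code point. A minimal sketch, assuming Python's `cp1252` and `latin-1` codecs stand in for the translation tables (the helper `numeric_entity_to_bytes` is hypothetical):

```python
def numeric_entity_to_bytes(code_point: int, charset: str):
    """Translate the code point of a numerical entity into bytes of charset.

    Returns None when the code point has no representation in charset,
    in which case the entity would be left undecoded.
    """
    if not 0 <= code_point <= 0x10FFFF:
        return None
    try:
        return chr(code_point).encode(charset)
    except UnicodeEncodeError:
        return None


euro_cp1252 = numeric_entity_to_bytes(0x20AC, "cp1252")  # euro sign is byte 0x80
c1_cp1252 = numeric_entity_to_bytes(0x80, "cp1252")      # U+0080 has no cp1252 byte
c1_latin1 = numeric_entity_to_bytes(0x80, "latin-1")     # but it exists in ISO-8859-1
```

In this sketch `&#8364;` maps to byte 0x80 in windows-1252 while `&#128;` maps to nothing, which is the opposite of what a byte-value interpretation would produce.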
Gustavo André dos Santos Lopes
dd5d1b2b66 - Fixed a typo in rev #304208 (24 instead of 34/'"').
- Improved the test bug53021.phpt to reflect other fixes in rev #304208.
- Updated NEWS to reflect other fixes in rev #304208.
2010-10-08 17:27:19 +00:00
Gustavo André dos Santos Lopes
df42830468 - Fixed bug #53021 (In html_entity_decode, failure to convert numeric entities with ENT_NOQUOTES and ISO-8859-1). 2010-10-08 16:19:58 +00:00
Kalle Sommer Nielsen
cb50011016 Fixed compiler warnings in the standard library 2010-09-23 03:45:36 +00:00
Rasmus Lerdorf
906dd4eac5 Switch default_charset, if not specified, from ISO-8859-1 to UTF-8
I have been wanting to make this change for years, but there is a small
chance of BC issues, so it shouldn't go into a minor release.
2010-03-23 18:08:06 +00:00
Moriyoshi Koizumi
73ba495674 - Forgot to commit this patch. Sorry. 2010-03-12 16:19:25 +00:00
Sebastian Bergmann
9ba1e81665 sed -i "s#1997-2009#1997-2010#g" **/*.c **/*.h **/*.php 2010-01-03 09:23:27 +00:00
Moriyoshi Koizumi
7d9a7dbad6 - Fix bug #46478 (htmlentities() uses obsolete mapping table for character
entity references)
2009-12-22 05:50:34 +00:00
Moriyoshi Koizumi
413196c574 - Take account of surrogate pairs. 2009-12-07 15:41:43 +00:00
Moriyoshi Koizumi
20737bac6a - Bug #49785: take 5. What the hell happened to me... 2009-10-13 05:18:37 +00:00
Moriyoshi Koizumi
884cf3f1c0 - Bug #49785: take 4 - typo. This flaw is harmless since the return value of get_next_char() is only used when UTF-8 is specified as the third argument. 2009-10-12 14:29:45 +00:00
Moriyoshi Koizumi
1835a63dfd - A couple more fixes for my previous fix.
(one of the fixes by Arnaud Le Blanc. Thanks!)
2009-10-11 23:52:33 +00:00
Moriyoshi Koizumi
9d19866476 - Fixed bug #49785 (insufficient input string validation of htmlspecialchars()). 2009-10-09 10:02:38 +00:00
Sebastian Bergmann
08659c2dcd MFH: Bump copyright year, 3 of 3. 2008-12-31 11:15:49 +00:00
Arnaud Le Blanc
18794addbd MFH: Added ENT_IGNORE as a compatibility flag for htmlentities() and
htmlspecialchars() to skip multibyte sequences instead of returning an
empty string (as iconv's //IGNORE). These functions will still never
return an invalid or incomplete multibyte sequence.
Fixes #43896
2008-11-26 03:00:06 +00:00
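
The difference between the default behavior and ENT_IGNORE described in the commit above can be sketched as follows. This is an analogy built on Python's `html.escape` and the `errors="ignore"` decoding mode, assuming UTF-8 input; the function `escape_bytes` and its `ignore_invalid` parameter are hypothetical and do not mirror PHP's signature.

```python
import html


def escape_bytes(data: bytes, ignore_invalid: bool = False) -> str:
    """Escape HTML special characters in UTF-8 input.

    Without ignore_invalid, any invalid multibyte sequence yields an empty
    string; with it, the invalid bytes are skipped and the rest is escaped.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        if not ignore_invalid:
            return ""
        text = data.decode("utf-8", errors="ignore")
    return html.escape(text, quote=True)


strict_result = escape_bytes(b"a<\xffb")                        # invalid byte 0xFF
ignore_result = escape_bytes(b"a<\xffb", ignore_invalid=True)   # 0xFF is dropped
```

Either way the output never contains an invalid or incomplete multibyte sequence, matching the guarantee stated in the commit message.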
Arnaud Le Blanc
a05edaf2bd MFB 5.2 2008-11-26 02:43:16 +00:00
Arnaud Le Blanc
d69dfa4b9f MFH: initialize optional vars 2008-10-21 22:08:38 +00:00
Moriyoshi Koizumi
0699894884 - MFH: beware of signedness 2008-08-18 03:26:21 +00:00
Arnaud Le Blanc
71e50de4fc MFH: Fixed bug #45581 (htmlspecialchars() double encoding &#x hex items) 2008-08-10 13:26:13 +00:00
Felipe Pena
fce4f9600e MFB: Fixed bug #44703 (htmlspecialchars() does not detect bad character set argument) 2008-04-11 19:06:12 +00:00
Stanislav Malyshev
223a53fdeb rm cruft 2008-01-29 22:03:01 +00:00
Antony Dovgal
37a607c7f8 fix #43927 (koi8r is missing from html_entity_decode())
patch by andy at demos dot su
2008-01-28 23:07:12 +00:00
Scott MacVicar
23e3baf62d Fix html_entity_decode when converting numeric HTML entities: the numeric values for the extended characters don't correspond to those of windows-1251 and cp866. 2008-01-25 18:10:45 +00:00
Sebastian Bergmann
d1dded8751 MFH: Bump copyright year, 2 of 2. 2007-12-31 07:17:19 +00:00