* break_iterator:
Fix typo in error message
BreakIterator: fix compat with old ICU versions
Fix build error in ext/intl
BreakIterator::getPartsIterator: new optional arg
Added IntlCodePointBreakIterator.
Add Intl prefix to BreakIterator/RuleBasedBI
Remove trailing space
Replaced zend_parse_method_params with plain zpp
BreakIter: Removed getAvailableLocales/getHashCode
Change in BreakIterator::getPartsIterator()
BreakIterator: add rules status constants
Tests for (RuleBased)BreakIterator.
BreakIterator and RuleBasedBreakIterator added
This was causing segfaults at least in the resourcebundle
constructor.
Also moved intl_locale_get_default() to a more central location
and fixed a constness warning in resourcebundle_ctor().
Can take one of:
* IntlPartsIterator::KEY_SEQUENTIAL (keys are 0, 1, ...)
* IntlPartsIterator::KEY_LEFT (keys are left boundaries)
* IntlPartsIterator::KEY_RIGHT (keys are right boundaries)
The default is IntlPartsIterator::KEY_SEQUENTIAL (the previous behavior).
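A minimal sketch of the three key types, assuming the Intl-prefixed class
names used elsewhere in this log, that the right-boundary constant is
IntlPartsIterator::KEY_RIGHT, and an illustrative input string:

```php
<?php
// Illustrative input; boundaries are byte offsets into the UTF-8 text.
$bi = IntlBreakIterator::createWordInstance('en_US');
$bi->setText('foo bar baz');

// Keys are 0, 1, 2, ... (the default).
$sequential = iterator_to_array(
    $bi->getPartsIterator(IntlPartsIterator::KEY_SEQUENTIAL));
// Keys are the offsets where each part starts.
$left = iterator_to_array(
    $bi->getPartsIterator(IntlPartsIterator::KEY_LEFT));
// Keys are the offsets where each part ends.
$right = iterator_to_array(
    $bi->getPartsIterator(IntlPartsIterator::KEY_RIGHT));

print_r(array_keys($sequential)); // 0, 1, 2, 3, 4
print_r(array_keys($left));       // 0, 3, 4, 7, 8
print_r(array_keys($right));      // 3, 4, 7, 8, 11
```

The values are the same in all three cases ("foo", " ", "bar", " ", "baz");
only the keys differ.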
Objects of this class can be instantiated with
IntlBreakIterator::createCodePointInstance()
The method does not take a locale, as it would not make sense in this
context.
This class has one additional method:
long IntlCodePointBreakIterator::getLastCodePoint()
which returns either -1 or the last code point we moved over, if any
(and discounting any movement before the last call to
IntlBreakIterator::first() or IntlBreakIterator::last()).
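A short sketch of how getLastCodePoint() might be used, assuming a PHP
version that supports the "\u{...}" escape; the input string is only an
illustration:

```php
<?php
$bi = IntlBreakIterator::createCodePointInstance();
// 'a' (1 byte), U+20AC (3 bytes in UTF-8), 'b' (1 byte)
$bi->setText("a\u{20AC}b");

$bi->first();
$afterFirst = $bi->getLastCodePoint(); // -1: nothing moved over yet

$pos1 = $bi->next();                   // moves over 'a'
$cp1  = $bi->getLastCodePoint();       // 0x61

$pos2 = $bi->next();                   // moves over U+20AC
$cp2  = $bi->getLastCodePoint();       // 0x20AC

printf("%d U+%04X, %d U+%04X\n", $pos1, $cp1, $pos2, $cp2);
```

Positions are byte offsets, so the second call to next() lands on offset 4,
not 2.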
BreakIterator::getPartsIterator() now returns an IntlIterator subclass
with a special method, getBreakIterator(), that returns the
associated BreakIterator.
Any call to getRuleStatus() is forwarded to the BreakIterator.
By Gustavo André dos Santos Lopes (4) and others
via Felipe Pena (2) and Xinchen Hui (2)
* PHP-5.4:
Remove unused code
Based on Microsoft's description, converting a FILETIME struct directly to __int64 is unsafe.
merge 5.3 entries
restore NEWS
Fix ext/intl build on ICU < 4.8
Optimization in ext/intl/msgformat
Fixed tests in ext/intl
Changed XFAILed collator_get_sort_key.phpt
By Gustavo André dos Santos Lopes (4) and others
via Felipe Pena (1) and Xinchen Hui (1)
* PHP-5.3:
Remove unused code
Based on Microsoft's description, converting a FILETIME struct directly to __int64 is unsafe.
Fix ext/intl build on ICU < 4.8
Optimization in ext/intl/msgformat
Fixed tests in ext/intl
Changed XFAILed collator_get_sort_key.phpt
This commit adds wrappers for the classes BreakIterator and
RuleBasedBreakIterator. The C++ ICU classes are described here:
<http://icu-project.org/apiref/icu4c/classBreakIterator.html>
<http://icu-project.org/apiref/icu4c/classRuleBasedBreakIterator.html>
Additionally, a tutorial is available at:
<http://userguide.icu-project.org/boundaryanalysis>
This implementation wraps UTF-8 text in a UText. The text is
iterated without any copying or conversion to UTF-16. There is
also no validation that the input is actually UTF-8; where there
are malformed sequences, the UText will simply return U+FFFD.
The class BreakIterator cannot be instantiated directly (has a
private constructor). It provides the interface exposed by the ICU
abstract class with the same name. The PHP class is not abstract
because we may use it to wrap native subclasses of BreakIterator
that we don't know how to wrap. This class includes methods to
move the iterator position to the beginning (first()), to the
end (last()), forward (next()), backwards (previous()), to the
boundary preceding a certain position (preceding()) and following
a certain position (following()) and to obtain the current position
(current()). next() can also be used to advance or recede an
arbitrary number of positions.
BreakIterator also exposes other native methods:
getAvailableLocales(), getLocale() and factory methods to build
several predefined types of BreakIterators: createWordInstance()
for word boundaries, createCharacterInstance() for locale
dependent notions of "characters", createSentenceInstance() for
sentences, createLineInstance() and createTitleInstance() -- for
title casing breaks. These factories currently return
RuleBasedBreakIterators where the names of the rule sets are found
in the ICU data, observing the passed locale (although the locale
is taken into consideration, there are very few exceptions to the
root rules).
The clone and compare_object PHP object handlers are also
implemented, though the comparison does not yield meaningful results
when used with >, <, >= and <=.
Note that BreakIterator is an iterator only in the sense of the
first 'Iterator' in 'IteratorIterator', i.e., it does not
implement the Iterator interface. The reason is that there is
no sensible implementation for Iterator::key(). Using it for
an ordinal of the current boundary is not feasible because
we are allowed to move to any boundary at any time. If we were
to determine the current ordinal when last() is called we'd
have to traverse the whole input text to find out how many
breaks there were before. Therefore, BreakIterator implements
only Traversable. It can be wrapped in an IteratorIterator,
but the usual warnings apply.
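As a rough illustration of the Traversable-only design, using the later
Intl-prefixed class name and an illustrative input string, the boundaries
can be walked with foreach or through an IteratorIterator:

```php
<?php
$bi = IntlBreakIterator::createWordInstance('en_US');
$bi->setText('foo bar');

var_dump($bi instanceof Traversable); // bool(true)
var_dump($bi instanceof Iterator);    // bool(false)

// foreach yields the boundary offsets, starting from first().
$offsets = [];
foreach ($bi as $offset) {
    $offsets[] = $offset;
}
print_r($offsets); // 0, 3, 4, 7

// Wrapping works too; keys are discarded here because the break
// iterator has no meaningful notion of a key.
$wrapped = iterator_to_array(new IteratorIterator($bi), false);
```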
Finally, I added a convenience method to BreakIterator:
getPartsIterator(). This provides an IntlIterator, backed
by the BreakIterator PHP object (i.e. moving the pointer or
changing the text in BreakIterator affects the iterator
and also moving the iterator affects the backing BreakIterator),
which allows traversing the text between each boundary.
This iterator uses the original text to retrieve the text
between two positions, not the code points returned by the
wrapping UText. Therefore, if the text includes invalid code
unit sequences, these invalid sequences will be in the output
of this iterator, not U+FFFD code points.
The class RuleBasedBreakIterator exposes a constructor that allows
building an iterator from arbitrary compiled or non-compiled
rules. The form of these rules is described in the tutorial linked
above. The rest of the methods allow retrieving the rules --
getRules() and getCompiledRules() --, a hash code of the rule set
(hashCode()) and the rules statuses (getRuleStatus() and
getRuleStatusVec()).
Because the RuleBasedBreakIterator constructor may return parse
errors, I reuse the UParseError to text function that was in the
transliterator files. Therefore, I moved that function to
intl_error.c.
common_enum.cpp was also changed, mainly to expose previously
static functions. This avoided code duplication when implementing
the BreakIterator iterator and the IntlIterator returned by
BreakIterator::getPartsIterator().
Resurrected and limited to ICU 4.8 in the hope that the sort keys
will remain stable in more recent ICU versions. I have only tested
with ICU 4.8 so far.
The following changes were made:
* The IntlDateFormatter constructor now accepts the usual values
for its $timezone argument. This includes timezone identifiers,
IntlTimeZone objects, DateTimeZone objects and NULL. An empty
string is not accepted. An invalid time zone is no longer accepted
(it used to use UTC in this case).
* When NULL is passed to IntlDateFormatter, the time zone specified in
date.timezone is used instead of the ICU default.
* The IntlDateFormatter $calendar argument now also accepts an
IntlCalendar. In this case, IntlDateFormatter::getCalendar() will
return false.
* The time zone passed to the IntlDateFormatter is ignored if it is
NULL and if the calendar passed is an IntlCalendar object -- in this
case, the IntlCalendar time zone will be used instead. Otherwise,
the time zone specified in the $timezone argument is used instead.
* Added IntlDateFormatter::getCalendarObject(), which always returns
the IntlCalendar object that backs the DateFormat, even if a
constant was passed to the constructor, i.e., if an IntlCalendar
was not passed to the constructor.
* Added IntlDateFormatter::setTimeZone(). It accepts the usual values
for time zone arguments. If NULL is passed, the time zone of the
IntlDateFormatter WILL be overridden with the default time zone,
even if an IntlCalendar object was passed to the constructor.
* Added IntlDateFormatter::getTimeZone(), which returns the time zone
that's associated with the DateFormat.
* Deprecated IntlDateFormatter::setTimeZoneId() and made it an alias
for IntlDateFormatter::setTimeZone(), as the new ::setTimeZone()
also accepts plain identifiers, besides other types.
IntlDateFormatter::getTimeZoneId() is not deprecated however.
* IntlDateFormatter::setCalendar() with a constant passed should now
work correctly. This requires saving the requested locale to the
constructor.
* Centralized the hacks required to avoid compilation disasters on
Windows due to some headers being included inside and outside of
extern "C" blocks.
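A hedged sketch of the time zone handling described in the list above; the
locale, styles and zone identifiers are illustrative choices, not taken
from the commit:

```php
<?php
// $timezone may be an identifier, an IntlTimeZone, a DateTimeZone or NULL.
$fmt = new IntlDateFormatter(
    'en_US',
    IntlDateFormatter::FULL,
    IntlDateFormatter::FULL,
    new DateTimeZone('Europe/Lisbon')
);
echo $fmt->getTimeZoneId(), "\n"; // Europe/Lisbon

// Always an IntlCalendar, even though no calendar object was passed.
$backing = $fmt->getCalendarObject();

// setTimeZone() accepts the same kinds of values as the constructor.
$fmt->setTimeZone('Europe/Madrid');
echo $fmt->getTimeZone()->getID(), "\n"; // Europe/Madrid

// With a NULL time zone and an IntlCalendar, the calendar's zone is used.
$cal  = IntlCalendar::createInstance('Asia/Tokyo', 'en_US');
$fmt2 = new IntlDateFormatter(
    'en_US',
    IntlDateFormatter::FULL,
    IntlDateFormatter::FULL,
    null,
    $cal
);
echo $fmt2->getTimeZoneId(), "\n"; // Asia/Tokyo
```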
There's no change from the intended behavior. If INTL_G(default_locale)
is NULL, the default ICU locale, as given by locale_get_default() in
master, will still be used by ures_open().
null is now accepted for the first two (mandatory) arguments.
Passing null as the package name causes NULL to be passed to ICU and
the default ICU data to be loaded.
Passing null as the locale name causes the default locale to be used.
Memory leak in IntlDateFormatter constructor.
udat_setCalendar() clones the calendar before it adopts it,
so we were leaking the original calendar.
Also we now validate the calendar type.
I don't think the current ICU API allows this bug to be completely fixed.
Right now, the code cannot control the time zone used in date/time formats
that appear inside complex subformats. See the comment inside
umsg_set_timezone().
IntlTimeZone::fromDateTimeZone(DateTimeZone $dtz) converts from an
ext/date TimeZone to an IntlTimeZone. The conversion is done by feeding
the time zone name (essentially what would be given by
DateTimeZone::getName()) to ICU's TimeZone::createTimeZone except if it's
an offset time zone. In that case, the offset is read from the ext/date
time zone object structure and an appropriate id (of the form
GMT<+|-><HH:MM>) is given to ICU's TimeZone::createTimeZone. Not all
ext/date time zones are recognized by ICU. For instance, WEST is not.
Note that this kind of abbreviation, as far as I can tell, can only be
created via ext/date DateTime, not directly through DateTimeZone's
constructor.
For IntlTimeZone::toDateTimeZone(), the behavior is symmetrical.
We instantiate a DateTimeZone and then call its constructor if we don't
have an offset time zone, otherwise we mess with its structure. If the
timezone is not valid for ext/date, then we allow the exception of
DateTimeZone constructor to propagate.
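A minimal round-trip sketch for a named (non-offset) time zone; the
identifier is only an example:

```php
<?php
$dtz = new DateTimeZone('Europe/Lisbon');

// ext/date -> ICU: the zone name is handed to TimeZone::createTimeZone.
$itz = IntlTimeZone::fromDateTimeZone($dtz);
echo $itz->getID(), "\n"; // Europe/Lisbon

// ICU -> ext/date: a DateTimeZone is built from the ICU identifier.
$back = $itz->toDateTimeZone();
echo $back->getName(), "\n"; // Europe/Lisbon
```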
IntlCalendar::fromDateTime(DateTime|string $dateTime[, string $locale])
intlcal_from_date_time(...)
If a string is given as the first argument, the method will try to
instantiate a new DateTime object and use that instead.
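A short sketch of both argument forms, with illustrative dates. Note that
IntlCalendar::getTime() reports milliseconds since the epoch, while
DateTime::getTimestamp() reports seconds:

```php
<?php
// From a DateTime object.
$dt  = new DateTime('2012-01-01 00:00:00', new DateTimeZone('Europe/Lisbon'));
$cal = IntlCalendar::fromDateTime($dt);
echo $cal->getTimeZone()->getID(), "\n"; // Europe/Lisbon

// From a string: a DateTime is instantiated internally and used instead.
$cal2 = IntlCalendar::fromDateTime('2012-01-01 00:00:00 UTC');
printf("%.0f\n", $cal2->getTime()); // 1325376000000
```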
MessageFormatter::parse and MessageFormatter::format (and their static
equivalents) no longer throw away sub-second precision in the
arguments.
It's already bad enough that in MessageFormatter and IntlDateFormatter we
use seconds since epoch instead of milliseconds since epoch, deviating
from the ICU date representations. But we don't need to throw away extra
precision when parsing dates; we can keep the seconds since epoch
convention and return non-integer doubles with only a small BC impact.
Note that we already could return doubles from MessageFormatter::parse if
the date was sufficiently in the past or in the future.
The check does not work reliably across ICU versions when named arguments
are added to the mix. For instance, for recent versions of ICU like 49,
a pattern like "{foo,number} {foo}" makes umsg_format_arg_count()
return 0, but for ICU 4.0 it returns 2.