Static builds of ext/imap have duplicate symbols, and so won't link on Windows.
To get around this issue, we simply disallow static building of the extension.
Use _mm_store_si128() instead of _mm_stream_si128(). This keeps the copied memory
in the data cache, which helps because the interpreter starts using this data right
away and does not need to go back to memory for it. _mm_stream* is intended for
stores that should bypass the cache in order to avoid cache pollution;
in our scenario it seems that preserving the data in cache has a positive impact.
Tests on WordPress 4.1 show a ~1% performance increase with fast_memcpy() in place
versus standard memcpy() when running php-cgi -T10000 wordpress/index.php.
I also updated the software prefetching of the target memory, but its contribution is
almost negligible: the prefetched address is used within a couple of cycles (at the
next iteration), while the data takes more than 100 cycles to arrive from memory.
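
The difference between the two store intrinsics can be illustrated with a minimal sketch. This is not PHP's fast_memcpy(); the helper name copy_cached() and its preconditions (16-byte-aligned destination, size a multiple of 16) are assumptions made for the example, and the code falls back to memcpy() when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper for illustration only.
 * Preconditions: dest is 16-byte aligned, size is a multiple of 16. */
static void copy_cached(void *dest, const void *src, size_t size)
{
    assert(((uintptr_t)dest % 16) == 0);
    assert(size % 16 == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    size_t i, n = size / 16;

    for (i = 0; i < n; i++) {
        /* Unaligned load, so src needs no particular alignment. */
        __m128i v = _mm_loadu_si128(s + i);
        /* Regular store: the written line stays in the data cache.
         * _mm_stream_si128(d + i, v) would be the non-temporal
         * variant that bypasses the cache instead. */
        _mm_store_si128(d + i, v);
    }
#else
    /* No SSE2 on this target: plain memcpy() as a fallback. */
    memcpy(dest, src, size);
#endif
}
```

Whether the cached store actually wins depends on how soon the copied data is read back, so any gain should be measured on the workload in question.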
Detecting overflow with the XER is slow, partially because we have to
clear it before use.
PHP already has a fast way of detecting overflow in its fallback
C implementation. Overflow only occurs if the signs of the two
operands are the same and the sign of the result is different.
Furthermore, leaving it in C allows gcc to schedule the instructions
better.
This is 9% faster on a POWER8 running a simple testcase:
<?php
function testcase($count = 100000000) {
    $x = 1;
    for ($i = 0; $i < $count; $i++) {
        $x = $x + 1;
        $x = $x + 1;
        $x = $x + 1;
        $x = $x + 1;
        $x = $x + 1;
    }
}
testcase();
?>
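
The sign comparison that this change relies on can be sketched in C as follows. The function name add_overflows() and its interface are hypothetical and are not taken from the PHP source; the sum is computed in unsigned arithmetic so that the check itself never triggers signed overflow.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper for illustration only.
 * Returns 1 if a + b overflows a long; otherwise stores the sum
 * in *result and returns 0. */
static int add_overflows(long a, long b, long *result)
{
    const unsigned long sign_bit = 1UL << (sizeof(long) * CHAR_BIT - 1);
    unsigned long ua = (unsigned long)a;
    unsigned long ub = (unsigned long)b;
    unsigned long sum = ua + ub;   /* unsigned addition wraps, no UB */

    assert(result != NULL);

    /* Overflow only if both operands have the same sign and the
     * sign of the result differs from it. */
    if (((ua ^ ub) & sign_bit) == 0 && ((ua ^ sum) & sign_bit) != 0) {
        return 1;
    }

    *result = a + b;               /* safe: known not to overflow */
    return 0;
}
```

Because the whole check is a handful of XOR and AND operations on values the compiler already has in registers, it can be scheduled freely alongside the surrounding code.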
Detecting overflow with the XER is slow, partially because we have to
clear it before use.
gcc does a better job of detecting overflow of an increment or decrement
than we can with inline assembly. It knows that an increment will only
overflow if the operand is one less than the overflow value, that is,
already at the maximum. This means we end up with a simple compare/branch.
Furthermore, leaving it in C allows gcc to schedule the instructions better.
This is 6% faster on a POWER8 running a simple testcase:
<?php
function testcase($count = 100000000) {
    $x = 1;
    for ($i = 0; $i < $count; $i++) {
        $x++;
        $x++;
        $x++;
        $x++;
        $x++;
    }
}
testcase();
?>
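
A minimal C sketch of the compare/branch form described in this change is shown below. The helper names are hypothetical and the interface is simplified: on overflow the value is left untouched and the caller is expected to handle that case.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only.
 * Each returns 1 if the operation would overflow (value unchanged),
 * otherwise updates *value and returns 0. */
static int increment_overflows(long *value)
{
    assert(value != NULL);
    /* An increment can only overflow from the maximum value,
     * so a single compare and branch is enough. */
    if (*value == LONG_MAX) {
        return 1;
    }
    (*value)++;
    return 0;
}

static int decrement_overflows(long *value)
{
    assert(value != NULL);
    /* Likewise, a decrement can only overflow from the minimum. */
    if (*value == LONG_MIN) {
        return 1;
    }
    (*value)--;
    return 0;
}
```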
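Detecting overflow with the XER is slow, partially because we have to
clear it before use. We can do better by using a trick where we compare
the high 64 bits of the 128-bit result with the low 64 bits arithmetically
shifted right by 63 bits. If the two differ, the product does not fit in
64 bits and the multiplication has overflowed.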
This is 7% faster on a POWER8 running a simple testcase:
<?php
function testcase($count = 100000000) {
    for ($i = 0; $i < $count; $i++) {
        $x = 1;
        $x = $x * 2;
        $x = $x * 2;
        $x = $x * 2;
        $x = $x * 2;
    }
}
testcase();
?>
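
The high/low comparison can be sketched in portable-ish C rather than POWER assembly. This sketch assumes a 64-bit gcc or clang target that provides the __int128 extension and performs arithmetic right shifts on signed values; the function name mul_overflows() is hypothetical and not taken from the PHP source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only.
 * Returns 1 if a * b overflows int64_t; otherwise stores the
 * product in *result and returns 0.
 * Assumes __int128 support and arithmetic shifts of signed values
 * (both hold for gcc and clang on 64-bit targets). */
static int mul_overflows(int64_t a, int64_t b, int64_t *result)
{
    __int128 product = (__int128)a * (__int128)b;
    int64_t high = (int64_t)(product >> 64);   /* upper 64 bits */
    int64_t low = (int64_t)product;            /* lower 64 bits */

    assert(result != NULL);

    /* The product fits in 64 bits only if the high half is the
     * sign extension of the low half, i.e. equals low >> 63. */
    if (high != (low >> 63)) {
        return 1;
    }

    *result = low;
    return 0;
}
```

For example, INT64_MAX * 2 yields a high half of 0 while the low half shifted right by 63 is -1, so the mismatch flags the overflow.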