Perl 5.38.2 string interpolation bug!?

Max via perl5-porters

2024-03-28 11:05:22 UTC

Dear Perl 5 Porters,

I am updating from Perl 5.26.1 to 5.38.2. These versions behave
differently when interpolating strings. While 5.26.1 evaluates each
@{[ $Var++ ]} from left to right, one after another, 5.38.2 seems to
evaluate every @{[ $Var++ ]} first and construct the string afterwards,
returning a wrong value for the first @{[ $Var++ ]} value at that point
of interpolation.

############### TestInterpolateString.pl ####################
$Var = 0;
$String = "Interpolated string: @{[ $Var++ ]} $Var @{[ $Var++ ]} $Var";
print $String . "\n";
1;
######################################################
> /opt/perl/5.26.1/bin/perl TestInterpolateString.pl
Interpolated string: 0 1 1 2
> /opt/perl/5.38.2/bin/perl TestInterpolateString.pl
Interpolated string: 0 2 1 2
> /opt/perl/5.34.0/bin/perl TestInterpolateString.pl
Interpolated string: 0 2 1 2
######################################################

Would you agree that this is a bug in 5.38.2 or is the change intended?
I do prefer the 5.26.1 behavior, as it is more logical and allows
consecutive changes within the string by each @{[ ]} expression. Do you
think that in future Perl versions the behavior will fall back to that
of Perl 5.26.1?

Thank you in advance.

Sincerely,
Max.

James E Keenan

2024-03-28 17:01:33 UTC

The behavior you mentioned changed between 5.26 and 5.28 in the
following commit, which introduced the multiconcat op:

#####
$ gitshowf e839e6ed99c6b25aee589f56bb58de2f8fa00f41
commit e839e6ed99c6b25aee589f56bb58de2f8fa00f41
Author: David Mitchell <***@iabyn.nospamdeletethisbit.com>
AuthorDate: Tue Aug 8 18:42:14 2017 +0100
Commit: David Mitchell <***@iabyn.nospamdeletethisbit.com>
CommitDate: Tue Oct 31 15:31:26 2017 +0000

Add OP_MULTICONCAT op

Allow multiple OP_CONCAT, OP_CONST ops, plus optionally an OP_SASSIGN
or OP_STRINGIFY, to be combined into a single OP_MULTICONCAT op, which
can make things a *lot* faster: 4x or more.

In more detail: it will optimise into a single OP_MULTICONCAT, most
expressions of the form

LHS RHS

where LHS is one of

(empty)
my $lexical =
$lexical =
$lexical .=
expression =
expression .=

and RHS is one of

(A . B . C . ...)             where A,B,C etc are expressions and/or
                              string constants

"aAbBc..."                    where a,A,b,B etc are expressions and/or
                              string constants

sprintf "..%s..%s..", A,B,..  where the format is a constant string
                              containing only '%s' and '%%' elements,
                              and A,B, etc are scalar expressions (so
                              only a fixed, compile-time-known number of
                              args: no arrays or list context function
                              calls etc)

It doesn't optimise other forms, such as

($a . $b) . ($c. $d)

((($a .= $b) .= $c) .= $d);

(although sub-parts of those expressions might be converted to an
OP_MULTICONCAT). This is partly because it would be hard to maintain
the correct ordering of tie or overload calls.

The compiler uses heuristics to determine when to convert: in general,
expressions involving a single OP_CONCAT aren't converted, unless some
other saving can be made, for example if an OP_CONST can be eliminated,
or in the presence of 'my $x = .. ' which OP_MULTICONCAT can apply
OPpTARGET_MY to, but OP_CONST can't.

The multiconcat op is of type UNOP_AUX, with the op_aux structure
directly holding a pointer to a single constant char* string plus a
list of segment lengths. So for

"a=$a b=$b\n";

the constant string is "a= b=\n", and the segment lengths are (2,3,1).
If the constant string has different non-utf8 and utf8 representations
(such as "\x80") then both variants are pre-computed and stored in the
aux struct, along with two sets of segment lengths.

For all the above LHS types, any SASSIGN op is optimised away. For a
LHS of '$lex=', '$lex.=' or 'my $lex=', the PADSV is optimised away too.

For example where $a and $b are lexical vars, this statement:

my $c = "a=$a, b=$b\n";

formerly compiled to

const[PV "a="] s
padsv[$a:1,3] s
concat[t4] sK/2
const[PV ", b="] s
concat[t5] sKS/2
padsv[$b:1,3] s
concat[t6] sKS/2
const[PV "\n"] s
concat[t7] sKS/2
padsv[$c:2,3] sRM*/LVINTRO
sassign vKS/2

and now compiles to:

padsv[$a:1,3] s
padsv[$b:1,3] s
multiconcat("a=, b=\n",2,4,1)[$c:2,3] vK/LVINTRO,TARGMY,STRINGIFY

In terms of how much faster it is, this code:

my $a = "the quick brown fox jumps over the lazy dog";
my $b = "to be, or not to be; sorry, what was the question again?";

for my $i (1..10_000_000) {
    my $c = "a=$a, b=$b\n";
}

runs 2.7 times faster, and if you throw utf8 mixtures in it gets even
better. This loop runs 4 times faster:

my $s;
my $a = "ab\x{100}cde";
my $b = "fghij";
my $c = "\x{101}klmn";

for my $i (1..10_000_000) {
    $s = "\x{100}wxyz";
    $s .= "foo=$a bar=$b baz=$c";
}

The main ways in which OP_MULTICONCAT gains its speed are:

* any OP_CONSTs are eliminated, and the constant bits (already in the
  right encoding) are copied directly from the constant string attached
  to the op's aux structure.

* It optimises away any SASSIGN op, and possibly a PADSV op on the LHS,
  in all cases; OP_CONCAT only did this in very limited circumstances.

* Because it has a holistic view of the entire concatenation expression,
  it can do the whole thing in one efficient go, rather than creating
  and copying intermediate results. pp_multiconcat() goes to considerable
  efforts to avoid inefficiencies. For example it will only SvGROW() the
  target once, and to the exact size needed, no matter what mix of utf8
  and non-utf8 appear on the LHS and RHS. It never allocates any
  temporary SVs except possibly in the case of tie or overloading.

* It does all its own appending and utf8 handling rather than calling
  out to functions like sv_catsv().

* It's very good at handling the LHS appearing on the RHS; for example in

      $x = "abcd";
      $x = "-$x-$x-";

  It will do roughly the equivalent of the following (where targ is $x);

      SvPV_force(targ);
      SvGROW(targ, 11);
      p = SvPVX(targ);
      Move(p, p+1, 4, char);
      Copy("-", p, 1, char);
      Copy("-", p+5, 1, char);
      Copy(p+1, p+6, 4, char);
      Copy("-", p+10, 1, char);
      SvCUR(targ) = 11;
      p[11] = '\0';

  Formerly, pp_concat would have used multiple PADTMPs or temporary SVs
  to handle situations like that.

The code is quite big; both S_maybe_multiconcat() and pp_multiconcat()
(the main compile-time and runtime parts of the implementation) are over
700 lines each. It turns out that when you combine multiple ops, the
number of edge cases grows exponentially ;-)
#####
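
The "constant string plus segment lengths" layout described in the commit message can be pictured with a small pure-Perl sketch. This is only an illustration of the idea, not the C implementation: the names @const_parts, @seg_lengths and rebuild() are hypothetical and do not exist in the Perl core. The point relevant to this thread is that every argument is already evaluated before the pieces are joined.

```perl
use strict;
use warnings;

# Hypothetical illustration only: this mimics the data layout described in
# the commit message, not the actual C code of pp_multiconcat().
my @const_parts  = ("a=", " b=", "\n");              # constant pieces of "a=$a b=$b\n"
my $const_string = join '', @const_parts;            # "a= b=\n"
my @seg_lengths  = map { length $_ } @const_parts;   # (2, 3, 1)

# Rebuild the final string from the constant string, the segment lengths,
# and a list of argument values that have already been evaluated.
sub rebuild {
    my ($const, $lengths, @args) = @_;
    my ($out, $pos) = ('', 0);
    for my $i (0 .. $#$lengths) {
        $out .= substr($const, $pos, $lengths->[$i]);   # next constant segment
        $pos += $lengths->[$i];
        $out .= $args[$i] if $i < @args;                # then the next argument
    }
    return $out;
}

my $result = rebuild($const_string, \@seg_lengths, 'X', 'Y');
print $result;    # prints "a=X b=Y" followed by a newline
```

Because rebuild() only sees finished argument values, any side effects in the argument expressions have all happened by the time the string is assembled.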

We certainly haven't had this described as a bug until now, but I'll let
Dave Mitchell and others comment further.

Dave Mitchell

2024-03-28 18:03:38 UTC

Post by Max via perl5-porters
Would you agree that this is a bug in 5.38.2 or is the change intended?

As has been pointed out, this change is a side-effect of an optimisation,
which has changed undefined behaviour.

String interpolation is just syntactic sugar for string concatenation.
This simpler code exhibits a similar change in behaviour:

$s = $i++ . $i . $i++ . $i;

where the value of $s changes from 0112 to 0212.

But, look at a similar list assignment:

@a = ($i++, $i, $i++, $i);

Here, the contents of @a were, and still are, (0, 2, 1, 2). Given that
inconsistency, it would be hard to argue that one behaviour or the other
is the "correct" one.
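
A minimal sketch for trying both forms side by side, assuming a lexical $i. The list result is stable because $i++ yields a copy of the old value while a plain $i puts the variable itself on the list, so it is only read once the assignment happens. The concatenation result depends on the Perl version, so no single output is claimed for it here.

```perl
use strict;
use warnings;

# List assignment: each $i++ yields a copy of the old value, each plain $i
# is the variable itself and is read only when the list is assigned.
my $i = 0;
my @a = ($i++, $i, $i++, $i);
print "@a\n";    # 0 2 1 2

# Concatenation: undefined evaluation order. According to this thread the
# result is 0112 on 5.26 and 0212 from 5.28 onwards.
$i = 0;
my $s = $i++ . $i . $i++ . $i;
print "$s\n";
```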

Post by Max via perl5-porters
Do you think that in future Perl versions the behavior will fall back to
that of Perl 5.26.1?

Practically zero chance of the behaviour reverting.

--
If life gives you lemons, you'll probably develop a citric acid allergy.

Paul "LeoNerd" Evans

2024-03-28 18:13:19 UTC

On Thu, 28 Mar 2024 18:03:38 +0000

Post by Dave Mitchell

Post by Max via perl5-porters
Would you agree that this is a bug in 5.38.2 or is the change intended?

As has been pointed out, this change is a side-effect of an
optimisation, which has changed undefined behaviour.

String interpolation is just syntactic sugar for string concatenation.
This simpler code exhibits a similar change in behaviour:

$s = $i++ . $i . $i++ . $i;

where the value of $s changes from 0112 to 0212.

But, look at a similar list assignment:

@a = ($i++, $i, $i++, $i);

Here, the contents of @a were, and still are, (0, 2, 1, 2). Given
that inconsistency, it would be hard to argue that one behaviour or
the other is the "correct" one.

Agree; it's very similar to other examples from other languages, such
as the classic C

int x = 0;
printf("%d %d\n", x++, x++);

Afterwards, x is 2, but we can't say for sure what numbers get printed.

Post by Dave Mitchell

Post by Max via perl5-porters
Do you think that in future Perl versions the behavior will fall
back to that of Perl 5.26.1?

Practically zero chance of the behaviour reverting.

Agree.

--
Paul "LeoNerd" Evans

***@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Darren Duncan

2024-03-28 22:36:01 UTC

Given that you are relying on undefined behavior, and also that your example of
incrementing the same variable twice as you did in the same expression seems
dubious logically, your best solution is to separate this out into multiple
statements, where you first assign the result of the 2 increments to 2 variables
that indicate the separate meaning of each separate increment, and then catenate
the final string from those afterwards where no incrementing is in the same
statement as the catenation, and then you should have clearly defined and
predictable behavior. -- Darren Duncan
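
One way the suggestion above might look in practice, as a sketch rather than the only possible arrangement. The variable names $first, $between, $second and $after are made up for illustration. Each increment gets its own statement, and the intermediate value of the counter is captured explicitly, so the final interpolation has no side effects.

```perl
use strict;
use warnings;

my $var = 0;

# One increment per statement: the order of evaluation is now explicit.
my $first   = $var++;    # 0, $var becomes 1
my $between = $var;      # 1
my $second  = $var++;    # 1, $var becomes 2
my $after   = $var;      # 2

# No increments inside the interpolation itself.
my $string = "Interpolated string: $first $between $second $after";
print "$string\n";       # Interpolated string: 0 1 1 2
```

Written this way, the output matches the 5.26.1 result from the original post regardless of how the interpreter optimises the concatenation.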
