Discussion: observations on arenas and fragmentation
Eric Wong
2024-02-22 13:20:56 UTC
Hello, I'm using a hacky LD_PRELOAD malloc wrapper[1] on Perl
daemons and noticing some 4080 byte allocations made late in
process lifetime lingering forever(?).

I'm not an expert in Perl internals, but my reading of sv.c
doesn't reveal arenas getting freed outside of process teardown.

With long-lived processes, permanent allocations made well after
a process enters steady state tend to cause fragmentation.
(steady state being when it's in a main loop and expected to
mainly do short-lived allocations)

For C10K network servers, a sudden burst of traffic to certain
endpoints well after startup seems likely to trigger this
fragmentation.

Unfortunately, C stdlib malloc has no way to declare the
expected lifetime of an allocation.

Perl itself could probably use anonymous mmap for arenas on
platforms where it's supported. mmap would also allow using
page-sized arenas instead of the awkward 4080 size.

Unfortunately, mmap would be slower for short-lived processes.
Figuring out a way to release arenas during runtime could be
done, but would also add more complexity and might hurt
performance.

A hybrid approach that switches to mmap in long-lived processes
might work, but Perl would need a way to know when it has entered
the steady state of a long-lived process.

A possible mitigation for long-lived Perl code:

# attempt to create many arenas at startup
# (where PREALLOC_NR is a really big number in env)
BEGIN {
    if (my $nr = $ENV{PREALLOC_NR}) {
        my @tmp = map { $_ } (0..$nr);
    }
}

There's also a lot of other stuff (regexps, scratchpads, `state')
which can create late permanent allocations. I'm not sure what
to do about those, yet.

Maybe it's just easier to restart a process if it grows to a
certain size. *shrug*
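For what it's worth, a minimal sketch of that size check (Linux-only,
reading /proc/self/status; the ~300MB threshold is an arbitrary example):

# read our own RSS and exit so a supervisor can respawn us
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return 0;
    while (<$fh>) { return $1 if /^VmRSS:\s+(\d+)\s+kB/ }
    return 0;
}
exit(0) if rss_kb() > 300_000;  # ~300 MB, tune per workload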


[1] git clone https://80x24.org/mwrap-perl.git
PS: I used mwrap-perl to find an Encode leak (RT #139622)
some years back. *Sigh* I can't view rt.cpan.org anymore
due to JS: <https://rt.cpan.org/Ticket/Display.html?id=139622>
"Ruud H.G. van Tol" via perl5-porters
2024-02-22 15:07:55 UTC
Post by Eric Wong
[...] Unfortunately, mmap would be slower for short-lived processes.
Figuring out a way to release arenas during runtime could be
done, but would also add more complexity and might hurt
performance.
A hybrid approach that switches to mmap in long-lived processes
might work, but Perl would need a way to know when it has entered
the steady state of a long-lived process.
Maybe add a command-line switch for it?

Looks like "ABbGHJjKkLOoNQqRrYyZz" are still available,
maybe pick L for Long. Or some environment variable.

-- Ruud
G.W. Haywood
2024-02-22 15:18:14 UTC
Hi there,
Post by Eric Wong
Hello, I'm using a hacky LD_PRELOAD malloc wrapper[1] on Perl
daemons and noticing some 4080 byte allocations made late in
process lifetime lingering forever(?).
...
...
Maybe it's just easier to restart a process if it grows to a
certain size. *shrug*
This strikes a chord. Maybe a nerve.

Routinely I run fifty or so Perl milter daemons 24/7. To avoid any
potential problems like memory leaks [*] I get the daemons to die and
respawn after they've processed some number of messages. At the
moment it's either a hundred or a thousand for each daemon. So far,
everything seems to be OK like that.

[*] Not that I can say definitively that there are any leaks, but the
daemons do tend to use anywhere between 20 and 50 MBytes each.
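The pattern, roughly (a sketch, not my production milter code;
$MAX_MSGS and handle_one_message() are stand-ins):

use strict;
use warnings;

my $MAX_MSGS = 1000;           # or 100, per daemon
sub handle_one_message { }     # stand-in for the real milter work

while (1) {                    # supervisor: respawn a worker forever
    defined(my $pid = fork) or die "fork: $!";
    if ($pid == 0) {           # child: serve a bounded batch
        handle_one_message() for 1 .. $MAX_MSGS;
        exit 0;                # any heap growth dies with the child
    }
    waitpid($pid, 0);
}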

If you need memory usage stats I can probably furnish some going back
several years and if (as I think may be likely) there's nothing useful
to you in the data I have lying around, I'd be happy to put some code
in the milters to exercise anything you'd like to try tweaking. Maybe
I could even run a non-production Perl but I wouldn't want to take too
many risks with the mail flow so that would need a bit of thought.

At the moment I'm using Perl 5.32.1, patched all to hell by Debian:

8<----------------------------------------------------------------------
$ perl -v

This is perl 5, version 32, subversion 1 (v5.32.1) built for i686-linux-gnu-thread-multi-64int
(with 47 registered patches, see perl -V for more detail)
...
8<----------------------------------------------------------------------
--
73,
Ged.
Eric Wong
2024-02-23 18:44:46 UTC
Post by G.W. Haywood
If you need memory usage stats I can probably furnish some going back
several years and if (as I think may be likely) there's nothing useful
to you in the data I have lying around, I'd be happy to put some code
in the milters to exercise anything you'd like to try tweaking. Maybe
I could even run a non-production Perl but I wouldn't want to take too
many risks with the mail flow so that would need a bit of thought.
Outside of data gathered by the mwrap-perl LD_PRELOAD or
similar malloc tracers, I'm not sure the historical data you
have is of much use.

The trivial BEGIN{} snippet in my original mail seems to have
helped (with PREALLOC_NR=500000 in my case). Will wait a few
more days to be sure... (but ~2G to ~230M seems nice)

A few other things I've done in a codebase I maintain:

* avoid lazy-loading (including Encode::* that's loaded lazily)
* avoid lazy setup/initialization in general
* routinely expire late/lazy DB connections (SQLite caching)
* build giant strings via PerlIO::scalar (not 100% sure about this; sketch below)
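
For reference, a sketch of the PerlIO::scalar trick (the chunks here
are placeholders):

# build one big string through an in-memory filehandle instead
# of repeated .= on a growing scalar
my $buf = '';
open my $fh, '>', \$buf or die "open: $!";
print $fh "chunk $_\n" for 1 .. 1_000_000;
close $fh;
# $buf now holds the assembled string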

Something I might also do:

* forcibly exercise cold codepaths at startup
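Something like this, maybe (a sketch; the warmed-up paths are just
examples):

# force one-time lazy allocations to happen at startup so they
# land near other long-lived memory instead of fragmenting later
BEGIN {
    require Encode;
    Encode::decode('UTF-8', "\xc3\xa9");  # trigger lazy Encode init
    'warmup' =~ /(w)(a)(r)m/;             # touch regexp capture paths
}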
Post by G.W. Haywood
At the moment I'm using Perl 5.32.1, patched all to hell by Debian:
Same here. Even worse for me, a major user of my project
uses 5.16.3 from RHEL/CentOS 7 so I have to do weird stuff like
avoiding ref() on blessed references to avoid leaks :<
G.W. Haywood
2024-02-24 12:55:18 UTC
Hi there,
Post by Eric Wong
... I've already gone through all the code+modules I depend on to
chase down cycles and other common sources of leaks. All that
remains is what Perl does outside a user's direct control.
Then you've been busy! :)
Post by Eric Wong
A moving/compacting GC (not Boehm) would help with
fragmentation. Retrofitting that into an existing C codebase
(not to mention hundreds of XS modules) would be a monumental
effort and not feasible.
As you say, it's probably infeasible to retrofit into the existing
codebase, but maybe something could be worked into the compiler or
maybe the libraries?

I've often thought that there's room for a better malloc(). In fact
something like 27 years ago I wrote one, which I'm still using today.
It writes guard byte patterns around every malloc()ed chunk, and at
any operation which accesses the guarded memory it checks the guards.

If a guard byte gets changed it calls a panic (and I'd immediately get
a telephone call) which hasn't happened for more than twenty years of
running this code all day every working day at a number of businesses.

My guarding is to protect the integrity of the data, not for garbage
collection, but I'm sure that it could be extended for other purposes
including garbage collection and especially security.

I think that I can see ways of using the guard byte pattern to flag
freed memory, and thus let you move things around in memory in ways
transparent to the calling processes so that you could then collect
garbage. You'd need something like a linked list structure I suppose
but then you *could* have a sort of garbage-collected C. Digression:
maybe this might find broken code that's already Out There.

I'm comfortable with the performance hit - when this was written, a
100MHz 486 was an impressively fast CPU - but not everyone will be, so
obviously this would need to be very optional.

It would blow me away if nobody else has done anything like this, but
I haven't researched it.
--
73,
Ged.
Dave Mitchell
2024-02-23 11:14:19 UTC
Post by Eric Wong
I'm not an expert in Perl internals, but my reading of sv.c
doesn't reveal arenas getting freed outside of process teardown.
Correct.
Post by Eric Wong
With long-lived processes, permanent allocations made well after
a process enters steady state tend to cause fragmentation.
(steady state being when it's in a main loop and expected to
mainly do short-lived allocations)
I don't quite see how freeing arenas is going to help much with
fragmentation.

First, in order to be a candidate for freeing, the arena would have to
become completely empty - just one long-lived SV head or body allocated
late on would stop that.

And if arenas do become empty, what's wrong with holding on to them for
future use? Any long-lived process is going to need to allocate and
release some SVs each time it performs any sort of activity, so why not
hang onto that existing arena rather than freeing/unmapping it and then
having to almost immediately allocate a new one in the next burst of
activity?

I suppose it could be argued that freeing arenas would be useful for a
process that has a huge start-up footprint which it no longer needs once
reaching a steady state.
--
In my day, we used to edit the inodes by hand. With magnets.
Eric Wong
2024-02-23 18:28:31 UTC
Post by Dave Mitchell
Post by Eric Wong
With long-lived processes, permanent allocations made well after
a process enters steady state tend to cause fragmentation.
(steady state being when it's in a main loop and expected to
mainly do short-lived allocations)
I don't quite see how freeing arenas is going to help much with
fragmentation.
Immortal + unused arenas prevent consolidation of free space into
larger chunks by the malloc implementation. So if a short-lived
~4k chunk gets allocated and it neighbors an immortal ~4k arena
chunk, the space used by the short-lived chunk cannot later be
consolidated and reused if a larger (e.g. ~8k) allocation is
needed later on.

When a malloc implementation can't consolidate free space to
satisfy a larger allocation, it must request more memory from
the OS.

I'm going off dlmalloc behavior since that's the basis of glibc
malloc which behaves the same way:
https://gee.cs.oswego.edu/dl/html/malloc.html

Using gigantic arenas (I think >=64M for glibc) would force
mmap use and avoid the problem; but that's not suitable for
short-lived scripts.

Perl arenas are just part of the problem I observe. AFAIK
there's also stuff internal to Perl (magic, pads, etc.), stuff
pinned to `state' variables, per-library/application caches,
etc.
Post by Dave Mitchell
First, in order to be a candidate for freeing, the arena would have to
become completely empty - just one long-lived SV head or body allocated
late on would stop that.
Yes, that's a related and known problem; especially with cold
code paths and internal memoization or long-lived caches used by
Perl. I've already gotten rid of most lazy/late memoization
in a codebase I maintain.
Post by Dave Mitchell
And if arenas do become empty, what's wrong with holding on to them for
future use? Any long-lived process is going to need to allocate and
release some SVs each time it performs any sort of activity, so why not
hang onto that existing arena rather than freeing/unmapping it and then
having to almost immediately allocate a new one in the next burst of
activity?
As mentioned above, holding onto them prevents coalescing by
leaving holes in the free space. This is worse when short-lived
allocation sizes are variable and unpredictable, and the worst
case (largest size) happens late.

Then the allocator is forced to get new space (via sbrk||mmap);
and then it can never release that new space because some
of it eventually got used by an arena.

I'm not too familiar with what Perl does internally. It seems
stuff like building big short-lived strings via .= and some
regexps will still trigger long-lived allocations. I couldn't
find too much in perlguts about it.
Dave Mitchell
2024-02-23 18:59:35 UTC
Post by Eric Wong
Post by Dave Mitchell
I don't quite see how freeing arenas is going to help much with
fragmentation.
Immortal + unused arenas prevent consolidation of free space into
larger chunks by the malloc implementation.
Yeah, I understand the general behaviour of a decent malloc() library.

I'm just failing to understand how it applies to perl.

A typical string SV consists of 3 items: a fixed SV head, which points to
one of about 16 types of SV body (which are different sizes based on
whether the SV holds an int, a double, a string, a reference, an array,
or whatever); then the body of a string SV points to a malloc()ed string
buffer, which is likely over-allocated to be more than the current length
of the string.

When a string SV is finished with, e.g. after the pop in:

push @a, "....some string...";
...
pop @a;

then the string buffer is free()ed, while the SV head is returned to the
pool of spare SV heads (a linked list of free heads meandering in a random
order through all the allocated head arenas), while the SV's body is
returned to one of the 16 body pools.

If a string is grown, e.g. via $x .= "....", then if there is spare
headroom in the allocated string buffer, it is used; otherwise the string
buffer is realloc()ed, with the new size given by some formula
involving the needed length plus a certain extra proportion for future
expansion.

Under some circumstances a string buffer may be shared among multiple SVs
(COW).
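
A quick way to watch all of this is core Devel::Peek (a sketch; exact
output varies by version):

use Devel::Peek;
my $x = "some string";
$x .= "...." for 1 .. 4;  # grow it, likely forcing a realloc
Dump($x);                 # CUR = current length, LEN = allocated size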

Your proposal (IIUC) is that for the SV head and body arenas, if the
releasing of an SV head or body causes the particular 4K (or whatever)
arena to become completely empty (all the heads/bodies in it have been
freed), then we should free() that 4K block?

As I said earlier, the two problems with that are that, firstly, it is
likely rare that an arena will ever become completely empty. For example,
in this hypothetical code:

my @timestamps;
while (1) {
    my @temp;
    # do some processing
    for (1..1000) {
        push @temp, ....;
    }
    ...
    push @timestamps, time;
}

a thousand temporary SVs are allocated, then one long-lived SV, then the
1000 SVs are freed. This will leave the SV arena nearly, but not
completely, empty. So it can't be free()ed.

Secondly, even if arenas could be freed, so what? You free the 4K block.
Perl will almost immediately require a new SV head or body, because
that's what perl does - just about all internal operations are based around
SVs. So if there aren't any spare heads, it will soon trigger a fresh 4K
arena malloc(). So nothing's been consolidated, you've just wasted time
calling free() and then malloc() on a same-sized chunk.
--
Nothing ventured, nothing lost.
Eric Wong
2024-02-23 21:59:43 UTC
Post by Dave Mitchell
Your proposal (IIUC) is that for the SV head and body arenas, if the
releasing of an SV head or body causes the particular 4K (or whatever)
arena to become completely empty (all the heads/bodies in it have been
freed), then we should free() that 4K block?
Yes, at least that is one possible way to go about this...
Using anonymous mmap for arenas and/or forcing a larger arena
size would be another. (glibc actually has 32MB as the max mmap
threshold on 64-bit)
Post by Dave Mitchell
As I said earlier, the two problems with that are that, firstly, it is
likely rare that an arena will ever become completely empty. For example,
in this hypothetical code:

my @timestamps;
while (1) {
    my @temp;
    # do some processing
    for (1..1000) {
        push @temp, ....;
    }
    ...
    push @timestamps, time;
}

a thousand temporary SVs are allocated, then one long-lived SV, then the
1000 SVs are freed. This will leave the SV arena nearly, but not
completely, empty. So it can't be free()ed.
Right, having a lingering allocation in a larger chunk is a bad
situation for all allocators. However (IIUC), each 4080-byte
arena only holds 169 (or 170?) SVs on 64-bit systems. Thus some
arenas would get freed in your above example (but that can be bad
as you say below).
Post by Dave Mitchell
Secondly, even if arenas could be freed, so what? You free the 4K block.
Perl will almost immediately require a new SV head or body, because
that's what perl does - just about all internal operations are based around
SVs. So if there aren't any spare heads, it will soon trigger a fresh 4K
arena malloc(). So nothing's been consolidated, you've just wasted time
calling free() and then malloc() on a same-sized chunk.
My observation is that allocation spikes are freak events, and
enough can be freed afterwards to discard unnecessary arenas.

You're right that an immediate free+malloc is almost always a
waste. And 4080 bytes is a tiny chunk, so it's easy to trigger
multiple wasteful free+malloc sequences with chunks that small.

The only possible benefit of such a wasteful free+malloc
sequence is it ends up migrating a (semi-)permanent allocation
to a more favorable location adjacent to other (semi-)permanent
allocations and farther away from the "wilderness" in dlmalloc
nomenclature.

Thus using anonymous mmap (and omitting munmap at runtime) might
be the best way to go; and that probably doesn't involve Perl
calling mmap directly at all:

Now, I'm thinking exposing PERL_ARENA_SIZE as a runtime env knob
would be the best way to avoid fragmentation.

I don't expect most users would want to recompile their own Perl
or maintain multiple Perl installations. Having an easily tunable
PERL_ARENA_SIZE would allow users to force a size which triggers
mmap on their platform.

ARENAS_PER_SET would have to be determined at runtime, though...
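
For context, PERL_ARENA_SIZE is compile-time only today; a build that
overrode it would show the define in its recorded compiler flags
(a quick check):

use Config;
print $Config{ccflags}, "\n";  # look for -DPERL_ARENA_SIZE=...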


Sidenote: using small 4096-byte arenas with mmap would be nasty
since it can hit the low default of the vm.max_map_count sysctl on
Linux. Going the mmap route would force the use of bigger arenas.
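
That limit is easy to check (Linux-only sketch; 65530 is the usual
default):

open my $fh, '<', '/proc/sys/vm/max_map_count' or die "open: $!";
chomp(my $max = <$fh>);
print "vm.max_map_count=$max\n";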
Eric Wong
2024-04-01 20:16:22 UTC
Using jemalloc as an LD_PRELOAD on GNU/Linux seems to be a good
solution in my testing the past few weeks. I theorize the
jemalloc idea of sacrificing up to 20% space up front to reduce
granularity pays off for long-lived daemons dealing with many
variable-sized strings. (jemalloc(3) manpage has more details)

I'm testing the size class idea on glibc, too, because
recommending users use an LD_PRELOAD or recompile Perl isn't
workable (getting them to run something written in Perl is
already a monumental task :<).

Loading...