[metapost] More on memory leaks

mskala at ansuz.sooke.bc.ca mskala at ansuz.sooke.bc.ca
Wed Oct 5 19:08:40 CEST 2011

As described in my earlier message, the current SVN version of MetaPost
contains serious memory leaks which I'd like to fix so that I can work
on the serial number overflow problem, which is a showstopper for one of
my projects.  (See
http://sethgodin.typepad.com/seths_blog/2005/03/dont_shave_that.html )

I did binary searches on the versions in the repository and found that
revision 1508 is the first one with the memory leaks; then somewhere
between 1538 and 1591, the leaks got a fair bit worse.  From examining the
diffs between the revisions, it's clear that the leaks are associated with
the dynamically allocated mp_number_data structure.  Every time a number
is created throughout MetaPost, one of these structures gets dynamically
allocated.  Very little effort is made to ever de-allocate them.  On my
test program (which is just a few nested loops with an assignment inside)
MetaPost leaks about 25 mp_number_data structures per iteration, filling
the 32-bit memory space in about 4 million iterations.  I'm really quite
surprised that this code ever made it to check-in, let alone remaining
unfixed for more than half a year.

I instrumented the code to write a line to standard output every time it
allocates or de-allocates an mp_number, stating the address of the object
and the line number where the (de)allocation occurs.  Then I used Perl to
match allocations with deallocations and show me the line numbers on which
there were objects allocated that never got deallocated, and how many such
objects on each line.  That gave me some idea of where the leaked memory
was being allocated, and the relative speeds of the different leaks,
although it still wasn't clear where it ought to be deallocated.

I've tried to go through and make it de-allocate the mp_number_data
structures whenever it de-allocates the parent mp_node_data structures.
That's not easy because not every mp_node contains an mp_number (and some,
such as colors and transforms, seem to contain more than one).  The entire
system seems to be trying to fake OOP in the non-OOP C language.  The
mp_free_node function seems to be the reasonable place to do the
deallocation, since it deallocates the parent mp_node_data; but
mp_free_node is doing fake polymorphism and it has to know how to
deallocate many different kinds of mp_nodes, including ones that don't
contain mp_numbers.  So it can't just unconditionally deallocate the
"data" field because that might not exist.

I experimented with examining the type field of the node to try to
determine whether there is a number that needs to be deallocated.  That
seems to be a losing game because nodes change types during their
lifetimes and I can't find any simple summary of which type codes
correspond to having, or not having, dynamically allocated number
structures inside.  For instance, many nodes containing leaked numbers
are allocated by mp_value_node_type and given (at that time) the type
code mp_value_node_type; but by the time they are freed, they have been
assigned some other type, so checking for mp_value_node_type in
mp_free_node and de-allocating the number field whenever I see that, has
little or no effect on the memory leaks.  Type codes that correspond to a
node containing a number seem to *mostly* but not *entirely* correspond to
contiguous ranges of type code numbers, and it became clear to me that as
well as being a lot of work, any solution based on looking for magic type
codes in mp_free_node would be extremely fragile (as type codes are
added and rearranged) and not a good way to do it.

The mp_free_node function does get one other clue to what kind of
structure it's de-allocating, because it is passed an argument telling it
the size of the structure.  I briefly considered checking that size to see
if it happens to be the size of a node type containing a dynamic number
(there are many fewer distinct sizes than distinct types); rejected that
idea because it's quite possible there could be two types of nodes
differing in this respect but happening to have the same size.  That gave
me a clue, though, on a better solution: mp_free_node is generally called
through a wrapper like mp_free_value_node, which *does* know whether the
node is a node of the kind that contains a dynamically allocated number.
So I added code to all those wrapper functions (and all the relevant
unwrapped calls to mp_free_node directly) to de-allocate the number(s)
before calling mp_free_node.

That change reduced the leaked memory by about 60%.  Almost all the
remaining major leaks were in the parsing code, where many functions
allocate mp_node_data structures on the stack instead of with get_node,
and then manually call new_number.  Then they correspondingly don't call
mp_free_node, and they don't call free_number.  I tried to add free_number
calls, but that was very difficult because many of these functions also
return at multiple points, and their implementations are spread across the
file in multiple chunks instead of all being in one place, as a result of
"literate programming."  It's not generally possible to look at a function
and know all the places it might return when the code is "literate."  I am
not enthusiastic about "literate programming" and this is one reason why.

The one remaining major leak, not falling into the previous two
categories, is associated with symbols_entry nodes.  The new_symbols_entry
function calls new_number, but the delete_symbols_entry function can't
call free_number because it does not have the "mp" argument that's passed
to nearly all other functions in the program, and that argument is needed
by free_number.  I am not sure that I can just add an mp argument to
delete_symbols_entry, because delete_symbols_entry is a callback function
passed into the AVL tree code, and it's not clear that I can get away with
changing the calling convention for such a function short of changing all
the AVL tree code.  There is also a rather disturbing comment in the code
suggesting that delete_symbols may never actually get called until the end
of the run anyway, so that there would be a leak in practice even if it
gets cleaned up before the actual termination of the program. At first
glance the problem of passing an extra argument into the callback is
crying out for a programming language construct called a "closure," which
doesn't exist in C, but the deeper issue is that the code is trying to do
OOP in a non-OOP language, and suffering the consequences of doing OOP
badly.  I decided, just for the moment, not to try fixing this point, and
to hope that stemming the other leaks would be enough to let me look at
the actual problem I want to work on.

With an attempt made at calling free_number on all the numbers inside
stack-allocated mp_node_data structures in the parsing code, my test
program now runs long enough to crash for reasons other than out of
memory.  Unfortunately, that's still only about four million iterations.
It crashes with bizarre, clearly incorrect errors:  complaining that
numbers are greater than the maximum permitted scaled value of about 4000,
when I'm sure my code did not actually compute any such numbers, and/or
complaining about syntax errors and displaying corrupted versions of my
program source.  My guess is that I touched something I shouldn't have in
the "free stack-allocated number structures" change, de-allocating
something that was still being used, so that it's stomping on memory and
putting incorrect data into variables and/or the program text itself.
(The facts that some of this corruption is of the program text, and
occurred after I changed the parsing code, seems to support this guess.)
Eventually the corruption piles up to the point that it crashes.
Another possibility, though I don't think this is so likely, is that I'm
finally looking at the actual consequences of the serial-number bug.
(Unlikely because revisions prior to 1508 can get farther without this
problem showing up, and the serial-number bug should show itself instead
as a fatal message saying the system has run out of serial numbers.)

At this point, one possibility would be to go deeper into the parsing-code
changes and try to root out where the corruption is occurring.  I'm
starting to think that is the wrong way to go, however.  Here are some

* If an object of a given type contains a dynamically allocated
  mp_number_data structure, then objects of that type ALWAYS contain such
  a structure.
* Moreover, objects of any given type always contain the same number of
  mp_number_data structures.
* mp_number_data structures live (or, in an ideal world without memory
  leaks, SHOULD live) exactly as long as their parent objects.
* All mp_number_data structures are the same size (at least when allocated
  by the new_number macro, which hardcodes them to be of "scaled" type)

structures just contained the mp_number_data structures directly, instead
of containing pointers to them, then there would be no need to manage the
pointers.  As well as reducing the demand on programmers to de-allocate
everything that gets allocated (with the resulting load on the fake-OOP
framework to track the information to make deallocation possible), having
them be directly part of the parent objects instead of separately
allocated would save time (all those alloc and de-alloc calls, as well as
pointer dereferences every time the structures are used) and memory (the
malloc bookkeeping of additional objects).  The basic rationale for
dynamic allocation (being able to accomodate objects whose numbers and
sizes are unpredictable) doesn't seem to apply here because the number
data structures always go with their parents anyway, and we are already
managing the parents dynamically.

So in short:  I think what needs to be done is to roll back the change
that occurred between revisions 1507 and 1508.  Making these number
structures dynamic is not actually a good idea after all, at least not
until all parts of the code can really be trusted to de-allocate
everything that is allocated (which will be difficult to achieve both
because of human frailty and the cumbersome fake-OOP system baked into
the existing code).

Just for my own purposes it may be easiest for me to simply check out a
copy of revision 1507 and do my bug-fixing against that instead of the
latest and greatest.  It might be better for the world in general that I
check out the head revision, change all the number data structures back to
be parts of their parents, and then go to work on bug-fixing, because then
my serial number bug fix will be easier to share; but I don't want to be
working at cross-purposes with whoever introduced the dynamic allocation,
if they have a good reason I'm not aware of for making the number
structures dynamic, and I'm not eager to spend a lot more of my time on
fixing memory leaks when those were not my original interest.
Matthew Skala
mskala at ansuz.sooke.bc.ca                 People before principles.

More information about the metapost mailing list