
An interesting memory leak...

This week we finally got around to fixing a memory leak we’ve been aware of for a while now.

We have a few different APIs that power Datawallet (both internal and external facing), and each is deployed as a cluster of Docker containers sitting on top of EC2 machines. One of the best things about this setup is easy zero-downtime deployments, where new Docker containers are deployed and registered as healthy before the old ones are killed.

Usually our metric graphs are set to show activity for the last 3-7 days, and since we deploy so often, it can be hard to catch memory leaks on those graphs.

Here is what the memory utilization graph looked like for our user-api cluster on any given day:

[Screenshot: one-day.png, user-api memory utilization over a single day]

However, zooming out a bit to show the last couple of months:

[Screenshot: two-months.png, user-api memory utilization over the last two months]

The memory leak became immediately obvious. Memory usage seemed to rise by a consistent 4-6% every day and then drop whenever we deployed.

So how did we fix this? The first step was to run the user-api locally and inspect the heap. Because our APIs are built with Node.js, it's easy enough to use the --inspect flag and look at the heap dump using the Chrome dev tools.
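If you haven't used it before, you start the process with node --inspect and attach the Chrome dev tools to take heap snapshots. You can also capture a snapshot programmatically with Node's built-in inspector module; here is a minimal sketch (not our exact setup, and the file name is arbitrary):

const inspector = require('inspector');
const fs = require('fs');

// Minimal sketch: stream a .heapsnapshot file to disk, which can then be
// loaded into the Memory tab of the Chrome dev tools.
function writeHeapSnapshot(file) {
  const session = new inspector.Session();
  session.connect();
  const out = fs.createWriteStream(file);
  session.on('HeapProfiler.addHeapSnapshotChunk', (m) => out.write(m.params.chunk));
  session.post('HeapProfiler.takeHeapSnapshot', null, (err) => {
    out.end();
    session.disconnect();
    if (err) console.error(err);
  });
}

writeHeapSnapshot(`heap-${Date.now()}.heapsnapshot`);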

We decided to immediately load test the user-api, using autocannon to send a multitude of POST requests to one of our endpoints. The following screenshot shows the size of the heap dumps on startup, after an initial load test, and then after another.

[Screenshot: dump-sizes, heap snapshot sizes on startup and after two load tests]

We can see the heap size increasing by around 10 MB per load test, which definitely shouldn't be happening.
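For reference, a load test like this can be scripted with autocannon's Node API. The endpoint, payload, and numbers below are made up for illustration:

const autocannon = require('autocannon');

// Illustrative only: hypothetical endpoint and payload, not our real API.
autocannon({
  url: 'http://localhost:3000/some-endpoint',
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ example: 'payload' }),
  connections: 10,
  duration: 30 // seconds
}, (err, results) => {
  if (err) throw err;
  console.log('avg requests/sec:', results.requests.average);
});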

Doing a comparison between the snapshots, we saw that the size of the string constructor was growing hugely between tests, showing a lot of stringified errors.

[Screenshot: errors, snapshot comparison showing stringified errors under the string constructor]

To cut the story a bit short, this was hugely confusing because our first thought was that errors weren't being garbage collected. However, we were unable to find any sneaky closures keeping references to them. We even tried running node with the --expose-gc flag and manually calling global.gc(); after our load tests. This still didn't cause the heap size to decrease.
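That experiment is only a couple of lines, something like this (assuming the process was started with node --expose-gc, which is the only time global.gc is defined):

// global.gc only exists when node was started with --expose-gc.
if (typeof global.gc === 'function') {
  global.gc();
}
// process.memoryUsage().heapUsed is reported in bytes.
console.error('Heap used (MB):', process.memoryUsage().heapUsed / 1024 / 1024);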

After some digging through Google, searching specifically for Global handles and GC roots, we found a comment at the end of an old Google Groups mailing list thread.

[Screenshot: comment, the Google Groups comment we found]

This seemed like magic, so we replaced the console.log (well, a console.error in our case) in our API:

[Screenshot: diff, the change replacing the console call]

and found that heap size remained reasonably stable throughout our load tests.

[Screenshot: fixed-heaps, heap snapshot sizes staying stable across load tests]
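The real change is the one in the diff above; as a purely hypothetical illustration of that kind of swap, it could amount to serializing the error ourselves and writing it straight to the stream rather than handing the object to console:

// Hypothetical stand-in for the change in the diff above, not our exact code.
function logError(err) {
  process.stderr.write(`${(err && err.stack) || err}\n`);
}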

This was great news, but a bit hard to understand: we couldn't figure out why the console module would hold on to references, and even digging through the console source code we couldn't see where those references would be kept.

One reason this is a hard problem to Google for is that there are a number of people who search or ask about the term node js console memory leak. However, they have a fundamentally different issue; consider the following code:

for (let i = 0; i < Infinity; i++) {
  console.log('Hello: ' + i);
}

This tight for loop blocks the event loop from processing and, by extension, stops the garbage collector from running, eventually leading to the heap out of memory crashes we all know:

[Screenshot: crash, the familiar heap out of memory error]

However, this thread also mentions the dev console being the one holding references, so we abandoned the --inspect flag for a moment and used process.memoryUsage() to check the heap size after every load test:

With --inspect:

Heap size (mb): 5194.4921875
Heap size (mb): 21870.1796875
Heap size (mb): 21383.625
Heap size (mb): 28858.96875
Heap size (mb): 36361.6796875

Without --inspect:

Heap size (mb): 7686.3046875
Heap size (mb): 12105.3046875
Heap size (mb): 4949.1328125
Heap size (mb): 12758.40625
Heap size (mb): 21043.5703125
Heap size (mb): 4955.046875
Our so-called leak had disappeared! How annoying, but in hindsight this makes sense: there's no way there could be so much code running in the world with the console module leaking this badly. So we were back to square one.

After hitting a few more commonly used endpoints we found what appeared to be another leak!

[Screenshot: leak-2-heap, heap snapshots for the second suspected leak]

On this endpoint we see that memory is increasing by a huge amount, and doing a comparison we see that the string constructor is being filled with a stringified function with the signature function(parsers,rowData /*``*/).

After a bit of searching through the Sequelize library and then the pg module, we found the following line of code:

this.RowCtor = Function('parsers', 'rowData', ctorBody)
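For context, the Function constructor compiles a brand-new function object from a source string every time it is called; here is a toy example (the ctorBody below is made up, not pg's real one):

// Toy example: each call compiles a new function from the string body, so if
// something retains every compiled function, their source strings pile up.
const ctorBody = 'return [parsers[0](rowData[0])];'; // made-up body
const RowCtor = Function('parsers', 'rowData', ctorBody);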

That pg line looks a lot like the function signature in our dev tools. Some more searching revealed an issue where others were experiencing the exact same thing as us. The fix had been merged about a month earlier, so we bumped our pg version:

[Screenshot: pg-version-bump, bumping the pg dependency version]

And now our user-api cluster has reasonably consistent memory consumption: no more leak!

[Screenshot: fixed-cloudwatch-logs, CloudWatch memory utilization now staying flat]

Published 1 Jun 2018
