Monitor Lambda Container Usage

AWS Lambda is on-demand compute where the deployment unit is a function rather than an application. It scales automatically, or as Amazon puts it:

You do not have to scale your Lambda Functions – AWS Lambda scales them automatically on your behalf. Every time an event notification is received for your Function, AWS Lambda quickly locates free capacity within its compute fleet and runs your code.

With this paragraph, Amazon gives you an indication of how Lambda scaling works without spelling it out. The key is the last sentence. Let's rephrase it a little:

Every time an event is received, AWS Lambda locates a free container and runs your code.

If we take this at face value, it means that for every request, any free "fleet capacity", or container, will run the Function. What does that really mean in practice? That's the question I set out to answer.

Since we rely on Amazon to scale Lambda, it is interesting to understand how it scales, especially since Lambda can suffer from a phenomenon commonly referred to as "Cold Starts". A cold start happens when a Lambda Function is invoked and a new container has to be initialized before it can process the request. This usually incurs a slight delay in execution, the length of which largely depends on how much memory you have allocated to your Function, which runtime you use and how large your source code package is.

Even more important, once you understand the quote above you will realize that cold starts do not only affect infrequently called Lambdas, they affect every Lambda. The key factor is not request frequency but the request pattern. I will demonstrate this with a simple example.

Testing Lambda Container Usage

Consider this node.js example. It checks for a file in /tmp, a file area that survives across Lambda invocations as long as subsequent invocations land in an already existing container. If the file does not exist, it creates it and returns "Miss". If the file exists, it simply returns "Hit". I've added a slight delay to the Function, 170 ms, to ensure that I can invoke it several times in parallel.

            var fs = require('fs');
            var file = '/tmp/container-testing';
            var minDuration = 170; // minimum execution time in ms, long enough to force parallel invocations
            exports.handler = function (event, context) {
                var start = Date.now();
                // A "Miss" means this container has not served a request before, a "Hit" means it has
                fs.stat(file, function (error, stats) {
                    if (error) {
                        if (error.code === 'ENOENT') {
                            // First invocation in this container: create the marker file
                            fs.writeFile(file, 'Hello world', function (error) {
                                if (error) {
                                    return respond(error);
                                }
                                respond(null, 'Miss');
                            });
                        } else {
                            return respond(error);
                        }
                    } else {
                        // The file is already there, so the container has been reused
                        respond(null, 'Hit');
                    }
                });
                function respond(error, result) {
                    // Pad the execution time so the Function never finishes in less than minDuration
                    var now = Date.now();
                    var wait = minDuration - (now - start);
                    setTimeout(function () {
                        console.log('Result:', result);
                        context.done(error, result);
                    }, wait);
                }
            };
        

I've deployed an API in API Gateway to serve as the access point for the Lambda Function. This way, I can use simple load testing tools such as Siege to invoke the Lambda.

As you probably know, Lambda is directly integrated with CloudWatch Logs. For each Function, a new Log Group is automatically created and the Function sends all logs to Log Streams in this Log Group throughout its lifetime. These Log Streams get very interesting when you start paying attention to them. They are the key to understanding Lambda scaling.
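
You don't need the console to look at them either. Assuming the Function is called container-testing (a name I use purely for illustration), Lambda writes to a Log Group named /aws/lambda/container-testing, and you can list its streams with the AWS CLI:

            aws logs describe-log-streams --log-group-name /aws/lambda/container-testing --order-by LastEventTime --descending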

In the quote, Amazon stated that Lambda will look for available capacity for every request. To test this, let's invoke the Lambda via the API endpoint with a concurrency of 1:

            siege -c 1 https://abc123.execute-api.eu-west-1.amazonaws.com/test
        

As can be seen from the screenshot, we have exactly one Log Stream in the CloudWatch Log Group. This Log Stream has exactly one log message that says "Miss"; the rest say "Hit". It looks like we have exactly one container handling all requests. What happens if we fire the same test with a concurrency of 2?

            siege -c 2 https://abc123.execute-api.eu-west-1.amazonaws.com/test
        

Now we have two Log Streams: the old one and a new one. Each of these streams contains the "Miss" log exactly once. This indicates that we now have two containers handling the requests.

If we try with 5, 10 or 20 concurrent requests we get the exact same behaviour. For each additional concurrent request, we see one more Log Stream with the "Miss" message, and each old Log Stream still contains only the one "Miss" from when the Stream was created.

At this point I'm ready to conclude that there is a direct correlation between Lambda containers and CloudWatch Log Streams, and further tests with the concurrency dialed up and down make it more and more obvious that this is in fact the case. It's not even far-fetched to assume that the seemingly random Log Stream names are in fact container ids, especially if you consider that [$LATEST], which is part of each stream name, is the invoked alias of the Lambda Function.

Monitor Container Usage

Now that we are fairly certain that a Log Stream corresponds to a container, we can monitor the number of active containers by looking at the streams and comparing the time of the last received event to the current time. The number won't be exact, since we don't know precisely how long a container is kept alive, but if we assume that Lambda keeps a container around for 10 minutes, we can check at regular intervals how many streams received log events in the past 10 minutes and report that number to CloudWatch as a custom metric.

I've written a Lambda that does exactly that. You can find the source code on GitHub.
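
To give a rough idea of the approach, here is a minimal sketch (not the code from the repository; the Log Group name, namespace and metric name are assumptions made for this example):

            // Minimal sketch: count Log Streams that have received events recently
            // and publish the count as a custom CloudWatch metric.
            // Log group name, namespace and metric name below are assumptions for this example.
            var AWS = require('aws-sdk');
            var logs = new AWS.CloudWatchLogs();
            var cloudwatch = new AWS.CloudWatch();

            var logGroupName = '/aws/lambda/container-testing'; // hypothetical Function name
            var activeWindow = 10 * 60 * 1000; // assume a container lives for about 10 minutes

            exports.handler = function (event, context) {
                var params = {
                    logGroupName: logGroupName,
                    orderBy: 'LastEventTime',
                    descending: true
                };
                // Pagination is omitted for brevity; this only looks at the first page of streams
                logs.describeLogStreams(params, function (error, data) {
                    if (error) {
                        return context.done(error);
                    }
                    var threshold = Date.now() - activeWindow;
                    // Streams with events inside the window are counted as active containers
                    var active = data.logStreams.filter(function (stream) {
                        return stream.lastEventTimestamp && stream.lastEventTimestamp > threshold;
                    }).length;
                    cloudwatch.putMetricData({
                        Namespace: 'Custom/Lambda',         // assumed namespace
                        MetricData: [{
                            MetricName: 'ActiveContainers', // assumed metric name
                            Value: active,
                            Unit: 'Count'
                        }]
                    }, function (error) {
                        context.done(error, active);
                    });
                });
            };

Scheduled to run every minute or so, for example via a CloudWatch Events rule, this gives a rough graph of container usage over time. The Function needs permissions for logs:DescribeLogStreams and cloudwatch:PutMetricData.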

Container Lifespan

As a side note, I got curious about how long a Lambda container can live if it is being actively called over a longer period of time. I had heard rumors that Lambda kills containers after 15 minutes regardless of how active they are, so I decided to run a test to see if this is really the case.

Again, I picked up Siege and ran a test with a concurrency of 1 and a delay of 60 seconds.

            siege -c 1 -d 60 https://abc123.execute-api.eu-west-1.amazonaws.com/test

I ended the test after having kept one container running for almost 4 hours. The rumors about a 15 minute maximum lifespan turned out to be false. Lambda containers can be long-lived. Perhaps I'll run another test to see exactly how long-lived they can be and get back with an update.
