strace - Blowing your application's cover

Introduction

This article is for the #DebuggingFeb writeathon contest by Hashnode. Every production issue is a drama. Here, I present one such drama in the classic five-act format.

Act 1: The Exposition

"Here, the audience learns the setting (Time/Place), characters are developed, and a conflict is introduced."

I was working on a product that offers 'Email Security' as a service. This was at a time before the advent of cloud services like AWS, GCP, etc. The engineering and operations teams were responsible not only for developing and supporting the product features but were also involved in the maintenance of the underlying infrastructure.

We had around five data centers across the globe. Customers would be assigned to one of these data centers. Email traffic from the customer's domains would be routed to our data center, where we scan the email messages for Spam, Virus, and Phishing content and route the clean messages back to the customer's mail server.

There was a project to upgrade hardware and versions of software technologies that we use across our product. Each team was assigned different components. One of the components assigned to my team was an API implemented as an Apache mod_perl module that gets embedded inside an Apache HTTPD server.

During the email scanning process, messages arrive in a queue. For each message in the queue, the queue consumer calls our API and uses the response in further processing of the message. If our API took longer to respond, it would increase the overall processing time of every message we scanned.

We upgraded to use the latest versions of Perl, Apache HTTPD server and the operating system (OS) version of the server in which our API runs. The physical hardware in which our servers run as a virtual machine (VM) was also upgraded. A performance test was done in the QA environment to come up with an estimate for the number of VMs to be deployed in PROD for each region.

The upgrade in the first data center went smoothly. After a week, we upgraded our second data center. This is when things started to turn bad for us.

Act 2: Rising Action

This act leads the audience to the climax. It is common for complications to arise, or for the protagonist to encounter obstacles.

While the first data center we upgraded was our smallest in terms of traffic volume, the second was our largest. During peak hours, we started to see delays in the roundtrip time; we were taking more time to scan each message. Customers started to complain about delays in receiving and sending messages. The component that calls our API was taking longer to process each message, and a backlog built up in the queue it consumes. It was clear that our API was taking more time to process each request and was the bottleneck.

With the upgraded versions of our software stack and underlying hardware, we were expecting improvements but instead got a nasty surprise. Since this was an upgrade of our software stack, rollback was not easy either. We could not install older versions of our code on the new VMs running newer versions of Perl, Apache HTTPD and the OS. We could bring up the older VMs running on the older hardware and route requests to the older version of our API, but this rollback would consume a lot of time and effort. We had to decide quickly whether we wanted to take this route.

Act 3: The Climax

This is the turning point of the play. The climax is characterized by the highest amount of suspense.

There were not many code changes in the release. We only fixed a few minor bugs along with updating the software stack, so this performance degradation was a surprise. I needed a way to see where in our code time was being spent while processing each request.

I tried to look into our API logs to identify which part of the request was taking time. One method I always use in such scenarios is to turn on debug logs and take a look at the timestamp of all the log lines from a single request. Let's say for each request there are five log lines and the difference in time between the third and fourth is high. This indicates that the problem is somewhere in the code between the third and fourth lines. If we have the log lines in the correct places, this would help in narrowing down the problem. But in this case, I could not identify the problem with this approach. This was when I turned to 'strace' for help.

'strace' is a debugging tool in Linux that attaches itself to a process and intercepts and logs each system call executed by the process. It can log each system call along with its input parameters, its output, and the time it takes to execute. Whatever language a program is written in, when it runs it ultimately executes a subset of the system calls that the underlying operating system supports. For example, consider the Java and Python programs below.

import java.io.File;

public class TestReadDir
{
    private static final String MAILS_DIR = "/storage/";
    private static final String[] users = {"user1", "user2", "user3"};

    public static void main(String[] args)
        throws InterruptedException
    {
        while (true)
        {
            Thread.sleep(5000);
            for (String user : users)
            {
                String path = MAILS_DIR + user;

                File[] files = new File(path).listFiles();
                System.out.println("Count of files in " + path + " is " + files.length);
            }
        }
    }
}

# Python version of the same program: count the files in each user's directory
import os
import time

MAILS_DIR = "/storage/"
users = ["user1", "user2", "user3"]

while True:
    time.sleep(5)
    for user in users:
        path = MAILS_DIR + user
        file_list = os.listdir(path)
        print("Count of files in ", path, " is ", len(file_list))

Both programs have the same functionality: for a given list of users, they construct a directory path and count the number of files in that directory. Now take a look at the output from 'strace' for both programs.
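
For reference, one common way to attach 'strace' to an already running process and record per-call timings is sketched below; the PID and output file are placeholders:

# -p attaches to a running process, -f follows child processes and threads,
# -T prints the time spent inside each system call, -o writes the trace to a file
strace -f -T -p <pid> -o /tmp/api-trace.out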

'strace' output for the Java program:

[pid 22097] open("/storage/user1", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 8 <0.000506>
[pid 22097] fstat(8, {st_mode=S_IFDIR|0755, st_size=6144, ...}) = 0 <0.000014>
[pid 22097] getdents(8, /* 3 entries */, 32768) = 80 <0.000015>
[pid 22097] getdents(8, /* 0 entries */, 32768) = 0 <0.000013>
[pid 22097] close(8)                    = 0 <0.000014>
[pid 22097] write(1, "Count of files in /storage"..., 44) = 44 <0.000015>
[pid 22097] write(1, "\n", 1)           = 1 <0.000027>
[pid 22097] open("/storage/user2", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY <unfinished ...>
[pid 22100] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out) <1.000071>
[pid 22100] futex(0x7f6a400d3728, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000016>
[pid 22100] futex(0x7f6a400d3778, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=999999065} <unfinished ...>
[pid 22097] <... open resumed> )        = 8 <0.000359>
[pid 22097] fstat(8, {st_mode=S_IFDIR|0755, st_size=6144, ...}) = 0 <0.000015>
[pid 22097] getdents(8, /* 3 entries */, 32768) = 80 <0.000017>
[pid 22097] getdents(8, /* 0 entries */, 32768) = 0 <0.000015>
[pid 22097] close(8)                    = 0 <0.000016>
[pid 22097] write(1, "Count of files in /storage"..., 44) = 44 <0.000031>
[pid 22097] write(1, "\n", 1)           = 1 <0.000121>
[pid 22097] open("/storage/user3", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 8 <0.000348>
[pid 22097] fstat(8, {st_mode=S_IFDIR|0755, st_size=6144, ...}) = 0 <0.000015>
[pid 22097] getdents(8, /* 3 entries */, 32768) = 80 <0.000017>
[pid 22097] getdents(8, /* 0 entries */, 32768) = 0 <0.000015>
[pid 22097] close(8)                    = 0 <0.000015>
[pid 22097] write(1, "Count of files in /storage"..., 44) = 44 <0.000028>
[pid 22097] write(1, "\n", 1)           = 1 <0.000262>

'strace' output for the Python program:

open("/storage/user1", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3 <0.000527>
fstat(3, {st_mode=S_IFDIR|0755, st_size=6144, ...}) = 0 <0.000019>
getdents(3, /* 3 entries */, 32768)     = 80 <0.000020>
getdents(3, /* 0 entries */, 32768)     = 0 <0.000020>
close(3)                                = 0 <0.000020>
write(1, "('Count of files in ', '/storage"..., 59) = 59 <0.000019>
open("/storage/user2", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3 <0.000399>
fstat(3, {st_mode=S_IFDIR|0755, st_size=6144, ...}) = 0 <0.000017>
getdents(3, /* 3 entries */, 32768)     = 80 <0.000017>
getdents(3, /* 0 entries */, 32768)     = 0 <0.000016>
close(3)                                = 0 <0.000017>
write(1, "('Count of files in ', '/storage"..., 59) = 59 <0.000016>
open("/storage/user3", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3 <0.000335>
fstat(3, {st_mode=S_IFDIR|0755, st_size=6144, ...}) = 0 <0.000025>
getdents(3, /* 3 entries */, 32768)     = 80 <0.000012>
getdents(3, /* 0 entries */, 32768)     = 0 <0.000012>
close(3)                                = 0 <0.000011>
write(1, "('Count of files in ', '/storage"..., 59) = 59 <0.000014>

You can see that the output from 'strace' for both programs is similar. For each directory, the 'open', 'fstat', 'getdents' and 'close' system calls are executed, and the last column on each line is the time taken by that system call in seconds. Note the 'getdents' system call: that is where the program reads the contents of a directory to find out how many files it contains.

'strace' tears apart your program and provides all the details of the actual system calls that are executed. If you have not played around with 'strace' before, I strongly suggest looking at the 'strace' output of the code you are currently working on. You will have a much better understanding of what your code is doing.
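
You do not have to attach to an existing process; you can also start a program directly under 'strace'. A minimal sketch, assuming the Python example above is saved as count_files.py (the file name is made up):

# Run the program under strace from the start; the trace is usually large,
# so -o keeps it in a file for later inspection
strace -f -T -o /tmp/count_files.strace python count_files.py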

Now, coming back to my production problem, I attached 'strace' to our API process and recorded the timings of all the system calls being executed. I went through the output and looked at which system calls were taking the most time. And there it was: one of the system calls was consistently taking a few seconds to execute.
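
In practice, a quick way to spot such outliers without reading the whole trace is strace's summary mode, which prints per-system-call totals when it exits. A rough sketch; the PID is a placeholder for one of the Apache worker processes:

# -c prints a summary table on exit: total time, number of calls and errors per system call
strace -c -f -p <apache-worker-pid>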

Act 4: Falling Action

The opposite of Rising Action, in Falling Action the story is coming to an end, and any unknown details or plot twists are revealed and wrapped up.

The problematic system call was 'getdents'. For each request, one of the steps in the API was very similar to the sample programs listed above: it constructed a directory path and counted the number of files in that directory. The directory in question was, in fact, an NFS mount point.

Now the question was: why was this operation taking so long after the upgrade? What changed? I googled around and found a useful command called 'mountstats' that gives you metrics about the operations happening on an NFS mount point. The output looks like the one below. For each NFS operation, it gives metrics like the total count, time taken per operation, total time spent, etc.

$ cat /proc/self/mountstats
        per-op statistics
                NULL: 0 0 0 0 0 0 0 0
                READ: 23 66 46 46 56745 4576 1340 140
                WRITE: 0 0 0 0 0 0 0 0
                CREATE: 0 0 0 0 0 0 0 0
                MKDIR: 0 0 0 0 0 0 0 0
                READDIR: 0 0 0 0 0 0 0 0
                READDIRPLUS: 3453 5674 0 3525 35235 34525 57457 4577  
                FSSTAT: 0 0 0 0 0 0 0 0

I ran this command in the upgraded data center and also in another data center where we had not performed the upgrade, and found one crucial difference in the output. In the upgraded data center, we were performing 'READDIRPLUS' operations on the NFS mount point, while in the other data centers we were performing 'READDIR' operations. It looked like something had changed around NFS mount points, which was very much possible since we had upgraded the underlying OS as well. My next stop for help was the NFS man pages.

It turns out that, in newer versions of the NFS client, reading a directory makes the client issue a 'READDIRPLUS' request to the NFS server, while older versions issued a 'READDIR' request. The difference is that 'READDIRPLUS', in addition to fetching the list of files in the directory, also fetches all the attributes of each file. Roughly speaking, a 'READDIR' is like running an 'ls' on the NFS server, whereas 'READDIRPLUS' is like additionally running 'stat' on every file in that 'ls' output.
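
A rough client-side analogy, if you want to feel the difference yourself on an NFS mount (the path is just the one from the earlier examples):

ls /storage/user1      # names only, roughly the information READDIR provides
ls -l /storage/user1   # names plus per-file attributes, the extra work READDIRPLUS does up front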

Act 5: Resolution

This is the final outcome of the drama. Here the author's tone about his or her subject matter is revealed, and sometimes a moral or lesson is learned.

The 'stat' operation is costly. Our NFS server, a NetApp filer, was not able to handle this additional load, and thus the processing times of our API increased. We needed a way to disable the 'READDIRPLUS' operations and revert to the old behavior: our API only needs the number of files present in a directory, not the attributes of each file.

Again, the man pages provided the solution. There is an NFS mount option, 'nordirplus', to change the default behavior and use only 'READDIR'.

See the below snippet from the man pages.

rdirplus / nordirplus

Selects whether to use NFS v3 or v4 READDIRPLUS requests. If this option is not specified, the NFS client uses READDIRPLUS requests on NFS v3 or v4 mounts to read small directories. Some applications perform better if the client uses only READDIR requests for all directories.

We added this option to our '/etc/fstab' file and remounted the remote file system. We started to see immediate improvements, and the response time of our API was back to normal.
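
For illustration, the change amounts to something like the following; the server name and export path are made up, and the mount point matches the earlier examples:

# /etc/fstab entry for the mail storage mount, with READDIRPLUS disabled
filer.example.com:/vol/mail   /storage   nfs   rw,nordirplus   0 0

# unmount and mount again so the new option takes effect
umount /storage
mount /storage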

One important lesson for me in all this was that we need to be aware of the differences between the QA/DEV environments and the PROD environment. In this case, though we had similar NFS mount points in QA/DEV, we were not using a NetApp filer; we simply mounted a file system exported from another Linux VM on our API servers. It looks like those Linux VMs were able to handle the additional load from the 'stat' operations better than the NetApp filer, which was probably an older piece of hardware. Being aware of such differences helps in narrowing down the root cause of a problem.

Hope you all enjoyed this drama. Reviews are welcome :) Please follow me for more such articles.

#DebuggingFeb
