Troubleshooting Kubernetes FailedAttachVolume and FailedMount

Nov 9, 2021
5 min read

Simple Kubernetes workloads can sometimes fail and be restarted by the kubelet to a clean state without any problem. Nontrivial workloads (for example, when containers need to persist state or share files with other containers) need a way to recover their previous state whenever they restart.

Persistent Volumes provide an API that allows Kubernetes administrators to manage volumes in a safe and abstracted way, without needing to understand the nitty-gritty of different storage providers. They also provide a convenient way for Pods to store the state they need to perform their tasks.
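
For illustration, here's a minimal PersistentVolumeClaim sketch. The claim name, size, and access mode are placeholder values, and it assumes a default StorageClass that can provision the disk dynamically:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim          # placeholder name
spec:
  accessModes:
    - ReadWriteOnce         # mountable read-write by a single node at a time
  resources:
    requests:
      storage: 10Gi         # placeholder size
```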

When working with Persistent Volumes, two errors often seen in Kubernetes are FailedAttachVolume and FailedMount. These errors generally mean there was a failure using the desired volume, which, in turn, prevented workloads from functioning as intended.

Since there can be many different reasons why an underlying volume malfunctions, you need to dig deeper to find the root cause. In this article, you'll learn how to troubleshoot these errors when you encounter them.

Understanding Persistent Volumes

Persistent Volumes are storage resources created dynamically or statically by administrators, just like any other Kubernetes resource. Each volume has its own life cycle, independent of the individual Pod that uses it; a strict dependency between a Pod and a Persistent Volume would prevent normal workload operation.

Once a Persistent Volume object is created, an underlying disk is also created, which, in turn, is attached to the scheduled node and mounted on the desired path. When the workload needs to move somewhere else in the cluster, the reverse process occurs: the volume is unmounted, detached from the node, and then attached and mounted at its new destination.

When working with dynamically provisioned volumes in cloud environments (e.g., AWS, Azure, or Google Cloud Platform), it’s not uncommon for Persistent Volume life cycles to be broken, preventing the underlying disk from being correctly detached and attached. This will prevent correct workload scheduling, potentially causing downtime or data loss.
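
If you suspect a broken life cycle, one quick check (assuming your cluster uses attachable volumes, such as CSI-backed cloud disks) is to list the VolumeAttachment objects Kubernetes is tracking:

```shell
# shows which volumes Kubernetes believes are attached to which nodes
kubectl get volumeattachments
```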

Troubleshooting the error

The Persistent Volume life cycle can be broken for a number of reasons:

  • node failure
  • underlying service API call failure
  • network partition
  • incompatible access mode (e.g., a ReadWriteOnce volume needed on more than one node)
  • new node already has too many disks attached
  • new node does not have enough mount points

These issues usually manifest themselves through Pods failing to start and becoming stuck in an endless waiting loop. To help diagnose the issue, you’ll need to describe a Pod and try to understand what’s going on:

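The Pod name below is a placeholder; substitute the name of your failing Pod:

```shell
kubectl describe pod <pod-name>
```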

In the Events section at the end of the output, you'll find a series of messages related to the Pod's life cycle that can help you diagnose the issue.

The failures can generally be divided into two main categories. On one side, there are detach failures, where Kubernetes is unable to detach a disk from a specific node. On the other side, there are attach and mount failures, where Kubernetes can’t attach and/or mount a disk on the new node.

FailedAttachVolume

FailedAttachVolume occurs when a volume can't be detached from its current node and attached to a new one. When Kubernetes performs the detach and attach operation, it first checks whether the volume is safe to detach and aborts the operation if the check fails; Kubernetes also never force detaches a volume. This error indicates a fundamental failure in the underlying storage infrastructure, confirmed by the message Volume is already exclusively attached to one node and can't be attached to another. There can be other causes, such as too many disks already attached to a node, but they will be indicated in the message.

FailedMount

FailedMount means a volume can't be mounted on a specific path. It can be a consequence of the previous error, since the mount operation happens after attach: when the attach operation fails, the mount timeout expires and the mount operation becomes impossible. Other causes include an incorrect device path or device mount path.

Recovering from the failure

Since Kubernetes can't handle the FailedAttachVolume and FailedMount errors on its own, you sometimes have to take manual steps.

Failure to detach

When Kubernetes fails to detach a disk, you can use the storage provider’s CLI or API to detach it manually. For example, when using Azure, you can detach a disk from a virtual machine by running this code:

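The resource group, VM, and disk names below are placeholders for your own values:

```shell
# detach the disk from the VM backing the Kubernetes node
az vm disk detach \
  --resource-group <resource-group> \
  --vm-name <vm-name> \
  --name <disk-name>
```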

When using AWS EBS volumes, you can perform the same operation by running this command:

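Here, the volume ID is a placeholder for the actual ID of the EBS volume:

```shell
# detach the EBS volume from its current instance
aws ec2 detach-volume --volume-id <volume-id>
```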

Failure to attach or mount

There may be situations when Kubernetes can detach the volume but is unable to attach or mount the disk in the scheduled node. In this situation, the easiest way to overcome the issue is to force Kubernetes to schedule the workload to another node. This can be done in a few different ways.

Cordon

Cordon marks a node as unschedulable. This means the Kubernetes Scheduler will not treat a cordoned node as available for new Pods. Let's say you have a Pod scheduled to node-2, but it's unable to start because the node doesn't have enough mount points available. The node can be cordoned using kubectl:

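Using the node name from the example above:

```shell
# mark node-2 as unschedulable
kubectl cordon node-2
```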

And then the Pod can be rescheduled to another node:

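Assuming the Pod is managed by a controller such as a Deployment, deleting it causes a replacement to be scheduled on another available node; the Pod name is a placeholder:

```shell
# delete the stuck Pod so its controller recreates it elsewhere
kubectl delete pod <pod-name>
```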

Node selectors, affinity, and anti-affinity

Node selectors, affinity, and anti-affinity tell Kubernetes whether to schedule Pods on specific nodes. Nodes carry labels that can be referenced in a nodeSelector, as well as in affinity and anti-affinity rules, to force Pods to be scheduled accordingly.

The simplest mechanism is nodeSelector, where a node is assigned a label and the Pod is configured with a matching selector. For example, if you are sure node-1 can have another disk attached and has enough mount points available, you can run this command:

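This applies the schedule=nginx label used in this example; the label key and value are just a convention you choose:

```shell
# label node-1 so Pods can select it
kubectl label nodes node-1 schedule=nginx
```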

You can then configure the Pod with the schedule=nginx node selector:

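Here's a minimal Pod manifest sketch; the Pod name and image are illustrative, and the nodeSelector field is what ties the Pod to the labeled node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
  nodeSelector:
    schedule: nginx       # matches the label applied to node-1
```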

Final thoughts

Persistent Volumes provide an abstraction that allows Kubernetes workloads to easily provision persistent storage that can survive restarts and scheduling to different nodes. Sometimes the Persistent Volume life cycle is broken and Kubernetes can't perform rescheduling on its own. FailedAttachVolume and FailedMount are two common errors in this situation, meaning Kubernetes was unable to detach, reattach, or mount a volume. When this happens, you may need to manually detach a disk or instruct the Kubernetes Scheduler to start the Pod on a specific node.

The first step to fixing any issue is understanding it. Unless you are proactively alerted, you'll have to spend precious time finding the root cause while downtime, or even worse, data loss, accumulates.

If you're looking for a streamlined way to monitor Kubernetes workloads and ensure errors are fixed quickly, try using Airplane. With Airplane, you can transform scripts, queries, APIs, and more into custom internal UIs and workflows that can help you monitor incidents.

With Airplane, you can build single or multi-step operations that anyone can use (called Tasks) and customize internal UIs quickly (called Views). Airplane also offers an extensive out-of-the-box template and component library that makes it easy to get started.

To build your first UI that can help monitor your Kubernetes workloads, sign up for a free account or book a demo.

Ricardo Castro
Ricardo Castro is a Senior Site Reliability Engineer at FARFETCH, as well as a Certified Kubernetes Administrator (CKA) and Certified Kubernetes Application Developer (CKAD).
