
Kubernetes Pod Scheduling Explained: Taints, Tolerations, and Node Affinity

Introduction

Managing Kubernetes clusters sometimes requires precise control over pod placement, so that workloads end up on the nodes best suited to them. Kubernetes offers several mechanisms for this, including taints, tolerations, and node affinity. In this post we'll explore each method, provide use cases, and demonstrate how to combine them for optimal pod scheduling.

We’ll tackle a specific problem: scheduling specific pods to specific nodes and ensuring no other pods are placed on these specialized nodes.

Believe me, it's not entirely obvious on the first attempt.

Some terminology used in this article

Optimized Node:

An optimized node in this article refers to a node that is configured to handle specific types of workloads, such as compute-intensive, memory-intensive, or storage-purpose tasks. These nodes are fine-tuned to provide the necessary resources and environment to ensure optimal performance for the pods they host.

Official terminology 

Taints:

Taints are applied to nodes to repel certain pods, ensuring that only specific pods can be scheduled on those nodes. This helps in maintaining the specialized environment of the node.

Tolerations:

Tolerations are applied to pods, allowing them to be scheduled on nodes with matching taints. This ensures that only the intended pods are placed on the specialized nodes.

Node Affinity:

Node affinity defines rules for pods to prefer or require scheduling on nodes with specific labels. This provides a more flexible and powerful way to control pod placement compared to node selectors.

Case Study Introduction

Let's set up a scenario.

[Diagram: Kubernetes taints, tolerations and node affinity]

You have a cluster with several nodes dedicated to specific workloads: compute-intensive, memory-intensive, and storage-intensive. In addition, there are other pods to be scheduled, and other, non-specialized nodes in the cluster.

node1, node2 and node3 are optimized nodes.

You need to ensure that specific pods are scheduled on these nodes and that no other pods are placed on these specialized nodes. At the same time, these specific pods must not be placed on any nodes other than the ones that serve their purpose.

We will try to solve this problem by using each method individually and then by combining them. The aim is to find the optimal solution to this challenge and to cover the drawbacks along the way.

Attempt 1 - Using Taints and Tolerations

In this first attempt, we will try to solve our challenge by just using the Taints and Tolerations.

Taints and Tolerations Overview

Taints and tolerations are mechanisms in Kubernetes that let you control which pods can be scheduled on which nodes. Taints are applied to nodes to repel pods, while tolerations are applied to pods to allow them to be scheduled on nodes with matching taints.

First, we will taint the nodes:

kubectl taint nodes node1 dedicated=compute:NoSchedule
kubectl taint nodes node2 dedicated=memory:NoSchedule
kubectl taint nodes node3 dedicated=storage:NoSchedule

The taint effect determines what happens to pods that cannot tolerate the taint. There are three effects:

  • NoSchedule: Pods that cannot tolerate this taint will not be scheduled on the node.
  • PreferNoSchedule: The scheduler will try to avoid scheduling pods that cannot tolerate this taint on the node, but it’s not a strict rule.
  • NoExecute: If a pod cannot tolerate this taint, it will be evicted from the node if it is already running (a small toleration sketch follows this list).
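
Worth noting: a toleration can pair the NoExecute effect with tolerationSeconds, which delays eviction rather than blocking it. Below is a minimal sketch, assuming the node carries a dedicated=compute:NoExecute taint instead of the NoSchedule taints used in this article:

apiVersion: v1
kind: Pod
metadata:
  name: compute-pod
spec:
  containers:
  - name: compute-container
    image: compute-container-image
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "compute"
    effect: "NoExecute"
    # If a matching NoExecute taint is added to the node, the pod keeps running
    # for up to 3600 seconds before it is evicted
    tolerationSeconds: 3600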

So far, what we have achieved is that no pods without a matching toleration will be scheduled on node1, node2, or node3.
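
To double-check which taints ended up on the nodes, the describe output includes a Taints field:

kubectl describe node node1 | grep Taints
kubectl describe node node2 | grep Taints
kubectl describe node node3 | grep Taints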

Now, we need to add tolerations to the specific workloads.

compute-pod toleration:

apiVersion: v1
kind: Pod
metadata:
  name: compute-pod
spec:
  containers:
  - name: compute-container
    image: compute-container-image
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "compute"
    effect: "NoSchedule"

memory-pod toleration:

apiVersion: v1
kind: Pod
metadata:
  name: memory-pod
spec:
  containers:
  - name: memory-container
    image: memory-container-image
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "memory"
    effect: "NoSchedule"

storage-pod toleration:

apiVersion: v1
kind: Pod
metadata:
  name: storage-pod
spec:
  containers:
  - name: storage-container
    image: storage-container-image
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "storage"
    effect: "NoSchedule"

The pods can now be scheduled on their designated nodes. Great!
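
To try this out, you could save the manifests to files (the filenames below are just placeholders) and check which node each pod landed on:

kubectl apply -f compute-pod.yaml -f memory-pod.yaml -f storage-pod.yaml
# The NODE column shows where each pod was scheduled
kubectl get pods -o wide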

Now to the problem. Look at the diagram: the storage pod is placed on another node, not the storage-optimized node we wanted.

[Diagram: pod scheduling with taints and tolerations only]

While this ensures that only specific pods can be scheduled on the tainted nodes, it doesn’t prevent these pods from being scheduled on other nodes without taints.

❌ Unfortunately, taints and tolerations alone didn't solve the challenge. The special pods with the tolerations can still be placed on nodes other than the ones we expect.

Attempt 2 - Using Node Affinity

Node affinity is a set of rules used by the Kubernetes scheduler to determine which nodes a pod can be scheduled on, based on the labels of the nodes. It’s more flexible than Node Selector, allowing for complex expressions.

Instead of tainting the nodes, we'll label them as follows:

kubectl label nodes node1 dedicated=compute
kubectl label nodes node2 dedicated=memory
kubectl label nodes node3 dedicated=storage
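
A quick way to verify the labels is to list the nodes that carry the dedicated key:

kubectl get nodes -l dedicated --show-labels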

❗Node affinity offers greater flexibility and control over pod scheduling compared to nodeSelector. While nodeSelector only supports simple key-value matching, node affinity allows for complex rules using operators like In, NotIn, and Exists. Additionally, node affinity supports both mandatory (hard) and preferred (soft) constraints, enabling more nuanced and efficient scheduling decisions. This results in better resource utilization and optimized workload management within a Kubernetes cluster.

Example nodeSelector configuration:

apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd
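
For comparison, here is a sketch of a rule nodeSelector cannot express: it requires an SSD or NVMe disk while excluding the storage-dedicated nodes (the disktype labels are hypothetical and only used for illustration):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd
          - nvme
        - key: dedicated
          operator: NotIn
          values:
          - storage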

Let's now add a nodeAffinity rule to the pods.

compute-pod - affinity rule:

apiVersion: v1
kind: Pod
metadata:
  name: compute-pod
spec:
  containers:
  - name: compute-container
    image: compute-container-image
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - compute

memory-pod affinity rule:

apiVersion: v1
kind: Pod
metadata:
  name: memory-pod
spec:
  containers:
  - name: memory-container
    image: memory-container-image
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - memory

storage-pod affinity rule:

apiVersion: v1
kind: Pod
metadata:
  name: storage-pod
spec:
  containers:
  - name: storage-container
    image: storage-container-image
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - storage

There are two types of node affinity constraints:

  1.  requiredDuringSchedulingIgnoredDuringExecution: The scheduler must find a node that satisfies this rule; otherwise, it cannot schedule the Pod. Does not affect running pods.
  2.  preferredDuringSchedulingIgnoredDuringExecution: The scheduler attempts to find a node that satisfies this rule. If no such node is available, the scheduler will still schedule the Pod on another node. In other words, there is no guarantee that the pod will end up on the node we want (a short example follows this list).
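
For completeness, a preferred (soft) rule looks like the fragment below; the weight (1-100) is an arbitrary choice here, and we won't rely on this variant for our challenge precisely because it offers no guarantee:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      preference:
        matchExpressions:
        - key: dedicated
          operator: In
          values:
          - compute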

The required node-affinity configurations we gave compute-pod, memory-pod, and storage-pod ensure that each type of pod can be scheduled on the appropriate node based on node labels, without using tolerations:

  • Compute Pods: Scheduled only on nodes labeled with dedicated=compute.
  • Memory Pods: Scheduled only on nodes labeled with dedicated=memory.
  • Storage Pods: Scheduled only on nodes labeled with dedicated=storage.

This approach places each pod on the nodes best suited for its specific resource requirements.

This is great, you might think; this kind of solves the problem.

Look at the diagram and inspect what's happening:

[Diagram: node affinity only - some-other-pod is scheduled on node2]

Unfortunately, "some-other-pod" is getting scheduled on node2, which is not what we expected.

The reason is that while this method ensures that specific pods are directed to the appropriate nodes, it does not prevent other pods from being scheduled on these specialized nodes. As a result, it does not fully solve the problem of exclusive node usage for specific workloads.
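
For context, such an unconstrained pod might look like the minimal sketch below: it has no nodeSelector and no affinity, and since the nodes are not tainted in this attempt, nothing keeps it off node1, node2, or node3. The nginx image is just a placeholder:

apiVersion: v1
kind: Pod
metadata:
  name: some-other-pod
spec:
  containers:
  - name: app
    # Placeholder image; nothing in this spec restricts which node the pod lands on
    image: nginx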

❌ Using node affinity alone didn't solve our challenge.

Attempt 3 - Combined Use of Taints & Tolerations, and Node Affinity

In this last attempt, we will combine everything we have to solve our challenge.

Label the nodes

To be able to use node affinity, we need to label our nodes:

kubectl label nodes node1 dedicated=compute
kubectl label nodes node2 dedicated=memory
kubectl label nodes node3 dedicated=storage

Add Taint to the nodes

Since we also use the taints and tolerations, we'll taint the nodes:

kubectl taint nodes node1 dedicated=compute:NoSchedule
kubectl taint nodes node2 dedicated=memory:NoSchedule
kubectl taint nodes node3 dedicated=storage:NoSchedule

Pods without the necessary tolerations will not be scheduled on these nodes.

Add tolerations to the pods

To tolerate the taints, the pods need to have tolerations.

compute-pod toleration config:

apiVersion: v1
kind: Pod
metadata:
  name: compute-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "compute"
    effect: "NoSchedule"

memory-pod toleration config:

apiVersion: v1
kind: Pod
metadata:
  name: memory-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "memory"
    effect: "NoSchedule"

storage-pod toleration config:

apiVersion: v1
kind: Pod
metadata:
  name: storage-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "storage"
    effect: "NoSchedule"

Add node affinity rules

Lastly, we will add affinity rules to our pods. We will keep the toleration from the previous step.

compute-pod:

apiVersion: v1
kind: Pod
metadata:
  name: compute-pod
spec:
  containers:
  - name: compute-container
    image: compute-container-image
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "compute"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - compute

memory-pod:

apiVersion: v1
kind: Pod
metadata:
  name: memory-pod
spec:
  containers:
  - name: memory-container
    image: memory-container-image
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "memory"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - memory

storage-pod:

apiVersion: v1
kind: Pod
metadata:
  name: storage-pod
spec:
  containers:
  - name: storage-container
    image: storage-container-image
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "storage"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - storage

What have we achieved here?

By combining all three techniques, we actually achieve what we want.

Inspect the diagram below:

[Diagram: combining taints, tolerations and node affinity]

By combining taints, tolerations, and node affinity, we've ensured that:

  • Only the specified pods are scheduled on the optimized nodes.
  • No other pods are scheduled on these nodes.
  • The scheduler is guided to place the pods on the most suitable nodes based on their requirements.

✅ Great, that’s exactly what we wanted. Combining taints, tolerations, and node affinity rules ensured that the pods were scheduled correctly.
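
If you want to convince yourself, a quick check could look like the following (filenames are placeholders). The NODE column should show compute-pod on node1, memory-pod on node2 and storage-pod on node3, while pods without tolerations end up elsewhere:

kubectl apply -f compute-pod.yaml -f memory-pod.yaml -f storage-pod.yaml
kubectl get pods -o wide
# List what actually landed on an optimized node, e.g. node1
kubectl get pods --field-selector spec.nodeName=node1 -o wide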

Conclusion

Combining taints, tolerations, and node affinity in Kubernetes provides fine-grained control over pod placement, ensuring that workloads are distributed based on the requirements of each pod and node. With the implementation covered here, we can ensure that critical applications receive the resources they need.
