
Kubernetes Scheduling - Learn Affinity & Anti-Affinity

In a Kubernetes cluster, there is a process called kube-scheduler which is responsible for matching a Pod with a Node. It determines which Node is the best fit by scoring the candidates on things like resource requests and limits (set in the Pod spec), node constraints, affinity and anti-affinity rules, taints and tolerations, and so on, and picking the highest combined score.

The node that gets the highest score will be selected and the Pod will be bound to it. Next, the kubelet agent running on that node will make sure that the Pod gets up and running.

Usually, this is completely transparent to users and you don't need to make any effort to control this process. Behind the scenes, an intensive computation goes on in the cluster to determine where each Pod should be scheduled.
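If you're curious, you can see the scheduler's decision for a given Pod in its events (the Pod name below is a placeholder):

# the Events section at the bottom shows a "Scheduled" event
# with the node the Pod was assigned to
kubectl describe pod <pod-name>

# or filter scheduling events directly
kubectl get events --field-selector reason=Scheduled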

There are use cases when you need to control where a Pod should be scheduled.

For instance, you may want the Pod to get scheduled on a specific node type. 

Other requirements could be that a Pod needs a specific storage type, to run in a particular availability zone, to be placed on a GPU node, to run alongside another application that runs in another pod on a specific node, etc.

In this article, we will learn how we can control the scheduling of Pods onto different nodes based on different criteria.

Requirements

This article assumes that you have a running cluster. It won't instruct you on how to set up a Kubernetes cluster. 

In the examples below, a Linode Kubernetes Engine (LKE) cluster is being used, but you can run any cluster you want. It can be a local one like minikube, k3d, or Minishift, but you can also run it in the public cloud using managed Kubernetes services like EKS, GKE, AKS, or, as in my case, LKE. The most important thing is that you have full access to the cluster so you can run kubectl commands.

The article won't teach you the fundamentals of the kubectl CLI, but even with limited knowledge of kubectl, you should be fine just following along.
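To verify that you have working access before continuing, you can run something like:

# show which cluster the current kubectl context points at
kubectl cluster-info

# confirm you can reach the API server and list the worker nodes
kubectl get nodes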

Labels and Selectors

Before going further, let's learn about labels and selectors.

Labels in Kubernetes are key-value pairs and are very handy for grouping and managing resources.
With labels, you basically add metadata to objects and resources. If you work as a Kubernetes administrator, this can make your life easier, since you will be able to group and manage your resources more effectively.

Here are some examples of labels that you could define:

  • owner=snoopy
  • env=dev
  • env=prod
  • disktype=ssd
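In a manifest, labels like these go under metadata.labels. Here is a minimal sketch of a Pod carrying them (the Pod name and image are just placeholders for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod        # hypothetical name
  labels:
    owner: snoopy
    env: dev
    disktype: ssd
spec:
  containers:
  - name: app
    image: nginx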

The following list shows the requirements for label keys and values:

  • must be 63 characters or less
  • values can be empty (keys cannot)
  • if not empty, must begin and end with an alphanumeric character ([a-z0-9A-Z])
  • can contain:
    • dashes (-)
    • underscores (_)
    • dots (.)
    • alphanumerics
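As a quick illustration of these rules (the node name is a placeholder), here are a plain key, a DNS-prefixed key, and a label that would be rejected:

# valid: plain key and value
kubectl label node <node-name> disktype=ssd

# valid: keys may also carry an optional DNS-style prefix before a slash
kubectl label node <node-name> example.com/env=dev

# invalid: spaces are not allowed, the request will be rejected
# kubectl label node <node-name> 'my env=dev'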

You can list objects in Kubernetes and their labels. In this example, we will list nodes and corresponding labels (there are many labels assigned to the nodes by default):

$ kubectl get nodes --show-labels
NAME                           STATUS   ROLES    AGE   VERSION   LABELS
lke68498-106277-62f385c4572c   Ready    <none>   51m   v1.23.6   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c4572c,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
lke68498-106277-62f385c4b3c0   Ready    <none>   51m   v1.23.6   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c4b3c0,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
lke68498-106277-62f385c51015   Ready    <none>   50m   v1.23.6   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c51015,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central

You can also display specific label keys as columns for the listed nodes (you can add multiple keys):

$ kubectl get nodes -L=env,az


NAME                           STATUS   ROLES    AGE   VERSION   ENV    AZ
lke68498-106277-62f385c4572c   Ready    <none>   22h   v1.23.6   prod   az-1
lke68498-106277-62f385c4b3c0   Ready    <none>   22h   v1.23.6   dev    az-2
lke68498-106277-62f385c51015   Ready    <none>   22h   v1.23.6          az-2

Selectors allow you to filter objects based on the labels assigned to them. Many third-party operators use label selectors to react to changes at the object level, and Kubernetes itself uses selectors internally to filter objects and resources. As a Kubernetes administrator, you can do the same to structure your resources and objects in a sensible way; for instance, your automation can make changes to objects based on their labels.
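kubectl supports both equality-based and set-based selectors. The values below simply reuse the example labels from earlier:

# equality-based: objects whose env label equals dev
kubectl get pods --selector env=dev

# inequality: everything except prod
kubectl get pods --selector env!=prod

# set-based: env is either dev or prod
kubectl get pods -l 'env in (dev,prod)'

# existence: any node that has an owner label at all
kubectl get nodes -l owner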

Labels and selectors can be used with objects like:

  • Pods
  • Deployments
  • Nodes
  • Services
  • Secrets
  • Ingress Resource
  • Namespaces

In the following example, we have a matchLabels selector that holds a map of key-value pairs (the Deployment spec is truncated to the relevant part).

These key-value pairs can then be used when you write matchExpressions in rules, which we will learn about soon.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: cache

Example of matchExpressions, where we're specifying a podAntiAffinity rule based on the app: cache key-value pair (note that a topologyKey is mandatory for pod affinity and anti-affinity terms):

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - cache
        topologyKey: kubernetes.io/hostname
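The two snippets above are fragments of the same kind of Deployment. Putting them together, a complete sketch could look like the following; it spreads the cache Pods so that no two replicas land on the same node. The replica count and the Redis image are assumptions for illustration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  replicas: 3                     # assumption: one cache Pod per node
  selector:
    matchLabels:
      app: cache
  template:
    metadata:
      labels:
        app: cache                # must match the selector above
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - cache
            topologyKey: kubernetes.io/hostname   # "no two on the same node"
      containers:
      - name: redis
        image: redis:alpine

With a required anti-affinity rule like this, any replica that cannot get a node of its own stays Pending, which is one plausible explanation for the Pending redis-cache Pod in the output below.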

You could also write simple queries using the selector flag:

 $ kubectl get pods -o wide --selector app=cache

NAME                           READY   STATUS    RESTARTS   AGE   IP          NODE                           NOMINATED NODE   READINESS GATES
redis-cache-57885c55ff-4br9m   0/1     Pending   0          11h   <none>      <none>                         <none>           <none>
redis-cache-57885c55ff-xcg9g   1/1     Running   0          11h   10.2.0.14   lke68498-106277-62f385c4b3c0   <none>           <none>
redis-cache-57885c55ff-xv69b   1/1     Running   0          11h   10.2.2.11   lke68498-106277-62f385c51015   <none>           <none>

Another way is to list the pods with app shown as a column, selecting only those where it is set to cache:

kubectl get pods \
  --label-columns=app \
  --selector=app=cache

NodeSelector

This is the simplest way to control where a Pod will be scheduled. Let's have a look at a real-life scenario.

Use Case - nodeSelector - based on node labels

We need to have a pod scheduled on a node that has a particular label. First, let's imagine that you have a big cluster where you run both dev and production workloads. In this scenario, we have nodes that only host dev-related workloads.

Those nodes have fewer resources (CPU and Memory) available compared to the prod nodes.

The labels can look like the following:

  • env=dev
  • env=prod

The following diagram illustrates the placement of a Pod based on a label belonging to the worker node named n-1:


Returning to the use case scenario:

To fulfill the requirement, we need to add an appropriate label to the node(s). We will then apply a Pod manifest to our cluster that specifies a nodeSelector with env: dev.

First, let's find out which worker nodes are running in our cluster:

$ kubectl get nodes
NAME                           STATUS   ROLES    AGE   VERSION
lke68498-106277-62f385c4572c   Ready    <none>   33m   v1.23.6
lke68498-106277-62f385c4b3c0   Ready    <none>   34m   v1.23.6
lke68498-106277-62f385c51015   Ready    <none>   33m   v1.23.6

If you run kubectl get nodes --show-labels you will get all nodes with all the labels assigned to them.

Example - list the nodes and labels:

$ kubectl get nodes --show-labels
NAME                           STATUS   ROLES    AGE   VERSION   LABELS
lke68498-106277-62f385c4572c   Ready    <none>   51m   v1.23.6   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c4572c,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
lke68498-106277-62f385c4b3c0   Ready    <none>   51m   v1.23.6   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c4b3c0,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
lke68498-106277-62f385c51015   Ready    <none>   50m   v1.23.6   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c51015,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central

Let's imagine that the node lke68498-106277-62f385c51015 belongs to the dev environment.

It makes sense, then, to add a label like env=dev:

# add the label to the node
kubectl label nodes lke68498-106277-62f385c51015 env=dev

We can list the nodes that have the env=dev label with the selector parameter (we only have one):

$ kubectl get nodes --selector env=dev

Output:
NAME                           STATUS   ROLES    AGE   VERSION
lke68498-106277-62f385c51015   Ready    <none>   56m   v1.23.6

We'll deploy a Pod on the fly by applying the following Pod manifest:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: dev
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    env: dev
EOF

❗When specifying labels in your manifests, you need to write those in the following format: key: value


The result:

$ kubectl get pods -o wide

NAME    READY   STATUS    RESTARTS   AGE   IP         NODE                           NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          84s   10.2.2.2   lke68498-106277-62f385c51015   <none>           <none>

As we can see, the pod got scheduled on the node: lke68498-106277-62f385c51015

This was the simplest way to schedule a Pod on a specific node. In the following section, we will learn about affinity and anti-affinity rules.

Affinity and anti-affinity

Affinity and anti-affinity rules provide even more control than nodeSelector, which is the simplest way of controlling the assignment of a Pod to a node.

The following are characteristics of affinity rules:

  • you can write more expressive rules
  • you can express preferences (soft rules) instead of hard requirements
  • you can base rules on the labels of other Pods (not just node labels)

There are two types of affinity:

  1. Node affinity/anti-affinity
  2. Pod affinity/anti-affinity

Node affinity/anti-affinity

Like nodeSelector, node affinity is a set of rules the scheduler follows to decide where a Pod should be scheduled.

It's based on labels applied to the nodes. Now, the question is: how is this different from the nodeSelector approach that we used in the previous example?

Consider a Pod spec with an affinity rule that requires the Pod to be scheduled on a node that has the env=dev key-value pair.

If we look at our cluster, we can see that only two nodes are likely to become candidates.

Now, in the Pod spec, it's also specified that the preferred node instance type is t2.medium (picked from the AWS EC2 instance type list for this simple example).

The following diagram illustrates the scenario described. If the worker node called n-2 has enough resources and no constraints, the Pod will most likely be scheduled on this node.

An important thing to remember is that it's not guaranteed the kube-scheduler will place the Pod on node n-2; the Pod could actually end up on the worker node called n-1.

Worker node n-3 will never be considered as an alternative to host our Pod, since it doesn't match the env=dev label in our matchExpressions.
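A sketch of what such a Pod spec could look like is shown below. The Pod name is hypothetical, and the well-known node.kubernetes.io/instance-type label is used here as an assumed way to express the instance-type preference (on cloud-provisioned nodes it typically carries the instance type):

apiVersion: v1
kind: Pod
metadata:
  name: example-combined-affinity    # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # hard requirement: only nodes labeled env=dev are candidates
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env
            operator: In
            values:
            - dev
      # soft preference: favor t2.medium nodes if any are available
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - t2.medium
  containers:
  - name: nginx
    image: nginx

The required block narrows the candidates to env=dev nodes, while the weighted preferred block only biases the scoring.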


Types of rules

  1. required <requiredDuringSchedulingIgnoredDuringExecution>: This rule will be enforced. If the rule cannot be met, the Pod will NOT be scheduled.
  2. preferred <preferredDuringSchedulingIgnoredDuringExecution>: the scheduler will try to apply the rule, but it's not guaranteed and the Pod WILL BE scheduled on another node if the rule cannot be enforced. 

The required rule is the so-called hard rule and the condition must be met before the Pod gets scheduled.

The preferred rule is the so-called soft rule: the scheduler will honor it if possible, but this is not guaranteed. If no node can meet the preference, the Pod will be scheduled on another node.

To understand the preferred rule better, imagine that you want to schedule a Pod on a less expensive node type. The application running in that Pod doesn't require extensive resources in terms of CPU and memory, but it has to run somewhere, so you may add a rule saying that the preferred node type is a spot instance or something very cheap. If the preferred node type is not available, the Pod can be scheduled on any other node type. So the scheduler will never guarantee that the Pod gets scheduled on your preferred choice, but that can be totally fine. By doing so, the chances are greater that the Pod gets scheduled fairly quickly.


❗Understand the word IgnoredDuringExecution in requiredDuringSchedulingIgnoredDuringExecution or preferredDuringSchedulingIgnoredDuringExecution

The Pod will continue to run on a node, even if the rule is not valid anymore. The Pod won't be affected.

For example, if you remove a label from a node where the Pod is running, the Pod will still continue to run on that node for the rest of its lifetime.
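You can try this yourself once a Pod with an affinity rule is running (the node name is a placeholder):

# remove the label the affinity rule relies on (note the trailing dash)
kubectl label nodes <node-name> env-

# the already-running Pod is not evicted; it stays on its node
kubectl get pods -o wide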


Use Case - nodeAffinity - required rule - <requiredDuringSchedulingIgnoredDuringExecution>

In big Kubernetes cluster environments, it's pretty common that the cluster worker nodes span over several regions and availability zones. In our imaginary environment, our cluster spans over three availability zones.

We have a web app that needs to be placed close to a Postgres DB that is running in Availability Zone 1. The reason is to minimize network latency that can impact customer satisfaction.

Each node in our cluster has a label that tells which availability zone it is running in. For example:

  • node n-1: az=az-1
  • node n-2: az=az-2
  • node n-3: az=az-3

In this use case, the requirement is to schedule a Pod on a node located in a specific availability zone, i.e. a node labeled az=az-1.


We can fake this a little bit, since not every cloud provider (and especially not a local cluster) has availability zones or regions. For that reason, we will create our own labels specifying different availability zones.

We will deploy an Nginx Pod to our cluster with a required affinity rule that fulfills the use case's requirement.

In my cluster, I have the following nodes:

$ kubectl get nodes

NAME                           STATUS   ROLES    AGE    VERSION
lke68498-106277-62f385c4572c   Ready    <none>   127m   v1.23.6
lke68498-106277-62f385c4b3c0   Ready    <none>   128m   v1.23.6
lke68498-106277-62f385c51015   Ready    <none>   126m   v1.23.6

We will assign a label to a node. We can do that with the following command:

$ kubectl label nodes lke68498-106277-62f385c4572c az=az-1

Output:
node/lke68498-106277-62f385c4572c labeled

We can now check whether the node(s) have a specific label by using the selector flag:

$ kubectl get nodes --selector az=az-1

NAME                           STATUS   ROLES    AGE    VERSION
lke68498-106277-62f385c4572c   Ready    <none>   130m   v1.23.6
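Optionally, you can also label the remaining nodes with zones; the values below match the AZ column shown earlier in the kubectl get nodes -L=env,az output:

kubectl label nodes lke68498-106277-62f385c4b3c0 az=az-2
kubectl label nodes lke68498-106277-62f385c51015 az=az-2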

We will define and apply a Pod manifest, but before we do that, we will have a look at the affinity rule that will be part of the manifest.

The affinity definition:

spec:
  affinity:
    nodeAffinity: # node affinity scheduling rules for the Pod (required and/or preferred)
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az # key is the node label key the rule matches against
            operator: In # operator is the comparison to perform (In, NotIn, Exists, DoesNotExist, Gt, Lt)
            values:
            - az-1 # values is the list of values to compare with the node's label value

This is a rule of type <required>, so the key-value pair az=az-1 is required to be found on a node before the scheduler is able to schedule a Pod that has this rule in its spec.

Now let's deploy our Nginx app with an affinity rule of type <required>:

cat <<EoF | kubectl apply -f - 
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az
            operator: In
            values:
            - az-1
  containers:
  - name: example-node-affinity
    image: nginx
EoF

We expect the Nginx Pod to get scheduled on the node lke68498-106277-62f385c4572c since that node holds the label az=az-1.

With the following command, we will get the status of our Pod. With the -o wide flag, we can also see the node where the Pod is running:

$ kubectl get pod example-node-affinity -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP         NODE                           NOMINATED NODE   READINESS GATES
example-node-affinity   1/1     Running   0          86s   10.2.1.2   lke68498-106277-62f385c4572c   <none>           <none>

Great, our Pod got scheduled to the correct node according to the required rule that we have defined. This was not much different from the nodeSelector case.

But let's flip this case over a little bit and have a look at the following use case.

Clear Out

Let's delete the az=az-1 label from the node lke68498-106277-62f385c4572c:

$ kubectl label nodes lke68498-106277-62f385c4572c az-

❗A tip - you can delete labels from a node with a dash (-)

e.g. kubectl label nodes lke68498-106277-62f385c4572c az-


Use Case - nodeAffinity - preferred rule - <preferredDuringSchedulingIgnoredDuringExecution>

This slightly different use case will be similar to the previous one, with some small but significant differences.

Our application should preferably be running in a zone called az-1 since that zone also hosts a Postgres DB and the latency should be minimal.

If the nodes in the az-1 availability zone are not available or have constraints of some sort, the Pod should still run on any other node, regardless of which availability zone that node is placed in.
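A minimal sketch of what such a preferred rule could look like, assuming the same az labels as before (the Pod name is hypothetical and the final manifest may differ):

cat <<EoF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity-preferred
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100          # 1-100; a higher weight means a stronger preference
        preference:
          matchExpressions:
          - key: az
            operator: In
            values:
            - az-1
  containers:
  - name: example-node-affinity-preferred
    image: nginx
EoF

Unlike the required rule, this Pod will still be scheduled even if no node carries az=az-1; the rule only adds weight to matching nodes during scoring.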